linux-kernel - Re: [bug report] mm/hugetlb: various bugs with avoid_reserve case in alloc_huge

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Yv/gUCt+ayp/KSoO@monkey>
Date:   Fri, 19 Aug 2022 12:11:12 -0700
From:   Mike Kravetz <mike.kravetz@...cle.com>
To:     Miaohe Lin <linmiaohe@...wei.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Muchun Song <songmuchun@...edance.com>,
        Linux-MM <linux-mm@...ck.org>,
        linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: [bug report] mm/hugetlb: various bugs with avoid_reserve case in
 alloc_huge_page()

On 08/19/22 15:20, Miaohe Lin wrote:
> On 2022/8/19 6:43, Mike Kravetz wrote:
> > On 08/17/22 16:31, Miaohe Lin wrote:
> >> Hi all:
> >>     When I investigate the mm/hugetlb.c code again, I found there are a few possible issues
> >> with avoid_reserve case. (It's really hard to follow the relevant code for me.) Please take
> >> a look at the below analysis:
> > 
> > Thank you for taking a close look at this code!
> > 
> > I agree that the code is hard to follow.  I have spent many hours/days/weeks
> > chasing down the cause of incorrect reservation counts.  I imagine there could
> > be more issues, especially when you add the uncommon avoid_reserve and
> > MAP_NORESERVE processing.
> 
> Many thanks for your time and reply, Mike!

Well, hugetlb reservations interrupted my sleep again :)  See below.

> > 
> >> 1.avoid_reserve issue with h->resv_huge_pages in alloc_huge_page.
> > 
> > Did you actually see this issue, or is it just based on code inspection?
> 
> No, it's based on code inspection. ;)
> 
> > I tried to recreate, but could not.  When looking closer, this may not
> > even be possible.
> > 
> >>     Assume:
> >> 	h->free_huge_pages 60
> >> 	h->resv_huge_pages 30
> >> 	spool->rsv_hpages  30
> > 
> > OK.
> > 
> >>
> >>     When avoid_reserve is true, after alloc_huge_page(), we will have:
> > 
> > Take a close look at the calling paths for alloc_huge_page when avoid_reserve
> > is true.  There are only two such call paths.
> > 1) copy_hugetlb_page_range - We allocate pages in the 'early COW' processing.
> >    In such cases, the pages are private and not associated with a file, or
> >    filesystem or subpool (spool).  Therefore, there should be no spool
> >    modifications.
> 
> Agree.
> 
> > 2) hugetlb_wp (formerly called hugetlb_cow) - Again, we are allocating a
> >    private page and should not be modifying spool.
> 
> Agree.
> 
> > 
> > If the above is correct, then we will not modify spool->rsv_hpages which
> > leads to the inconsistent results.
> 
> I missed to verify whether spool will be modified in avoid_reserve case. Sorry about that.
> 

That is how it SHOULD work.  However, there is a problem with a MAP_PRIVATE
mapping of a hugetlb file.  In this case, subpool_vma will return the
subpool associated with the file, and we will end up with a leaked reservation
as in your example.  I have verified with test code.

The first thought I had is that something like the following should be added.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 474bfbe9929e..5aa19574e890 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -254,7 +258,9 @@ static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
 
 static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
 {
-	return subpool_inode(file_inode(vma->vm_file));
+	if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
+		return subpool_inode(file_inode(vma->vm_file));
+	return NULL;		/* no subpool for private mappings */
 }
 
 /* Helper that removes a struct file_region from the resv_map cache and returns


That certainly addresses the MAP_PRIVATE mapping of a hugetlb file issue.
I will collect up patches for issues we discover and submit together.

> > It is confusing that MAP_NORESERVE does not imply avoid_reserve will be
> > passed to alloc_huge_page.
> 
> It's introduced to guarantee that COW faults for a process that called mmap(MAP_PRIVATE) will succeed via commit
> 04f2cbe35699 ("hugetlb: guarantee that COW faults for a process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed").
> It seems it has nothing to do with MAP_NORESERVE.
> 
> > 
> >> 	spool->rsv_hpages  29 /* hugepage_subpool_get_pages decreases it. */
> >> 	h->free_huge_pages 59
> >> 	h->resv_huge_pages 30 /* rsv_hpages is used, but *h->resv_huge_pages is not modified accordingly*. */
> >>
> >>     If the hugetlb page is freed later, we will have:
> >> 	spool->rsv_hpages  30 /* hugepage_subpool_put_pages increases it. */
> >> 	h->free_huge_pages 60
> >> 	h->resv_huge_pages 31 /* *increased wrongly* due to hugepage_subpool_put_pages(spool, 1) == 0. */
> >> 			   ^^
> >>
> > 
> > I'll take a closer look at 2 and 3 when we determine if 1 is a possible
> > issue or not.
> 
> I want to propose removing the avoid_reserve code. When called from above case 1) or 2), vma_needs_reservation()
> will always return 1 as there's no reservation for it. Also hugepage_subpool_get_pages() will always return 1 as
> it's not associated with a spool. So when avoid_reserve == true, map_chg and gbl_chg must be 1 and vma_has_reserves()
> will always return "false". As a result, passing in avoid_reserve == true will do nothing in fact. So it can be simply
> removed. Or am I miss something again?

I will take a closer look.  But, at a high level if avoid_reserve == true and
all pages are reserved we must fail the allocation or attempt dynamic
allocation if overcommit is allowed.  So, it seems we at least need the
flag to make this decision.
-- 
Mike Kravetz

> 
> If avoid_reserve code can be removed, below issue 2 and 3 won't be possible as they rely on avoid_reserve doing its work.
> 
> Thanks!
> Miaohe Lin