linux-kernel - Re: [PATCH 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <diqzed1ssonz.fsf@ackerleytng-ctop.c.googlers.com>
Date: Fri, 27 Dec 2024 23:15:44 +0000
From: Ackerley Tng <ackerleytng@...gle.com>
To: Ackerley Tng <ackerleytng@...gle.com>
Cc: peterx@...hat.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org, 
	riel@...riel.com, leitao@...ian.org, akpm@...ux-foundation.org, 
	muchun.song@...ux.dev, osalvador@...e.de, roman.gushchin@...ux.dev, 
	nao.horiguchi@...il.com, stable@...r.kernel.org
Subject: Re: [PATCH 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio
 from subpool

Ackerley Tng <ackerleytng@...gle.com> writes:

> <snip>
>
> I'll go over the rest of your patches and dig into the meaning of `avoid_reserve`.

Yes, after looking into this more deeply, I agree that avoid_reserve
means avoiding the reservations in the resv_map rather than reservations
in the subpool or hstate.

Here's more detail of what's going on in the reproducer that I wrote as I
reviewed Peter's patch:

1. On fallocate(), allocate page A
2. On mmap(), set up a vma without VM_MAYSHARE since MAP_PRIVATE was requested
3. On faulting *buf = 1, allocate a new page B, copy A to B because the mmap request was MAP_PRIVATE
4. On fork, prep for COW by marking page as read only. Both parent and child share B.
5. On faulting *buf = 2 (write fault), allocate page C, copy B to C
    + B belongs to the child, C belongs to the parent
    + C is owned by the parent
6. Child exits, B is freed
7. On munmap(), C is freed
8. On unlink(), A is freed

When C was allocated in the parent (owns MAP_PRIVATE page, doing a copy on
write), spool->rsv_hpages was decreased but h->resv_huge_pages was not. This is
the root of the bug.

We should decrement h->resv_huge_pages if a reserved page from the subpool was
used, instead of whether avoid_reserve or vma_has_reserves() is set. If
avoid_reserve is set, the subpool shouldn't be checked for a reservation, so we
won't be decrementing h->resv_huge_pages anyway.

I agree with Peter's fix as a whole (the entire patch series).

Reviewed-by: Ackerley Tng <ackerleytng@...gle.com>
Tested-by: Ackerley Tng <ackerleytng@...gle.com>

---

Some definitions which might be helpful:

+ h->resv_huge_pages indicates number of reserved pages globally.
    + This number increases when pages are reserved
    + This number decreases when reserved pages are allocated, or when pages are unreserved
+ spool->rsv_hpages indicates number of reserved pages in this subpool.
    + This number increases when pages are reserved
    + This number decreases when reserved pages are allocated, or when pages are unreserved
+ h->resv_huge_pages should be the sum of all subpools' spool->rsv_hpages.

More details on the flow in alloc_hugetlb_folio() which might be helpful:

hugepage_subpool_get_pages() returns "the number of pages by which the global
pools must be adjusted (upward)". This return value is never negative other than
errors. (hugepage_subpool_get_pages() always gets called with a positive delta).

Specifically in alloc_hugetlb_folio(), the return value is either 0 or 1 (other
than errors).

If the return value is 0, the subpool had enough reservations and so we should
decrement h->resv_huge_pages.

If the return value is 1, it means that this subpool did not have any more
reserved hugepages, and we need to get a page from the global
hstate. dequeue_hugetlb_folio_vma() will get us a page that was already
allocated.

In dequeue_hugetlb_folio_vma(), if the vma doesn't have enough reserves for 1
page, and there are no available_huge_pages() left, we quit dequeueing since we
will need to allocate a new page. If we want to avoid_reserve, that means we
don't want to use the vma's reserves in resv_map, we also check
available_huge_pages(). If there are available_huge_pages(), we go on to dequeue
a page.

Then, we determine whether to decrement h->resv_huge_pages. We should decrement
if a reserved page from the subpool was used, instead of whether avoid_reserve
or vma_has_reserves() is set.

In the case where a surplus page needs to be allocated, the surplus page isn't
and doesn't need to be associated with a subpool, so no subpool hugepage number
tracking updates are required. h->resv_huge_pages still has to be updated... is
this where h->resv_huge_pages can go negative?