Message-ID: <diqzed1ssonz.fsf@ackerleytng-ctop.c.googlers.com>
Date: Fri, 27 Dec 2024 23:15:44 +0000
From: Ackerley Tng <ackerleytng@...gle.com>
To: Ackerley Tng <ackerleytng@...gle.com>
Cc: peterx@...hat.com, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
riel@...riel.com, leitao@...ian.org, akpm@...ux-foundation.org,
muchun.song@...ux.dev, osalvador@...e.de, roman.gushchin@...ux.dev,
nao.horiguchi@...il.com, stable@...r.kernel.org
Subject: Re: [PATCH 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio
from subpool
Ackerley Tng <ackerleytng@...gle.com> writes:
> <snip>
>
> I'll go over the rest of your patches and dig into the meaning of `avoid_reserve`.
Yes, after looking into this more deeply, I agree that avoid_reserve
means avoiding the reservations in the resv_map rather than reservations
in the subpool or hstate.
Here's more detail on what's going on in the reproducer I wrote while
reviewing Peter's patch:
1. On fallocate(), allocate page A
2. On mmap(), set up a vma without VM_MAYSHARE since MAP_PRIVATE was requested
3. On faulting *buf = 1, allocate a new page B, copy A to B because the mmap request was MAP_PRIVATE
4. On fork, prep for COW by marking page as read only. Both parent and child share B.
5. On faulting *buf = 2 (write fault), allocate page C, copy B to C
+ B belongs to the child; C belongs to the parent
6. Child exits, B is freed
7. On munmap(), C is freed
8. On unlink(), A is freed
When C was allocated in the parent (owns MAP_PRIVATE page, doing a copy on
write), spool->rsv_hpages was decreased but h->resv_huge_pages was not. This is
the root of the bug.
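To make the imbalance concrete, here's a toy Python model of just the two
counters across step 5 (this is not kernel code; the starting values assume
the single page reserved by fallocate()):

```python
# Toy model of the accounting described above, not kernel code.
h_resv_huge_pages = 1  # one page reserved globally via fallocate()
spool_rsv_hpages = 1   # the same reservation, tracked in the subpool

# Step 5: the parent COW-allocates page C. The subpool reservation is
# consumed (hugepage_subpool_get_pages() returns 0)...
spool_rsv_hpages -= 1
# ...but the pre-fix code skips the global decrement because
# vma_has_reserves() is false for the parent after fork:
# h_resv_huge_pages -= 1   <- missing in the buggy flow

# The invariant "h->resv_huge_pages == sum of spool->rsv_hpages" breaks:
assert spool_rsv_hpages == 0
assert h_resv_huge_pages == 1  # off by one: the reservation leaked
```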
We should decrement h->resv_huge_pages based on whether a reserved page from
the subpool was used, not on whether avoid_reserve or vma_has_reserves() is
set. If avoid_reserve is set, the subpool isn't checked for a reservation, so
we won't be decrementing h->resv_huge_pages anyway.
I agree with Peter's fix as a whole (the entire patch series).
Reviewed-by: Ackerley Tng <ackerleytng@...gle.com>
Tested-by: Ackerley Tng <ackerleytng@...gle.com>
---
Some definitions which might be helpful:
+ h->resv_huge_pages indicates the number of reserved pages globally.
  + This number increases when pages are reserved.
  + This number decreases when reserved pages are allocated, or when pages are unreserved.
+ spool->rsv_hpages indicates the number of reserved pages in this subpool.
  + This number increases when pages are reserved.
  + This number decreases when reserved pages are allocated, or when pages are unreserved.
+ h->resv_huge_pages should be the sum of all subpools' spool->rsv_hpages.
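As a sanity check on these definitions, here's a toy Python model (not
kernel code) of the two counters and the stated invariant:

```python
# Toy model of the counter definitions above; not kernel code.
class Hstate:
    def __init__(self):
        self.resv_huge_pages = 0          # global reserved-page count

class Subpool:
    def __init__(self, hstate):
        self.hstate = hstate
        self.rsv_hpages = 0               # this subpool's reserved pages

    def reserve(self, n):                 # pages are reserved
        self.rsv_hpages += n
        self.hstate.resv_huge_pages += n

    def alloc_reserved(self):             # a reserved page is allocated
        assert self.rsv_hpages > 0
        self.rsv_hpages -= 1
        self.hstate.resv_huge_pages -= 1

h = Hstate()
a, b = Subpool(h), Subpool(h)
a.reserve(2)
b.reserve(1)
a.alloc_reserved()
# The invariant: the global count equals the sum over all subpools.
assert h.resv_huge_pages == a.rsv_hpages + b.rsv_hpages == 2
```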
More details on the flow in alloc_hugetlb_folio() which might be helpful:
hugepage_subpool_get_pages() returns "the number of pages by which the global
pools must be adjusted (upward)". Other than on error, this return value is
never negative (hugepage_subpool_get_pages() is always called with a positive
delta). Specifically, in alloc_hugetlb_folio(), the return value is either 0
or 1 (barring errors).
If the return value is 0, the subpool had enough reservations and so we should
decrement h->resv_huge_pages.
If the return value is 1, it means that this subpool did not have any more
reserved hugepages, and we need to get a page from the global
hstate. dequeue_hugetlb_folio_vma() will get us a page that was already
allocated.
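The 0-or-1 convention can be sketched as a toy Python model (not kernel
code; error paths are omitted, and delta is fixed at 1 as in
alloc_hugetlb_folio()):

```python
# Toy model of the hugepage_subpool_get_pages() return-value convention
# described above; not kernel code, error paths omitted.
def subpool_get_pages(spool_rsv_hpages, delta=1):
    """Return (pages the global pools must be adjusted upward by,
    remaining subpool reservations)."""
    if spool_rsv_hpages >= delta:
        # The subpool had a reservation covering this allocation.
        return 0, spool_rsv_hpages - delta
    # No subpool reservation left; the caller must take delta pages
    # from the global hstate.
    return delta, spool_rsv_hpages

gbl, left = subpool_get_pages(spool_rsv_hpages=1)
assert (gbl, left) == (0, 0)  # reservation consumed: decrement h->resv_huge_pages
gbl, left = subpool_get_pages(spool_rsv_hpages=0)
assert (gbl, left) == (1, 0)  # must dequeue/allocate from the global hstate
```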
In dequeue_hugetlb_folio_vma(), if the vma doesn't have enough reserves for 1
page and there are no available_huge_pages() left, we quit dequeueing, since
we will need to allocate a new page. If avoid_reserve is set, meaning we don't
want to use the vma's reserves in the resv_map, we also check
available_huge_pages(). If there are available huge pages, we go on to dequeue
a page.
Then, we determine whether to decrement h->resv_huge_pages. We should
decrement based on whether a reserved page from the subpool was used, not on
whether avoid_reserve or vma_has_reserves() is set.
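The rule argued for here boils down to keying the decrement off the
subpool's answer. A toy Python sketch (not the actual patch; gbl_chg names
hugepage_subpool_get_pages()'s return value as in alloc_hugetlb_folio()):

```python
# Sketch of the proposed decrement rule; not the actual patch.
def should_decrement_resv(gbl_chg):
    # gbl_chg is hugepage_subpool_get_pages()'s return value:
    # 0 means a subpool reservation was consumed for this page.
    return gbl_chg == 0

assert should_decrement_resv(0) is True   # subpool reservation used
assert should_decrement_resv(1) is False  # page came from the global pool
```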
In the case where a surplus page needs to be allocated, the surplus page isn't
(and doesn't need to be) associated with a subpool, so no subpool page-count
updates are required. h->resv_huge_pages still has to be updated... is this
where h->resv_huge_pages can go negative?