Message-ID: <Z4asMCC3Jf4qZVo8@x1n>
Date: Tue, 14 Jan 2025 13:25:52 -0500
From: Peter Xu <peterx@...hat.com>
To: Ackerley Tng <ackerleytng@...gle.com>
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org, leitao@...ian.org,
riel@...riel.com, muchun.song@...ux.dev, nao.horiguchi@...il.com,
roman.gushchin@...ux.dev, akpm@...ux-foundation.org,
osalvador@...e.de
Subject: Re: [PATCH v2 4/7] mm/hugetlb: Clean up map/global resv accounting
when allocate
On Mon, Jan 13, 2025 at 10:57:55PM +0000, Ackerley Tng wrote:
>
> Hi Peter,
Hi, Ackerley!
>
> I have an alternative that is based off your patches.
I'll try to comment mostly inline in your patch; that's easier for me.

Side note: needing such a detailed explanation before a patch is pasted is
not a good sign for the readability of the code. :) But still, thanks for
all these detailed explanations! It's definitely not you nor your patch to
blame at all. Frankly, I'm also not 100% satisfied with what I got in my
current patch; however, I must confess it's almost the best I can do so
far.. Especially since the goal for me is to pave the way for your gmem
work, I'm ok as long as we can drop as many vma references as possible in
this allocation path, which matters more to me, and when working on this
patch I was trying to make my next patch easier (which hopefully it does).
More on this at the end.
>
> I like the overall refactoring but was a little uncomfortable with
> having to add a custom enum map_chg_state. The custom enum centralizes
> the possible states, but the states still need to be interpreted again
> throughout the function to take action (e.g. having checks like if
> (map_chg != MAP_CHG_ENFORCED) and if (map_chg == MAP_CHG_NEEDED)), and I
> feel that shifts the problem later to understanding the interpretation
> of the states.
>
> I switched back to something close to avoid_reserve, but improved on
> that name with your comments, by calling it bypass_vma_reservations. I
> feel that calling it cow_from_owner kind of lets the implementation
> detail of the CoW use-case bleed into this function.
>
> "bypass_vma_reservations" is named so that it requests this function,
> alloc_hugetlb_folio(), to bypass the resv_map. All parts of
> alloc_hugetlb_folio() that update the resv_map are guarded by the
> bypass_vma_reservations flag.
>
> This alternative proposes to use the following booleans local to
> alloc_hugetlb_folio().
>
> 1. vma_reservation_exists
>
> vma_reservation_exists represents whether a reservation exists in the
> resv_map and is used. vma_reservation_exists defaults to false, and when
> bypass_vma_reservations is not requested, the resv_map will be consulted
> to see if vma_reservation_exists.
>
> vma_reservation_exists is also used to store the result of the initial
> resv_map check, to correctly fix up a race later on.
>
> 2. debit_subpool
>
> If a vma_reservation_exists, then this allocation has been reserved in
> both the resv_map and subpool, so skip debiting the subpool.
>
> If alloc_hugetlb_folio() was requested to bypass_vma_reservations, the
> subpool should still be charged, so debit_subpool is set.
>
> And if debit_subpool, then proceed to do hugepage_subpool_get_pages()
> and set up subpool_reservation_exists.
>
> Later on, debit_subpool is used for the matching cleanup in the error
> case.
>
> 3. subpool_reservation_exists
>
> subpool_reservation_exists represents whether a reservation exists in
> the subpool and is used, analogous to vma_reservation_exists, and the
> subpool is only checked if debit_subpool is set.
>
> If debit_subpool is not set, vma_reservation_exists determines whether a
> subpool_reservation_exists.
>
> subpool_reservation_exists is then used to guard decrementing
> h->resv_huge_pages, which fixes the bug you found.
>
> 4. charge_cgroup_rsvd
>
> This has the same condition as debit_subpool, but has a separate
> variable for readability.
>
> Later on, charge_cgroup_rsvd is used to determine whether to commit the
> charge, or whether to fix up the charge in error cases.
>
>
> I also refactored dequeue_hugetlb_folio_vma() to
> dequeue_hugetlb_folio_with_mpol() to align with
> alloc_buddy_hugetlb_folio_with_mpol() to avoid passing gbl_chg into the
> function.
>
> gbl_chg is interpreted in alloc_hugetlb_folio() instead. If
> subpool_reservation_exists, then try to get a folio by dequeueing it. If
> a subpool reservation does not exist, make sure there are available
> pages before dequeueing.
>
> If there was no folio from dequeueing for whatever reason, allocate a
> new folio.
>
> This could probably be a separate patch but I'd like to hear your
> thoughts before doing integration/cleaning up.
>
> These changes have been tested with your reproducer, and here's the test
> output from libhugetlbfs test cases:
>
> ********** TEST SUMMARY
> * 2M
> * 32-bit 64-bit
> * Total testcases: 82 85
> * Skipped: 9 9
> * PASS: 72 69
> * FAIL: 0 0
> * Killed by signal: 1 7
> * Bad configuration: 0 0
> * Expected FAIL: 0 0
> * Unexpected PASS: 0 0
> * Test not present: 0 0
> * Strange test result: 0 0
> **********
>
>
> Ackerley
>
>
> ---
> mm/hugetlb.c | 186 ++++++++++++++++++++++++++-------------------------
> 1 file changed, 94 insertions(+), 92 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6a0ea28f5bac..2cd588d35984 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1333,9 +1333,9 @@ static unsigned long available_huge_pages(struct hstate *h)
> return h->free_huge_pages - h->resv_huge_pages;
> }
>
> -static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
> - struct vm_area_struct *vma,
> - unsigned long address, long gbl_chg)
> +static struct folio *dequeue_hugetlb_folio_with_mpol(struct hstate *h,
Personally, the _with_mpol suffix implies a mempolicy* being passed in. In
this case I slightly prefer keeping the old name, which is shorter, and the
vma parameter is already a hint to me that the mempolicy will be respected.

I did notice we also have alloc_buddy_hugetlb_folio_with_mpol(), so IMHO we
could remove that _with_mpol instead, but I don't think that's a huge deal
as of now.
> + struct vm_area_struct *vma,
> + unsigned long address)
> {
> struct folio *folio = NULL;
> struct mempolicy *mpol;
> @@ -1343,13 +1343,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
> nodemask_t *nodemask;
> int nid;
>
> - /*
> - * gbl_chg==1 means the allocation requires a new page that was not
> - * reserved before. Making sure there's at least one free page.
> - */
> - if (gbl_chg && !available_huge_pages(h))
> - goto err;
Note: I remember that when I was working on this series I also thought
about something like this, but then I found it easier to leave it as is for
now. More below..
> -
> gfp_mask = htlb_alloc_mask(h);
> nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
>
> @@ -1367,9 +1360,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
>
> mpol_cond_put(mpol);
> return folio;
> -
> -err:
> - return NULL;
> }
>
> /*
> @@ -2943,91 +2933,83 @@ int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)
> return ret;
> }
>
> -typedef enum {
> - /*
> - * For either 0/1: we checked the per-vma resv map, and one resv
> - * count either can be reused (0), or an extra needed (1).
> - */
> - MAP_CHG_REUSE = 0,
> - MAP_CHG_NEEDED = 1,
> - /*
> - * Cannot use per-vma resv count can be used, hence a new resv
> - * count is enforced.
> - *
> - * NOTE: This is mostly identical to MAP_CHG_NEEDED, except
> - * that currently vma_needs_reservation() has an unwanted side
> - * effect to either use end() or commit() to complete the
> - * transaction. Hence it needs to differenciate from NEEDED.
> - */
> - MAP_CHG_ENFORCED = 2,
> -} map_chg_state;
> -
> /*
> - * NOTE! "cow_from_owner" represents a very hacky usage only used in CoW
> - * faults of hugetlb private mappings on top of a non-page-cache folio (in
> - * which case even if there's a private vma resv map it won't cover such
> - * allocation). New call sites should (probably) never set it to true!!
> - * When it's set, the allocation will bypass all vma level reservations.
> + * NOTE! "bypass_vma_reservations" represents a very niche usage, when CoW
> + * faults of hugetlb private mappings need to allocate a new page on top of a
> + * non-page-cache folio. In this situation, even if there's a private vma resv
> + * map, the resv map must be bypassed. New call sites should (probably) never
> + * set bypass_vma_reservations to true!!
> */
> struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> - unsigned long addr, bool cow_from_owner)
> + unsigned long addr, bool bypass_vma_reservations)
One thing to mention is that when I renamed it to cow_from_owner I was
explicitly trying to "leak" the implementation detail into this function.
The reason is that I wanted to leak it in a way that people will never use
it! :)

It's the same as if we used some weird variable name like:

  variable_you_never_want_to_set_unless_you_know_whats_happening

Otherwise it's not clear what a new caller should do: if a caller (consider
the fork() early CoW user) knows some allocation will definitely bypass the
vma reservation, should it set bypass_vma_reservations=true?  The way I
named it suggests people never set it, which makes the flag easier to
comprehend and also keeps the special case contained (then one day maybe we
can drop it.. e.g. if we want to drop the parent-steal-child-page behavior
somehow, which is tricky).

PS: Miaohe cleaned up one such use of it too in 88ce3fef47f3 ("hugetlbfs:
remove meaningless variable avoid_reserve"); basically, when the name is
too general people may try to fill something random into it.. and that can
make the special case too hard to follow.
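To illustrate with a hypothetical (not existing) call site:

        /*
         * Hypothetical new caller that knows its allocation won't consume
         * the vma reservation.  With the general name this reads like a
         * legitimate request:
         */
        folio = alloc_hugetlb_folio(vma, addr, /* bypass_vma_reservations */ true);

        /*
         * With the specific name, the same line immediately looks
         * suspicious unless the caller really is the CoW-from-owner path:
         */
        folio = alloc_hugetlb_folio(vma, addr, /* cow_from_owner */ true);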
> {
> struct hugepage_subpool *spool = subpool_vma(vma);
> struct hstate *h = hstate_vma(vma);
> struct folio *folio;
> - long retval, gbl_chg;
> - map_chg_state map_chg;
> int ret, idx;
> struct hugetlb_cgroup *h_cg = NULL;
> gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> + bool vma_reservation_exists = false;
> + bool subpool_reservation_exists;
> + bool debit_subpool;
> + bool charge_cgroup_rsvd;
>
> idx = hstate_index(h);
>
> - /* Whether we need a separate per-vma reservation? */
> - if (cow_from_owner) {
> - /*
> - * Special case! Since it's a CoW on top of a reserved
> - * page, the private resv map doesn't count. So it cannot
> - * consume the per-vma resv map even if it's reserved.
> - */
> - map_chg = MAP_CHG_ENFORCED;
> - } else {
> + if (!bypass_vma_reservations) {
> /*
> * Examine the region/reserve map to determine if the process
> - * has a reservation for the page to be allocated. A return
> - * code of zero indicates a reservation exists (no change).
> - */
> - retval = vma_needs_reservation(h, vma, addr);
> - if (retval < 0)
> + * has a reservation for the page to be allocated and debit the
> + * reservation. If npages_req == 0, a reservation exists and is
> + * used. If npages_req > 0, a reservation has to be taken either
> + * from the subpool or global pool.
> + */
> + int npages_req = vma_needs_reservation(h, vma, addr);
> + if (npages_req < 0)
> return ERR_PTR(-ENOMEM);
> - map_chg = retval ? MAP_CHG_NEEDED : MAP_CHG_REUSE;
> +
> + vma_reservation_exists = npages_req == 0;
> }
>
> /*
> - * Whether we need a separate global reservation?
> + * If no vma reservation exists, debit the subpool.
> *
> + * Even if we were requested to bypass_vma_reservations, debit the
> + * subpool - the subpool still has to be charged for this allocation.
> + */
> + debit_subpool = !vma_reservation_exists || bypass_vma_reservations;
Things like this are what I wanted to avoid when cleaning this up; if you
look after my patch, such "!AAA || BBB" computations are all gone. To me a
tri-state is much easier to read compared to keeping such calculations
around. But now that I've read your patch, it also proves this can be a
very subjective comment.. so I can understand that we can prefer different
ways to code this for readability.
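To make the comparison concrete, here is a heavily simplified sketch of the
two shapes (ignoring return values and error handling, so not the exact
code in either patch):

        /* Tri-state: consult the state directly at each step */
        if (map_chg)                            /* NEEDED or ENFORCED */
                hugepage_subpool_get_pages(spool, 1);
        if (map_chg != MAP_CHG_ENFORCED)
                vma_commit_reservation(h, vma, addr);

        /* Booleans: the same decisions, but derived from two inputs first */
        debit_subpool = !vma_reservation_exists || bypass_vma_reservations;
        if (debit_subpool)
                hugepage_subpool_get_pages(spool, 1);
        if (!bypass_vma_reservations)
                vma_commit_reservation(h, vma, addr);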
Personally I slightly prefer not having separate states for the spool,
simply because IMHO we shouldn't treat the spool and the global pool
separately: they almost represent the same thing, it's just that a spool
can exist for some inodes, so we try to fetch folios from there first -
either it can have max_pages to throttle per-mount allocations, or
min_pages which means we could already have pre-reserved folios without
asking for one from the global pool.

But I don't really have a strong opinion on this. Again, pretty much
personal preference.. and I respect your preference too.
> +
> + /*
> * Processes that did not create the mapping will have no
> * reserves as indicated by the region/reserve map. Check
> * that the allocation will not exceed the subpool limit.
> * Or if it can get one from the pool reservation directly.
> */
> - if (map_chg) {
> - gbl_chg = hugepage_subpool_get_pages(spool, 1);
> - if (gbl_chg < 0)
> + if (debit_subpool) {
> + int npages_req = hugepage_subpool_get_pages(spool, 1);
> + if (npages_req < 0)
> goto out_end_reservation;
> - } else {
> +
> /*
> - * If we have the vma reservation ready, no need for extra
> - * global reservation.
> - */
> - gbl_chg = 0;
> + * npages_req == 0 indicates a reservation exists for the
> + * allocation in the subpool and can be used. npages_req > 0
> + * indicates that a reservation must be taken from the global
> + * pool.
> + */
> + subpool_reservation_exists = npages_req == 0;
> + } else {
> + /* A vma reservation implies having a subpool reservation. */
> + subpool_reservation_exists = vma_reservation_exists;
> }
>
> /*
> - * If this allocation is not consuming a per-vma reservation,
> - * charge the hugetlb cgroup now.
> + * If no vma reservation exists, charge the cgroup's reserved quota.
> + *
> + * Even if we were requested to bypass_vma_reservations, the cgroup
> + * still has to be charged for this allocation.
> */
> - if (map_chg) {
> + charge_cgroup_rsvd = !vma_reservation_exists || bypass_vma_reservations;
> + if (charge_cgroup_rsvd) {
> ret = hugetlb_cgroup_charge_cgroup_rsvd(
> idx, pages_per_huge_page(h), &h_cg);
> if (ret)
> @@ -3039,12 +3021,23 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> goto out_uncharge_cgroup_reservation;
>
> spin_lock_irq(&hugetlb_lock);
> - /*
> - * glb_chg is passed to indicate whether or not a page must be taken
> - * from the global free pool (global change). gbl_chg == 0 indicates
> - * a reservation exists for the allocation.
> - */
> - folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
> +
> + if (subpool_reservation_exists) {
> + folio = dequeue_hugetlb_folio_with_mpol(h, vma, addr);
> + } else {
> + /*
> + * Since no subpool_reservation_exists, the allocation requires
> + * a new page that was not reserved before. Only dequeue if
> + * there are available pages.
> + */
> + if (available_huge_pages(h)) {
> + folio = dequeue_hugetlb_folio_with_mpol(h, vma, addr);
[1]
> + } else {
> + folio = NULL;
> + /* Fallthrough to allocate a new page. */
> + }
> + }
So this could be why I didn't move gbl_chg out of dequeue_hugetlb_folio()
yet: the condition check can become less obvious, like here.

Meanwhile, I think it's always better to sanity check folio!=NULL at [1],
but then, seeing that we need to handle folio==NULL anyway, maybe it's
easier to do it like the old code and always be prepared for folio==NULL,
whatever the reason (e.g. dequeue_hugetlb_folio can change in the future so
that it returns NULL for more reasons).
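Roughly the shape I mean, as a sketch only (reusing the names from your
patch):

        folio = NULL;
        if (subpool_reservation_exists || available_huge_pages(h))
                folio = dequeue_hugetlb_folio_with_mpol(h, vma, addr);
        /*
         * Always be ready for folio == NULL here, whatever the reason,
         * e.g. the dequeue path may grow more ways to return NULL in
         * the future.
         */
        if (!folio) {
                /* fall back to alloc_buddy_hugetlb_folio_with_mpol() */
        }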
> +
> if (!folio) {
> spin_unlock_irq(&hugetlb_lock);
> folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
> @@ -3057,19 +3050,17 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> }
>
> /*
> - * Either dequeued or buddy-allocated folio needs to add special
> - * mark to the folio when it consumes a global reservation.
> + * If subpool_reservation_exists (and is used for this allocation),
> + * decrement resv_huge_pages to indicate that a reservation was used.
> */
> - if (!gbl_chg) {
> + if (subpool_reservation_exists) {
> folio_set_hugetlb_restore_reserve(folio);
> h->resv_huge_pages--;
> }
>
> hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);
> - /* If allocation is not consuming a reservation, also store the
> - * hugetlb_cgroup pointer on the page.
> - */
> - if (map_chg) {
> +
> + if (charge_cgroup_rsvd) {
> hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
> h_cg, folio);
> }
> @@ -3078,25 +3069,30 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>
> hugetlb_set_folio_subpool(folio, spool);
>
> - if (map_chg != MAP_CHG_ENFORCED) {
> - /* commit() is only needed if the map_chg is not enforced */
> - retval = vma_commit_reservation(h, vma, addr);
> + if (!bypass_vma_reservations) {
> + /*
> + * As long as vma reservations were not bypassed, we need to
> + * commit() to clear up any adds_in_progress in resv_map.
> + */
> + int ret = vma_commit_reservation(h, vma, addr);
> /*
> - * Check for possible race conditions. When it happens..
> - * The page was added to the reservation map between
> - * vma_needs_reservation and vma_commit_reservation.
> - * This indicates a race with hugetlb_reserve_pages.
> + * If there is a discrepancy in reservation status between the
> + * time of vma_needs_reservation() and vma_commit_reservation(),
> + * then the page must have been added to the reservation
> + * map between vma_needs_reservation() and
> + * vma_commit_reservation().
> + *
> * Adjust for the subpool count incremented above AND
> * in hugetlb_reserve_pages for the same page. Also,
> * the reservation count added in hugetlb_reserve_pages
> * no longer applies.
> */
> - if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0)) {
> + if (unlikely(!vma_reservation_exists && ret == 0)) {
> long rsv_adjust;
>
> rsv_adjust = hugepage_subpool_put_pages(spool, 1);
> hugetlb_acct_memory(h, -rsv_adjust);
> - if (map_chg) {
> + if (charge_cgroup_rsvd) {
> spin_lock_irq(&hugetlb_lock);
> hugetlb_cgroup_uncharge_folio_rsvd(
> hstate_index(h), pages_per_huge_page(h),
> @@ -3124,14 +3120,14 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> out_uncharge_cgroup:
> hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
> out_uncharge_cgroup_reservation:
> - if (map_chg)
> + if (charge_cgroup_rsvd)
> hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
> h_cg);
> out_subpool_put:
> - if (map_chg)
> + if (debit_subpool)
> hugepage_subpool_put_pages(spool, 1);
> out_end_reservation:
> - if (map_chg != MAP_CHG_ENFORCED)
> + if (!bypass_vma_reservations)
> vma_end_reservation(h, vma, addr);
> return ERR_PTR(-ENOSPC);
> }
> @@ -5900,6 +5896,12 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
> * be acquired again before returning to the caller, as expected.
> */
> spin_unlock(vmf->ptl);
> +
> + /*
> + * If this is a CoW from the owner of this page, we
> + * bypass_vma_reservations, since the reservation was already consumed
> + * when the hugetlb folio was first allocated before the fork happened.
> + */
> new_folio = alloc_hugetlb_folio(vma, vmf->address, cow_from_owner);
>
> if (IS_ERR(new_folio)) {
Other than that, I see there are quite a few name changes. I'm ok with the
changes that you feel are strongly helpful; in that case feel free to split
them into smaller pieces, either within your gmem series or posted
separately (the latter would invite a broader review).

What I'm more interested in, though, is how you'd move forward from the
current status quo and fit it into your gmem series. The hope is that this
series of mine can at least make moving gmem forward easier (by addressing
some of the major sources of vma references..) so as to drop the vma
implications from the allocations.
I remember that in your series you had your own separate spool in gmem, and
you layered it in such a way that you bypass quite a few of the things
alloc_hugetlb_folio() currently does upfront. I don't remember many details
other than that - I do remember I left some cgroup questions for you, but
those were trivial details, and I still agree with your layering of how
hugetlb is allocated in general. So besides how you would like to further
clean this chunk up, the more important part to me is how this links to
your gmem work, and whether it's acceptable upstream that gmem works with
this model: basically, split hugetlbfs into two chunks and allow other
modules to allocate from the hugetlb pools on their own.
Thanks,
--
Peter Xu