Message-ID: <a01349f5-a870-4a13-a4cc-00aba007348f@lucifer.local>
Date: Wed, 19 Nov 2025 12:08:44 +0000
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Nico Pache <npache@...hat.com>
Cc: linux-kernel@...r.kernel.org, linux-trace-kernel@...r.kernel.org,
        linux-mm@...ck.org, linux-doc@...r.kernel.org, david@...hat.com,
        ziy@...dia.com, baolin.wang@...ux.alibaba.com, Liam.Howlett@...cle.com,
        ryan.roberts@....com, dev.jain@....com, corbet@....net,
        rostedt@...dmis.org, mhiramat@...nel.org,
        mathieu.desnoyers@...icios.com, akpm@...ux-foundation.org,
        baohua@...nel.org, willy@...radead.org, peterx@...hat.com,
        wangkefeng.wang@...wei.com, usamaarif642@...il.com,
        sunnanyong@...wei.com, vishal.moola@...il.com,
        thomas.hellstrom@...ux.intel.com, yang@...amperecomputing.com,
        kas@...nel.org, aarcange@...hat.com, raquini@...hat.com,
        anshuman.khandual@....com, catalin.marinas@....com, tiwai@...e.de,
        will@...nel.org, dave.hansen@...ux.intel.com, jack@...e.cz,
        cl@...two.org, jglisse@...gle.com, surenb@...gle.com,
        zokeefe@...gle.com, hannes@...xchg.org, rientjes@...gle.com,
        mhocko@...e.com, rdunlap@...radead.org, hughd@...gle.com,
        richard.weiyang@...il.com, lance.yang@...ux.dev, vbabka@...e.cz,
        rppt@...nel.org, jannh@...gle.com, pfalcato@...e.de
Subject: Re: [PATCH v12 mm-new 12/15] khugepaged: Introduce mTHP collapse
 support

Oh I forgot to add -

In collapse_scan_pmd() there are cases where you just bail out altogether.

E.g.: pte_uffd_wp() for _any_ PTE entry in the range.

Or !folio_test_anon() for _any_ PTE entry in the range.

Etc.

Surely these are cases where an mTHP scan on part of the range might still
succeed?

You then, in the subsequent patch, seem to check for collapse failures
specifically due to some of these, but surely you will never hit them, as
you have already discarded the whole PTE page table?

I'm not sure you've updated collapse_scan_pmd() sufficiently to account for the
mTHP logic.
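
To make that concrete - the scan currently does roughly this for uffd-wp (and
similarly for the !folio_test_anon() case):

	if (pte_uffd_wp(pteval)) {
		/* A single uffd-wp PTE abandons the entire PMD range. */
		result = SCAN_PTE_UFFD_WP;
		goto out_unmap;
	}

Whereas presumably an mTHP-aware scan would want to keep going and only rule
out candidate ranges containing the offending PTE - something like the below
(completely untested, and the 'blocked' bitmap is hypothetical, just to
illustrate the shape):

	if (pte_uffd_wp(pteval)) {
		/*
		 * Hypothetical: remember the offending PTE and keep scanning,
		 * so the bitmap walk can still consider sub-ranges that do
		 * not contain it.
		 */
		bitmap_set(cc->mthp_bitmap_blocked, i, 1);
		continue;
	}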

Cheers, Lorenzo


On Wed, Nov 19, 2025 at 11:53:16AM +0000, Lorenzo Stoakes wrote:
> On Wed, Oct 22, 2025 at 12:37:14PM -0600, Nico Pache wrote:
> > During PMD range scanning, track occupied pages in a bitmap. If mTHPs are
> > enabled we remove the restriction of max_ptes_none during the scan phase
> > to avoid missing potential mTHP candidates.
>
> It's a bit odd to open the commit message with a very specific
> implementation detail; I think we should instead open with a broad
> description of what we intend here, e.g. to permit mTHP collapse, before:
>
> - Discussing the algorithm used (in more detail than below!)
> - How and under what circumstances this algorithm is invoked
> - (Mention MADV_COLLAPSE not supporting mTHP as of yet)
> - THEN super-specific details like this.
>
> >
> > Implement collapse_scan_bitmap() to perform binary recursion on the bitmap
> > and determine the best eligible order for the collapse. A stack struct is
> > used instead of traditional recursion. The algorithm splits the bitmap
> > into smaller chunks to find the best fit mTHP.  max_ptes_none is scaled by
> > the attempted collapse order to determine how "full" an order must be
> > before being considered for collapse.
>
> I feel this is a _very_ brief description of a complicated algorithm. I
> think we should go into a lot more detail here. 'Binary recursion' is pretty
> hand-wavey, and you go from hand waving that to being super-specific about
> max_ptes_none before handwaving about 'fullness' of an order.
>
> All in all I find it super confusing - so I think you need to take a step
> back and 'explain it to me like I'm 5' here :)
>
> >
> > Once we determine what mTHP sizes fits best in that PMD range a collapse
> > is attempted. A minimum collapse order of 2 is used as this is the lowest
> > order supported by anon memory.
>
> I don't know what 'lowest order supported by anon memory' means?
>
> I guess really this is the minimum order contptes support for arm64 right?
>
> >
> > mTHP collapses reject regions containing swapped out or shared pages.
> > This is because adding new entries can lead to new none pages, and these
> > may lead to constant promotion into a higher order (m)THP. A similar
> > issue can occur with "max_ptes_none > HPAGE_PMD_NR/2" due to a collapse
> > introducing at least 2x the number of pages, and on a future scan will
> > satisfy the promotion condition once again. This issue is prevented via
> > the collapse_allowable_orders() function.
>
> Obviously this has been discussed to death, but you should update this to
> reflect the decided upon course (0, 511 + warning, etc.).
>
> >
> > Currently madv_collapse is not supported and will only attempt PMD
> > collapse.
>
> Good to highlight this.
>
> >
> > We can also remove the check for is_khugepaged inside the PMD scan as
> > the collapse_max_ptes_none() function handles this logic now.
>
> Again we're kind of leaping from mega handwaving to super-specific details
> :) let's make it all a lot more specific + clear, and then put the really
> niche details like this at the end of the commit msg (I mean this one is
> fine where it is ofc as a result :)
>
> >
> > Signed-off-by: Nico Pache <npache@...hat.com>
> > ---
> >  include/linux/khugepaged.h |   2 +
> >  mm/khugepaged.c            | 128 ++++++++++++++++++++++++++++++++++---
> >  2 files changed, 122 insertions(+), 8 deletions(-)
> >
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index eb1946a70cff..179ce716e769 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -1,6 +1,8 @@
> >  /* SPDX-License-Identifier: GPL-2.0 */
> >  #ifndef _LINUX_KHUGEPAGED_H
> >  #define _LINUX_KHUGEPAGED_H
> > +#define KHUGEPAGED_MIN_MTHP_ORDER	2
> > +#define MAX_MTHP_BITMAP_STACK	(1UL << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
>
> This is an internal implementation detail, I don't think we need this in a
> header do we? I think this define should be in khugepaged.c.
>
> Also this is a really fiddly and confusing value, I don't think it's a good idea
> to just put this here without explanation.
>
> It's not even clear what it is. I'd probably rename it to MTHP_STACK_SIZE?
>
> We need a comment that explains how you're deriving this, something like:
>
> /*
>  * In order to determine mTHP order, we consider every possible mTHP size
>  * starting with MAX_PTRS_PER_PTE PTE entries and stopping at
>  * 2^KHUGEPAGED_MIN_MTHP_ORDER.
>  *
>  * We store (offset, order) pairs on the stack to do so, each describing a
>  * candidate mTHP collapse.
>  *
>  * For each (offset, order) candidate mTHP range that we consider, we must
>  * also consider candidate mTHPs at (offset, order - 1), and
>  * (offset + (1 << (order - 1)), order - 1).
>  *
>  * This is because each order can be split into two (an order expresses the
>  * power-of-two size), so we examine each half of the next lower order
>  * mTHP:
>  *
>  *                offset   mid_offset
>  *                  .          |
>  *                  .          v
>  *  |---------------.-------------------|
>  *  |           PTE page table          |
>  *  |---------------.-------------------|
>  *                   <--------><-------->
>  *                     order-1   order-1
>  *
>  * Given we must consider the range of KHUGEPAGED_MIN_MTHP_ORDER to
>  * MAX_PTRS_PER_PTE number of PTE entries, this is the same as saying we
>  * must consider KHUGEPAGED_MIN_MTHP_ORDER to lg2(MAX_PTRS_PER_PTE) mTHP
>  * orders.
>  *
>  * As we must consider 2 possible mTHP ranges for each order, this requires
>  * our stack to be 2^n, where n is the number of orders we must consider.
>  *
>  * And thus MTHP_STACK_SIZE is 2^(lg2(MAX_PTRS_PER_PTE) -
>  * KHUGEPAGED_MIN_MTHP_ORDER).
>  */
>
> This may seem (very) long-winded, but this is all really non-obvious.
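>
> (As a sanity check on the arithmetic: with 4K pages, MAX_PTRS_PER_PTE is
> typically 512, so this works out to 1 << (9 - 2) == 128 stack entries.)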
>
> You can additionally rephrase this and utilise it in the commit message,
> description of the iterative recursion function and possibly elsewhere to
> describe the algorithm more clearly.
>
> In fact, since this should really be declared in khugepaged.c, and since
> you can place it just before the mthp collapse function, you could expand
> this to describe the algorithm as a whole and simply put the define and the
> function immediately next to each other after the comment?
>
> >
> >  #include <linux/mm.h>
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 89a105124790..e2319bfd0065 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -93,6 +93,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >
> >  static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> > +struct scan_bit_state {
>
> Scan bit state is a bit of a weird name. Scanning what? What bit? State is
> kind of implied?
>
> struct order_offset_pair?
>
> struct mthp_range?
>
> > +	u8 order;
> > +	u16 offset;
>
> Real mega nit, but feels more natural to put offset first here. As
> (position, size) seems more naturally the way to view this than (size,
> position).
>
> > +};
> > +
>
> Also needs comments...? Order of what? Offset in what?
>
> >  struct collapse_control {
> >  	bool is_khugepaged;
> >
> > @@ -101,6 +106,13 @@ struct collapse_control {
> >
> >  	/* nodemask for allocation fallback */
> >  	nodemask_t alloc_nmask;
> > +
> > +	/*
> > +	 * bitmap used to collapse mTHP sizes.
> > +	 */
>
> Nit but this should be on one line /* Bitmap used to collapse mTHP sizes */.
>
> But we're not storing sizes though are we? And we're declaring two bitmaps?
>
> > +	 DECLARE_BITMAP(mthp_bitmap, HPAGE_PMD_NR);
>
> Really this is more of a PTE table bitmap but probably fine to call it this.
>
> > +	 DECLARE_BITMAP(mthp_bitmap_mask, HPAGE_PMD_NR);
>
> You've added random whitespace after the tab twice here? [tab][space]DECLARE_...
>
> > +	struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_STACK];
> >  };
> >
> >  /**
> > @@ -1357,6 +1369,85 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long pmd_address,
> >  	return result;
> >  }
> >
> > +static void push_mthp_bitmap_stack(struct collapse_control *cc, int *top,
> > +				   u8 order, u16 offset)
>
> Not sure we need to say mthp_bitmap_stack everywhere. This is a local
> static function we can be a little more succinct.
>
> mthp_stack_push()?
>
> > +{
> > +	cc->mthp_bitmap_stack[++*top] = (struct scan_bit_state)
> > +		{ order, offset };
>
> This feels rather difficult to read, cc->mthp_bitmap_stack[++*top] in
> particular is rather too succinct.
>
> This would be better more broken out, e.g.:
>
> static void mthp_stack_push(struct collapse_control *cc, int *sizep,
> 	    u8 order, u16 offset)
> {
> 	const int size = *sizep;
> 	struct scan_bit_state *stack = &cc->mthp_bitmap_stack[size];
>
> 	VM_WARN_ON_ONCE(size >= MAX_MTHP_BITMAP_STACK);
> 	stack->order = order;
> 	stack->offset = offset;
> 	(*sizep)++;
> }
>
> (Note this requires the change I suggest below re: not defaulting top to -1
> but instead renaming it to stack_size and starting at 0, see below).
>
> Re: below comment having pop as a helper too, that can be:
>
> static struct scan_bit_state mthp_stack_pop(struct collapse_control *cc,
> 		int *sizep)
> {
> 	const int size = *sizep;
>
> 	VM_WARN_ON_ONCE(size <= 0);
> 	(*sizep)--;
> 	return cc->mthp_bitmap_stack[size - 1];
> }
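>
> Then the scan loop reads more naturally, roughly (untested, and assuming
> state.order holds the actual order - see below re: the
> KHUGEPAGED_MIN_MTHP_ORDER offsetting):
>
> 	int stack_size = 0;
>
> 	mthp_stack_push(cc, &stack_size, HPAGE_PMD_ORDER, 0);
> 	while (stack_size > 0) {
> 		struct scan_bit_state state = mthp_stack_pop(cc, &stack_size);
> 		...
> 	}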
>
> > +}
> > +
> > +/*
> > + * collapse_scan_bitmap() consumes the bitmap that is generated during
> > + * collapse_scan_pmd() to determine what regions and mTHP orders fit best.
> > + *
> > + * Each bit in the bitmap represents a single occupied (!none/zero) page.
>
> In which bitmap? There are 2 that are declared. Be specific - cc->mthp_bitmap.
>
> > + * A stack structure cc->mthp_bitmap_stack is used to check different regions
> > + * of the bitmap for collapse eligibility. We start at the PMD order and
> > + * check if it is eligible for collapse; if not, we add two entries to the
>
> I questioned this since you start at HPAGE_PMD_ORDER -
> KHUGEPAGED_MIN_MTHP_ORDER, but then realised you're intentionally
> offsetting like that.
>
> See comments below about changing this.
>
> > + * stack at a lower order to represent the left and right halves of the region.
> > + *
> > + * For each region, we calculate the number of set bits and compare it
> > + * against a threshold derived from collapse_max_ptes_none(). A region is
> > + * eligible if the number of set bits exceeds this threshold.
> > + */
>
> I think we could be clearer here. Something like:
>
> ...
>  * stack at a lower order to represent the left and right halves of the
>  * portion of the PTE page table we are examining.
>  *
>  * For each of these, we determine how many PTE entries are occupied in the
>  * range of PTE entries we propose to collapse, then compare this to the
>  * number of PTE entries which would need to be set for a collapse to be
>  * permitted at that order (accounting for max_ptes_none).
>  *
>  * If a collapse is permissible, we attempt to perform one. We do so for
>  * every possible mTHP in the PTE page table.
>  */
>
> > +static int collapse_scan_bitmap(struct mm_struct *mm, unsigned long address,
>
> Really inconsistent naming going on here, we're collapsing and scanning and
> what's the bitmap?
>
> How about mthp_collapse()?
>
> > +		int referenced, int unmapped, struct collapse_control *cc,
> > +		bool *mmap_locked, unsigned long enabled_orders)
> > +{
> > +	u8 order, next_order;
> > +	u16 offset, mid_offset;
> > +	int num_chunks;
> > +	int bits_set, threshold_bits;
> > +	int top = -1;
>
> This seems unnecessary and confusing. Just start at 0 and treat this as the
> exclusive end of the stack.
>
> You can rename this stack_size to make that clear. Have commented above
> about adjustments to push function and introducing pop helper.
>
>
> > +	int collapsed = 0;
> > +	int ret;
> > +	struct scan_bit_state state;
> > +	unsigned int max_none_ptes;
>
> Everywhere else we say max_ptes_none, let's maintain that convention here
> please.
>
> > +
> > +	push_mthp_bitmap_stack(cc, &top, HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0);
>
> See below re: order here, we should change this.
>
> > +
> > +	while (top >= 0) {
> > +		state = cc->mthp_bitmap_stack[top--];
>
> I hate that we have a push helper but then do pop manually. See above.
>
> > +		order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
>
> OK so now the order isn't state.order but is instead state.order +
> KHUGEPAGED_MIN_MTHP_ORDER? :/ this is extremely confusing.
>
> We shouldn't call this field order if you're doing a hack where state.order
> isn't the order but instead is order - KHUGEPAGED_MIN_MTHP_ORDER.
>
> Just have state.order be the actual order? And change the below condition
> as mentioned there.
>
> > +		offset = state.offset;
> > +		num_chunks = 1UL << order;
>
> What's a chunk? You do need to clarify these things. This is a new term not
> used elsewhere in THP code as far as I can tell?
>
> This is the number of pte entries no?
>
> nr_entries? nr_pte_entries?
>
> > +
> > +		/* Skip mTHP orders that are not enabled */
>
> Note we're also considering PMD here :) Probably we can just delete this
> comment, the code below makes it clear what you're doing.
>
> > +		if (!test_bit(order, &enabled_orders))
> > +			goto next_order;
> > +
> > +		max_none_ptes = collapse_max_ptes_none(order, !cc->is_khugepaged);
>
> OK so this is going to be scaled to order.
>
> > +
> > +		/* Calculate weight of the range */
>
> What's the weight of a range? This isn't a very helpful comment. You're
> counting the Hamming weight or much more clearly - the number of set bits.
>
> So it seems you're simply counting the number of bits you have accumulated
> so far in cc->mthp_bitmap, adding in the latest offset.
>
> So I'd say add a comment saying something like:
>
> /*
>  * Determine how many PTE entries are populated in the range in which we
>  * propose to collapse this mTHP.
>  */
>
> > +		bitmap_zero(cc->mthp_bitmap_mask, HPAGE_PMD_NR);
> > +		bitmap_set(cc->mthp_bitmap_mask, offset, num_chunks);
> > +		bits_set = bitmap_weight_and(cc->mthp_bitmap,
>
> I think this variable name is pretty horrible, we don't care that it's the
> number of bits set, we care about what it _means_ - that is the number of
> occupied PTE entries.
>
> So nr_occupied_pte_entries? Or nr_occupied_ptes?
>
> > +					     cc->mthp_bitmap_mask, HPAGE_PMD_NR);
>
> Frustrating there isn't a bitmap_weight_offset() or something, as you could
> do that in one go then...
>
> I think this could be made clearer by separating out the gnarly bitmap
> stuff into a helper function:
>
> static int mthp_nr_occupied_pte_entries(struct collapse_control *cc,
> 		struct scan_bit_state state)
> {
> 	const int nr_pte_entries = 1 << state.order;
>
> 	/* Setup cc->mthp_bitmap_mask to contain mask for candidate mTHP. */
> 	bitmap_zero(cc->mthp_bitmap_mask, HPAGE_PMD_NR);
> 	bitmap_set(cc->mthp_bitmap_mask, state.offset, nr_pte_entries);
> 	/* Mask against actually occupied PTE entries in PTE table. */
> 	return bitmap_weight_and(cc->mthp_bitmap,
> 				 cc->mthp_bitmap_mask, HPAGE_PMD_NR);
> }
>
> > +
> > +		threshold_bits = (1UL << order) - max_none_ptes - 1;
>
> We defined num_chunks as 1UL << order but then don't use it here? :)
>
> I'm not sure we need this to be a separate value, and I don't think we need
> the -1 either, which only confuses matters more.
>
> How about just changing the below conditional to (assuming we've renamed
> num_chunks to something sensible like nr_pte_entries and bits_set to
> nr_occupied_pte_entries):
>
> if (nr_occupied_pte_entries >= nr_pte_entries - max_none_ptes) {
> 	...
> }
>
> > +
> > +		/* Check if the region is eligible based on the threshold */
>
> Probably don't need this comment with the change above.
>
> > +		if (bits_set > threshold_bits) {
> > +			ret = collapse_huge_page(mm, address, referenced,
> > +						 unmapped, cc, mmap_locked,
> > +						 order, offset);
>
> We declare ret at the top of the function scope, then only use it
> here. That's confusing and unnecessary, just declare it in block scope
> here.
>
> > +			if (ret == SCAN_SUCCEED) {
> > +				collapsed += 1UL << order;
>
> Again, we have defined num_chunks (or rather nr_pte_entries) but then
> open-code 1UL << order here - let's just use the value we declared.
>
> > +				continue;
>
> This is kinda subtle, we don't bother considering lower orders any longer
> *in this range*, but do continue to consider mTHP collapse in other
> portions of the PTE page table.
>
> This shouldn't just be a 'continue' :) we need a comment here to explain
> that.
>
> E.g.:
>
> 	/*
> 	 * We have collapsed an mTHP in this range at the maximum order we
> 	 * could, so we do not push lower orders on to the stack.
> 	 */
> 	 continue;
>
>
> > +			}
> > +		}
> > +
> > +next_order:
> > +		if (state.order > 0) {
>
> This is a great example of how this is confusing by making state.order not
> actually be the order but the order - KHUGEPAGED_MIN_MTHP_ORDER.
>
> Just make the order correct and change this to > KHUGEPAGED_MIN_MTHP_ORDER.
>
> > +			next_order = state.order - 1;
>
> Not sure we should have a label and a variable be the same thing.
>
> Also why are we decl'ing next_order at the top of the function but only using here?
>
> Just declare this here, like:
>
> 	if (state.order > KHUGEPAGED_MIN_MTHP_ORDER) {
> 		const u16 new_order = state.order - 1;
>
> 		...
> 	}
>
> > +			mid_offset = offset + (num_chunks / 2);
> > +			push_mthp_bitmap_stack(cc, &top, next_order, mid_offset);
> > +			push_mthp_bitmap_stack(cc, &top, next_order, offset);
>
> I guess one subtlety that wouldn't be obvious at first glance is that
> num_chunks (oh so badly needs a rename :) is a power-of-2 so we never get
> weird 'what if num_chunks is odd' scenarios to worry about.
>
> Probably doesn't really need a comment though. But this _does_ badly need
> an ASCII diagram :):
>
> 	/*
> 	 * The next lowest mTHP order possesses half the number of PTE
> 	 * entries of the current one. We therefore must consider both
> 	 * halves of the current mTHP:
> 	 *
> 	 *                offset   mid_offset
> 	 *                  .          |
> 	 *                  .          v
> 	 *  |---------------.-------------------|
> 	 *  |           PTE page table          |
> 	 *  |---------------.-------------------|
> 	 *                   <--------><-------->
> 	 *                     order-1   order-1
> 	 */
>
> Since writing this I copied this above in another suggestion :P so you
> could always say 'see comment above for details' or something.
>
> > +		}
> > +	}
> > +	return collapsed;
> > +}
>
> I've commented this function to death here, but a few more things to note.
>
> (BTW - I'm sorry I personally _hate_ repeated iterations of review when
> there's stuff you could have commented in prior iterations BUT I think I
> may end up having to once we respin due to the subtleties here.)
>
> - I wonder if we could just use a helper struct to make this all a little
>   easier. Perhaps as it's relatively short code it's not so necessary, but
>   it's a bit horrid to pass around all these parameters all the time. Maybe
>   something for later THP rework.
>
> - Could we exit early if it's obvious that we won't be able to collapse due
>   to max_ptes_none? I mean for one, we could at least check if the next
>   lowest order is empty. If max_ptes_none were 511, then we would have
>   already collapsed, so we can surely throw that out?
>
>   I was thinking we could go 'upwards', starting with the lowest order and
>   increasing order (essentially reverse things) then not collapsing until
>   we can't collapse at a given order (so collapse at next lowest). That
>   might be less efficient though.
>
> - Given that we're going to be only permitting max_ptes_none of 0 and 511
>   for mTHP to start with, maybe things can be simplified - either all bits
>   have to be 1, or we don't care what they are and attempt collapse anyway?
>
>   But then again, maybe it's worth having the generic algorithm in place
>   for future flexibility? Thoughts?
>
> - How much have you tested this? This is pretty subtle stuff... it _seems_
>   correct to me logically, but this is crying out for some userland testing
>   that exhaustively throws every possible permutation of state at this
>   function and asserts it's all correct - rough sketch of what I mean below.
>
> - Are we not missing a bunch of stat counts? Didn't we add a bunch that
>   we're now not actually setting? E.g. if we reject mTHP candidates due to
>   max_ptes_none?
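>
> To illustrate that testing point - a userspace model along these lines
> (scaled down to 16 'PTEs' so every occupancy pattern can be checked
> exhaustively, and assuming the straight power-of-two scaling of
> max_ptes_none that I believe the series uses) would let you assert the
> basic invariants of the bitmap walk:
>
> #include <assert.h>
> #include <stdint.h>
> #include <stdio.h>
>
> #define NR_PTES		16	/* stand-in for HPAGE_PMD_NR */
> #define PMD_ORDER	4	/* ilog2(NR_PTES) */
> #define MIN_ORDER	2	/* KHUGEPAGED_MIN_MTHP_ORDER */
>
> struct range { int order, offset; };
>
> static int weight(uint32_t bits, int offset, int nr)
> {
> 	return __builtin_popcount(bits & (((1u << nr) - 1) << offset));
> }
>
> /* Model of the bitmap walk: returns a bitmap of 'collapsed' PTEs. */
> static uint32_t scan(uint32_t occupied, int max_ptes_none_pmd)
> {
> 	struct range stack[64];
> 	uint32_t collapsed = 0;
> 	int top = 0;
>
> 	stack[top++] = (struct range){ PMD_ORDER, 0 };
> 	while (top > 0) {
> 		struct range r = stack[--top];
> 		int nr = 1 << r.order;
> 		/* Assumed: max_ptes_none scales down with the order. */
> 		int max_none = max_ptes_none_pmd >> (PMD_ORDER - r.order);
>
> 		if (weight(occupied, r.offset, nr) >= nr - max_none) {
> 			collapsed |= ((1u << nr) - 1) << r.offset;
> 			continue;	/* don't consider sub-ranges */
> 		}
> 		if (r.order > MIN_ORDER) {
> 			stack[top++] = (struct range){ r.order - 1, r.offset + nr / 2 };
> 			stack[top++] = (struct range){ r.order - 1, r.offset };
> 		}
> 	}
> 	return collapsed;
> }
>
> int main(void)
> {
> 	for (uint32_t occ = 0; occ < (1u << NR_PTES); occ++) {
> 		/* max_ptes_none maxed out: anything non-empty collapses to PMD order. */
> 		uint32_t col = scan(occ, NR_PTES - 1);
>
> 		assert(col == (occ ? (1u << NR_PTES) - 1 : 0));
>
> 		/* max_ptes_none == 0: only fully-occupied ranges may collapse... */
> 		col = scan(occ, 0);
> 		assert((col & occ) == col);
>
> 		/* ...and every fully-occupied, aligned MIN_ORDER range is covered. */
> 		for (int off = 0; off < NR_PTES; off += 1 << MIN_ORDER) {
> 			uint32_t m = 0xfu << off;
>
> 			if ((occ & m) == m)
> 				assert((col & m) == m);
> 		}
> 	}
> 	printf("all good\n");
> 	return 0;
> }
>
> Obviously a toy, but it's the kind of thing that would catch the off-by-one
> and ordering subtleties discussed above.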
>
> > +
> >  static int collapse_scan_pmd(struct mm_struct *mm,
> >  			     struct vm_area_struct *vma,
> >  			     unsigned long start_addr, bool *mmap_locked,
> > @@ -1364,11 +1455,15 @@ static int collapse_scan_pmd(struct mm_struct *mm,
> >  {
> >  	pmd_t *pmd;
> >  	pte_t *pte, *_pte;
> > +	int i;
> >  	int result = SCAN_FAIL, referenced = 0;
> > -	int none_or_zero = 0, shared = 0;
> > +	int none_or_zero = 0, shared = 0, nr_collapsed = 0;
> >  	struct page *page = NULL;
> > +	unsigned int max_ptes_none;
>
> Correct spelling of this :)
>
> >  	struct folio *folio = NULL;
> >  	unsigned long addr;
> > +	unsigned long enabled_orders;
> > +	bool full_scan = true;
> >  	spinlock_t *ptl;
> >  	int node = NUMA_NO_NODE, unmapped = 0;
> >
> > @@ -1378,16 +1473,29 @@ static int collapse_scan_pmd(struct mm_struct *mm,
> >  	if (result != SCAN_SUCCEED)
> >  		goto out;
> >
> > +	bitmap_zero(cc->mthp_bitmap, HPAGE_PMD_NR);
> >  	memset(cc->node_load, 0, sizeof(cc->node_load));
> >  	nodes_clear(cc->alloc_nmask);
> > +
> > +	enabled_orders = collapse_allowable_orders(vma, vma->vm_flags, cc->is_khugepaged);
> > +
> > +	/*
> > +	 * If PMD is the only enabled order, enforce max_ptes_none, otherwise
> > +	 * scan all pages to populate the bitmap for mTHP collapse.
> > +	 */
>
> Ugh this is quite ugly. I don't really love that we've converted this from
> doing the actual work to _mostly_ just populating the bitmap for the mthp
> logic.
>
> Then again it's only a couple places where this is checked, but it's pretty
> horrible that what once was 'the logic that determines what is being
> considered for THP collapse' is now turned into 'the logic that populates a
> bitmap'.
>
> > +	if (cc->is_khugepaged && enabled_orders == _BITUL(HPAGE_PMD_ORDER))
>
> I think this should be BIT(HPAGE_PMD_ORDER), I realise I reviewed the
> opposite before (or think I did) but as per David we prefer BIT() :)
>
> > +		full_scan = false;
> > +	max_ptes_none = collapse_max_ptes_none(HPAGE_PMD_ORDER, full_scan);
>
> Again really quite nasty, this may as well be:
>
> 	if (cc->is_khugepaged && enabled_orders == BIT(HPAGE_PMD_ORDER))
> 		max_ptes_none = khugepaged_max_ptes_none;
> 	else
> 		max_ptes_none = HPAGE_PMD_NR - 1;
>
> It makes this hack a lot more obvious.
>
> But also, what if !cc->is_khugepaged? We're going to scan everything even
> if we only have PMD? I thought we only considered PMD size for MADV_COLLAPSE?
>
> > +
> >  	pte = pte_offset_map_lock(mm, pmd, start_addr, &ptl);
> >  	if (!pte) {
> >  		result = SCAN_PMD_NULL;
> >  		goto out;
> >  	}
> >
> > -	for (addr = start_addr, _pte = pte; _pte < pte + HPAGE_PMD_NR;
> > -	     _pte++, addr += PAGE_SIZE) {
> > +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> > +		_pte = pte + i;
> > +		addr = start_addr + i * PAGE_SIZE;
>
> That's nicer. I still hate _pte...
>
> >  		pte_t pteval = ptep_get(_pte);
> >  		if (is_swap_pte(pteval)) {
> >  			++unmapped;
> > @@ -1412,8 +1520,7 @@ static int collapse_scan_pmd(struct mm_struct *mm,
> >  		if (pte_none_or_zero(pteval)) {
> >  			++none_or_zero;
> >  			if (!userfaultfd_armed(vma) &&
> > -			    (!cc->is_khugepaged ||
> > -			     none_or_zero <= khugepaged_max_ptes_none)) {
> > +			    none_or_zero <= max_ptes_none) {
>
> Why are we dropping !cc->is_khugepaged?
>
> >  				continue;
> >  			} else {
> >  				result = SCAN_EXCEED_NONE_PTE;
> > @@ -1461,6 +1568,8 @@ static int collapse_scan_pmd(struct mm_struct *mm,
> >  			}
> >  		}
> >
> > +		/* Set bit for occupied pages */
> > +		bitmap_set(cc->mthp_bitmap, i, 1);
>
> Maybe worth highlighting this is now _the entire point_ of the loop.
>
> I wonder if we shouldn't just separate this logic out and name it
> apppropriately? As we're into realms of real confusion here.
>
> >  		/*
> >  		 * Record which node the original page is from and save this
> >  		 * information to cc->node_load[].
> > @@ -1517,9 +1626,12 @@ static int collapse_scan_pmd(struct mm_struct *mm,
> >  out_unmap:
> >  	pte_unmap_unlock(pte, ptl);
> >  	if (result == SCAN_SUCCEED) {
> > -		result = collapse_huge_page(mm, start_addr, referenced,
> > -					    unmapped, cc, mmap_locked,
> > -					    HPAGE_PMD_ORDER, 0);
>
> Hmm... what's actually enforcing that MADV_COLLAPSE isn't using this?
> You've not done any cc->khugepaged checks afaict?
>
> It seems that you _are_ enabling this for MADV_COLLAPSE unless I've missed
> something?
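>
> If the intent for now really is PMD-only for MADV_COLLAPSE, presumably
> something explicit near where enabled_orders is computed, e.g. (untested):
>
> 	if (!cc->is_khugepaged)
> 		enabled_orders = BIT(HPAGE_PMD_ORDER);
>
> would make that restriction obvious - unless collapse_allowable_orders()
> already guarantees it?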
>
> > +		nr_collapsed = collapse_scan_bitmap(mm, start_addr, referenced, unmapped,
> > +					      cc, mmap_locked, enabled_orders);
> > +		if (nr_collapsed > 0)
> > +			result = SCAN_SUCCEED;
> > +		else
> > +			result = SCAN_FAIL;
> >  	}
> >  out:
> >  	trace_mm_khugepaged_scan_pmd(mm, folio, referenced,
> > --
> > 2.51.0
> >
>
> Thanks, Lorenzo
