Message-ID: <55cc1cef-1ece-48c3-af17-22ece8df5ed3@lucifer.local>
Date: Mon, 19 May 2025 10:04:43 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Dev Jain <dev.jain@....com>
Cc: akpm@...ux-foundation.org, Liam.Howlett@...cle.com, vbabka@...e.cz,
        jannh@...gle.com, pfalcato@...e.de, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org, david@...hat.com, peterx@...hat.com,
        ryan.roberts@....com, mingo@...nel.org, libang.li@...group.com,
        maobibo@...ngson.cn, zhengqi.arch@...edance.com, baohua@...nel.org,
        anshuman.khandual@....com, willy@...radead.org, ioworker0@...il.com,
        yang@...amperecomputing.com, baolin.wang@...ux.alibaba.com,
        ziy@...dia.com, hughd@...gle.com
Subject: Re: [PATCH v2 2/2] mm: Optimize mremap() by PTE batching

On Sun, May 18, 2025 at 01:47:35PM +0530, Dev Jain wrote:
>
>
> On 08/05/25 3:34 pm, Lorenzo Stoakes wrote:
> > Before getting into the review, just to say thanks for refactoring as per
> > my (and of course others') comments - much appreciated and a big improvement!
> > :)
> >
> > We're getting there...
> >
> > On Wed, May 07, 2025 at 11:32:56AM +0530, Dev Jain wrote:
> > > To use PTE batching, we want to determine whether the folio mapped by
> > > the PTE is large, thus requiring the use of vm_normal_folio(). We want
> > > to avoid the cost of vm_normal_folio() if the code path doesn't already
> > > require the folio. For arm64, pte_batch_hint() does the job. To generalize
> > > this hint, add a helper which will determine whether two consecutive PTEs
> > > point to consecutive PFNs, in which case there is a high probability that
> > > the underlying folio is large.
> > > Next, use folio_pte_batch() to optimize move_ptes(). On arm64, if the ptes
> > > are painted with the contig bit, then ptep_get() will iterate through all 16
> > > entries to collect a/d bits. Hence this optimization will result in a 16x
> > > reduction in the number of ptep_get() calls. Next, ptep_get_and_clear()
> > > will eventually call contpte_try_unfold() on every contig block, thus
> > > flushing the TLB for the complete large folio range. Instead, use
> > > get_and_clear_full_ptes() so as to elide TLBIs on each contig block, and only
> > > do them on the starting and ending contig block.
> > >
> > > Signed-off-by: Dev Jain <dev.jain@....com>
> > > ---
> > >   include/linux/pgtable.h | 29 +++++++++++++++++++++++++++++
> > >   mm/mremap.c             | 37 ++++++++++++++++++++++++++++++-------
> > >   2 files changed, 59 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > > index b50447ef1c92..38dab1f562ed 100644
> > > --- a/include/linux/pgtable.h
> > > +++ b/include/linux/pgtable.h
> > > @@ -369,6 +369,35 @@ static inline pgd_t pgdp_get(pgd_t *pgdp)
> > >   }
> > >   #endif
> > >
> > > +/**
> > > + * maybe_contiguous_pte_pfns - Hint whether the page mapped by the pte belongs
> > > + * to a large folio.
> > > + * @ptep: Pointer to the page table entry.
> > > + * @pte: The page table entry.
> > > + *
> > > + * This helper is invoked when the caller wants to batch over a set of ptes
> > > + * mapping a large folio, but the concerned code path does not already have
> > > + * the folio. We want to avoid the cost of vm_normal_folio() only to find that
> > > + * the underlying folio was small; i.e. keep the small folio case as fast as
> > > + * possible.
> > > + *
> > > + * The caller must ensure that ptep + 1 exists.
> > > + */
> > > +static inline bool maybe_contiguous_pte_pfns(pte_t *ptep, pte_t pte)
> > > +{
> > > +	pte_t *next_ptep, next_pte;
> > > +
> > > +	if (pte_batch_hint(ptep, pte) != 1)
> > > +		return true;
> > > +
> > > +	next_ptep = ptep + 1;
> > > +	next_pte = ptep_get(next_ptep);
> > > +	if (!pte_present(next_pte))
> > > +		return false;
> > > +
> > > +	return unlikely(pte_pfn(next_pte) - pte_pfn(pte) == 1);
> >
> > Let's not do unlikely()'s unless we have data for them... it shouldn't mean
> > 'what the programmer believes' :)
> >
> > > +}
> >
> > Yeah, I'm with Andrew and Anshuman - I mean this is kind of a nasty
> > interface (_perhaps_ unavoidably so), and we've done the relevant check in
> > mremap_folio_pte_batch(), so let's just move it there with comments.
> >
> > > +
> > >   #ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> > >   static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> > >   					    unsigned long address,
> > > diff --git a/mm/mremap.c b/mm/mremap.c
> > > index 0163e02e5aa8..9c88a276bec4 100644
> > > --- a/mm/mremap.c
> > > +++ b/mm/mremap.c
> > > @@ -170,6 +170,23 @@ static pte_t move_soft_dirty_pte(pte_t pte)
> > >   	return pte;
> > >   }
> > >
> > > +/* mremap a batch of PTEs mapping the same large folio */
> > > +static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> > > +		pte_t *ptep, pte_t pte, int max_nr)
> > > +{
> > > +	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> > > +	struct folio *folio;
> > > +	int nr = 1;
> > > +
> > > +	if ((max_nr != 1) && maybe_contiguous_pte_pfns(ptep, pte)) {
> > > +		folio = vm_normal_folio(vma, addr, pte);
> > > +		if (folio && folio_test_large(folio))
> > > +			nr = folio_pte_batch(folio, addr, ptep, pte, max_nr,
> > > +					     flags, NULL, NULL, NULL);
> > > +	}
> >
> > This needs some refactoring, avoid nesting at all costs :)
> >
> > We'll want to move the maybe_contiguous_pte_pfns() function over here, so
> > that'll change things, but in general let's use a guard clause.
> >
> > So an if block like:
> >
> > if (foo) {
> > 	... bunch of logic ...
> > }
> >
> > Is better replaced with a guard clause so you have:
> >
> > if (!foo)
> > 	return ...;
> >
> > ... bunch of logic ...
> >
> > Here we could really expand things out to make it SUPER clear, like:
> >
> > static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
> > 		pte_t *ptep, pte_t pte, int max_nr)
> > {
> > 	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
> > 	struct folio *folio;
> >
> > 	if (max_nr == 1)
> > 		return 1;
> > 	if (!maybe_contiguous_pte_pfns(ptep, pte)) // obviously replace with open code...
> > 		return 1;
> >
> > 	folio = vm_normal_folio(vma, addr, pte);
> > 	if (!folio)
> > 		return 1;
> > 	if (!folio_test_large(folio))
> > 		return 1;
> >
> > 	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags,
> > 			       NULL, NULL, NULL);
> > }
> >
> > I mean you could argue that assigning to nr would be neater here, but you get the point!
> >
> > David mentioned a point about this code over in the v1 discussion (see
> > [0]). Trying to bring the conversation here to avoid it being split across
> > old/new series. There he said:
> >
> > David H:
> > > (2) Do we really need "must be part of the same folio", or could be just batch over present
> > > ptes that map consecutive PFNs? In that case, a helper that avoids folio_pte_batch() completely
> > > might be better.
> >
> > Hm, if we didn't do the batch test, can we batch a split large folio here OK?
> > I'm guessing we can, in which case this check is actually limiting...
> >
> > Are we _explicitly_ only considering the cont pte case and ignoring the
> > split THP case?
> >
> > [0]: https://lore.kernel.org/all/887fb371-409e-4dad-b4ff-38b85bfddf95@redhat.com/
> >
> > And in what circumstances will the hint be set, with a present subsequent
> > PTE but !folio_test_large()?
> >
> > I guess the hint might not be taken? But then isn't the valid check just
> > folio_test_large() and we don't need this batched check at all?
> >
> > Is it only to avoid the split THP case?
> >
> > We definitely need some clarity here, and a comment in the code explaining
> > what's going on, as this is subtle stuff.
>
> I am focussed only on batching large folios. Split THPs won't be batched;
> you could use pte_batch() (from David's refactoring) to figure out the
> split THP batch, but then get_and_clear_full_ptes() would gather a/d bits
> and smear them across the batch, which would be incorrect. Even if we
> introduced a new version of get_and_clear_full_ptes() which does not gather
> a/d bits, then when the pte batch actually does belong to a single folio we
> would *not* be smearing a/d bits, which is again wrong. So in any case we
> must know what the underlying folio looks like :) So my agenda for v3 is,

Right, ack, there being a large folio per se doesn't mean A/D collection is
appropriate in the THP case.

I guess then this is really _only_ about the mTHP case, where essentially
the other PTEs are to be disregarded going forwards?
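
To spell out the failure mode for anyone following along - a toy userspace
model of the gather-and-smear behaviour (plain C sketch, not kernel code;
toy_pte and toy_get_and_clear are made up purely for illustration):

#include <stdbool.h>
#include <stdio.h>

/* Toy model: a "pte" is just a pfn plus a dirty bit. */
struct toy_pte { unsigned long pfn; bool dirty; };

/*
 * Models the gathering semantics of get_and_clear_full_ptes(): clear the
 * batch and return the first entry with dirty collected across the lot.
 */
static struct toy_pte toy_get_and_clear(struct toy_pte *ptep, int nr)
{
	struct toy_pte pte = ptep[0];
	int i;

	for (i = 0; i < nr; i++) {
		pte.dirty |= ptep[i].dirty;		/* gather across batch */
		ptep[i] = (struct toy_pte){ 0 };	/* clear the entry */
	}
	return pte;
}

int main(void)
{
	/*
	 * Consecutive PFNs, but pretend they belong to two different
	 * (split) folios: only the second entry is dirty.
	 */
	struct toy_pte ptes[2] = { { 100, false }, { 101, true } };
	struct toy_pte pte = toy_get_and_clear(ptes, 2);

	/*
	 * Writing 'pte' back over the whole batch (as set_ptes() would)
	 * now marks PFN 100's folio dirty too - the smearing Dev describes.
	 */
	printf("collected dirty = %d\n", pte.dirty);	/* prints 1 */
	return 0;
}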

>
> - Incorporate your refactoring comments
> - Remove maybe_contiguous_pte_pfns and just use vm_normal_folio +
> folio_test_large
> - Fix indentation
>
> Sounds good?

Sure, but can we hold off until the mprotect() stuff is done first, please?
I mean obviously you're free to do things as you like, but this will help
workload-wise on my side :>)
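
(For when you do get back to it - roughly the shape I'd expect, as a sketch
only, combining the guard-clause layout above with dropping the helper as you
describe:

static int mremap_folio_pte_batch(struct vm_area_struct *vma, unsigned long addr,
		pte_t *ptep, pte_t pte, int max_nr)
{
	const fpb_t flags = FPB_IGNORE_DIRTY | FPB_IGNORE_SOFT_DIRTY;
	struct folio *folio;

	if (max_nr == 1)
		return 1;

	folio = vm_normal_folio(vma, addr, pte);
	if (!folio || !folio_test_large(folio))
		return 1;

	return folio_pte_batch(folio, addr, ptep, pte, max_nr, flags,
			       NULL, NULL, NULL);
}

Untested, so take with a pinch of salt.)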

Thanks!

>
> >
> > > +	return nr;
> > > +}
> > > +
> > >   static int move_ptes(struct pagetable_move_control *pmc,
> > >   		unsigned long extent, pmd_t *old_pmd, pmd_t *new_pmd)
> > >   {
> > > @@ -177,7 +194,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   	bool need_clear_uffd_wp = vma_has_uffd_without_event_remap(vma);
> > >   	struct mm_struct *mm = vma->vm_mm;
> > >   	pte_t *old_ptep, *new_ptep;
> > > -	pte_t pte;
> > > +	pte_t old_pte, pte;
> > >   	pmd_t dummy_pmdval;
> > >   	spinlock_t *old_ptl, *new_ptl;
> > >   	bool force_flush = false;
> > > @@ -186,6 +203,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   	unsigned long old_end = old_addr + extent;
> > >   	unsigned long len = old_end - old_addr;
> > >   	int err = 0;
> > > +	int max_nr;
> > >
> > >   	/*
> > >   	 * When need_rmap_locks is true, we take the i_mmap_rwsem and anon_vma
> > > @@ -236,12 +254,13 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   	flush_tlb_batched_pending(vma->vm_mm);
> > >   	arch_enter_lazy_mmu_mode();
> > >
> > > -	for (; old_addr < old_end; old_ptep++, old_addr += PAGE_SIZE,
> > > -				   new_ptep++, new_addr += PAGE_SIZE) {
> > > -		if (pte_none(ptep_get(old_ptep)))
> > > +	for (int nr = 1; old_addr < old_end; old_ptep += nr, old_addr += nr * PAGE_SIZE,
> > > +				   new_ptep += nr, new_addr += nr * PAGE_SIZE) {
> >
> > Really nitty thing, but the indentation is all messed up here - I mean
> > nothing is going to look nice, but maybe indent by two tabs below the 'for'.
> >
> > I'm not a fan of this declaration of nr - typically a declaration here in a
> > for loop would be the counter, so this is just confusing.
> >
> > In the old implementation, declaring nr in the for loop would make sense,
> > but in the newly refactored one you should just declare it at the top.
> >
> > Also, as per Anshuman's review, I think nr_ptes and max_nr_ptes would be better names.
> >
> > I don't think 'nr' needs to be initialised either, since the conditional is
> > 'old_addr < old_end' and you _should_ only perform the increments after the
> > loop body has run and assigned it.
> > > +		max_nr = (old_end - old_addr) >> PAGE_SHIFT;
> > > +		old_pte = ptep_get(old_ptep);
> > > +		if (pte_none(old_pte))
> >
> > This seems broken.
> >
> > You're missing an nr assignment here, so you'll end up advancing by the
> > number of pages of the last folio you encountered?
> >
> > Should be:
> >
> > 	if (pte_none(old_pte)) {
> > 		nr_ptes = 1;
> > 		continue;
> > 	}
> >
> > Or, alternatively, you can reset nr_ptes to 1 at the start of each loop.
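> >
> > i.e. putting it together, a sketch of just the loop control (using the
> > nr_ptes/max_nr_ptes naming suggested above, untested):
> >
> > 	for (; old_addr < old_end; old_ptep += nr_ptes,
> > 			old_addr += nr_ptes * PAGE_SIZE,
> > 			new_ptep += nr_ptes, new_addr += nr_ptes * PAGE_SIZE) {
> > 		nr_ptes = 1;	/* reset, so a none PTE advances by one */
> > 		max_nr_ptes = (old_end - old_addr) >> PAGE_SHIFT;
> > 		old_pte = ptep_get(old_ptep);
> > 		if (pte_none(old_pte))
> > 			continue;
> > 		...
> > 	}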
> >
> >
> > >   			continue;
> > >
> > > -		pte = ptep_get_and_clear(mm, old_addr, old_ptep);
> >
> > >   		/*
> > >   		 * If we are remapping a valid PTE, make sure
> > >   		 * to flush TLB before we drop the PTL for the
> > > @@ -253,8 +272,12 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   		 * the TLB entry for the old mapping has been
> > >   		 * flushed.
> > >   		 */
> > > -		if (pte_present(pte))
> > > +		if (pte_present(old_pte)) {
> > > +			nr = mremap_folio_pte_batch(vma, old_addr, old_ptep,
> > > +						    old_pte, max_nr);
> > >   			force_flush = true;
> > > +		}
> >
> > Thanks, this is much clearer compared to v1
> >
> > > +		pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, 0);
> >
> > Nit but...
> >
> > Can we have a comment indicating what the last parameter refers to? I think
> > David maybe doesn't like this, so obviously if he prefers not to that's fine,
> > but I'm thinking something like:
> >
> > pte = get_and_clear_full_ptes(mm, old_addr, old_ptep, nr, /*full=*/false);
> >
> > I think we are good to just use 'false' here, right? As it's only an int for
> > historical purposes...
> >
> > >   		pte = move_pte(pte, old_addr, new_addr);
> > >   		pte = move_soft_dirty_pte(pte);
> > >
> > > @@ -267,7 +290,7 @@ static int move_ptes(struct pagetable_move_control *pmc,
> > >   				else if (is_swap_pte(pte))
> > >   					pte = pte_swp_clear_uffd_wp(pte);
> > >   			}
> > > -			set_pte_at(mm, new_addr, new_ptep, pte);
> > > +			set_ptes(mm, new_addr, new_ptep, pte, nr);
> > >   		}
> > >   	}
> > >
> > > --
> > > 2.30.2
> > >
>
