linux-kernel - Re: [PATCH v2 3/7] mm: Add batched versions of ptep_modify_prot

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <9a2129ce-55b6-47e7-a879-00e7982c8ec4@lucifer.local>
Date: Wed, 30 Apr 2025 15:34:55 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Ryan Roberts <ryan.roberts@....com>
Cc: Dev Jain <dev.jain@....com>, akpm@...ux-foundation.org, david@...hat.com,
        willy@...radead.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        catalin.marinas@....com, will@...nel.org, Liam.Howlett@...cle.com,
        vbabka@...e.cz, jannh@...gle.com, anshuman.khandual@....com,
        peterx@...hat.com, joey.gouly@....com, ioworker0@...il.com,
        baohua@...nel.org, kevin.brodsky@....com, quic_zhenhuah@...cinc.com,
        christophe.leroy@...roup.eu, yangyicong@...ilicon.com,
        linux-arm-kernel@...ts.infradead.org, namit@...are.com,
        hughd@...gle.com, yang@...amperecomputing.com, ziy@...dia.com
Subject: Re: [PATCH v2 3/7] mm: Add batched versions of
 ptep_modify_prot_start/commit

On Wed, Apr 30, 2025 at 03:09:50PM +0100, Ryan Roberts wrote:
> On 29/04/2025 14:52, Lorenzo Stoakes wrote:
> > On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
> >> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> >> Architecture can override these helpers.
> >>
> >> Signed-off-by: Dev Jain <dev.jain@....com>
> >> ---
> >>  include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
> >>  1 file changed, 38 insertions(+)
> >>
> >> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> >> index b50447ef1c92..ed287289335f 100644
> >> --- a/include/linux/pgtable.h
> >> +++ b/include/linux/pgtable.h
> >> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> >>  }
> >>  #endif
> >>
> >> +/* See the comment for ptep_modify_prot_start */
> >
> > I feel like you really should add a little more here, perhaps point out
> > that it's batched etc.
> >
> >> +#ifndef modify_prot_start_ptes
> >> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> >> +		unsigned long addr, pte_t *ptep, unsigned int nr)
> >
> > This name is a bit confusing,
>
> On naming, the existing (modern) convention for single-pte helpers is to start
> the function name with ptep_. When we started adding batched versions, we took
> the approach of adding _ptes as a suffix. For example:
>
> set_pte_at()
> ptep_get_and_clear_full()
> ptep_set_wrprotect()
>
> set_ptes()
> get_and_clear_full_ptes()
> wrprotect_ptes()
>
> In this case, we already have ptep_modify_prot_start() and
> ptep_modify_prot_commit() for the existing single-pte helper versions. So
> according to the convention (or at least how I interpret the convention), the
> proposed names seem reasonable.
>

Right, I'm fine with following convention (we should), I just find 'ptes'
really ambiguous. It's not just a -set of PTE entries- it's very explicitly
for a large folio. I'd interpret some 'ptes' case to mean 'any number of
pte entries', though I suppose it'd not in practice be any different if
that were the intended use.

However, the proposed use case is large folio 'sub' PTEs and it'd be useful
in callers to know this is explicitly what you're doing.

I feel like '_batched_ptes' makes it clear it's a _specific_ set of PTE
entriess you're after (not just in effect multiple PTE entries).

However, I'm less insistent on this with a comment that explains what's
going on.

I don't want to hold this up with trivialities around naming...

ASIDE: I continue to absolutely HATE the ambiguity between 'PxD/PTE' and
'PxD/PTE entries' and the fact we use both as a short-hand for each
other. But that's not related to this series, just a pet peeve... :)

> > it's not any ptes, it's those pte entries
> > belonging to a large folio capped to the PTE table right that you are
> > batching right?
>
> Yes, but again by convention, that is captured in the kerneldoc comment for the
> functions. We are operating on a batch of *ptes* not on a folio or batch of
> folios. But it is a requirement of the function that the batch of ptes all lie
> within a single large folio (i.e. the pfns are sequential).

Ack, yeah don't love this nr stuff but fine if it's convention...

>  > Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
> > the name?
> >
> > We definitely need to mention in comment or name or _somewhere_ the intent
> > and motivation for this.
>
> Agreed!

...and luckily we are aligned on this :)

>
> >
> >> +{
> >> +	pte_t pte, tmp_pte;
> >> +
> >
> > are we not validating what 'nr' is? Even with debug asserts? I'm not sure I
> > love this interface, where you require the user to know the number of
> > remaining PTE entries in a PTE table.
>
> For better or worse, that's the established convention. See part of comment for
> set_ptes() for example:
>
> """
>  * Context: The caller holds the page table lock.  The pages all belong
>  * to the same folio.  The PTEs are all in the same PMD.
> """
>
> >
> >> +	pte = ptep_modify_prot_start(vma, addr, ptep);
> >> +	while (--nr) {
> >
> > This loop is a bit horrible. It seems needlessly confusing and you're in
> > _dire_ need of comments to explain what's going on.
> >
> > So my understanding is, you have the user figure out:
> >
> > nr = min(nr_pte_entries_in_pte, nr_pgs_in_folio)
> >
> > Then, you want to return the pte entry belonging to the start of the large
> > folio batch, but you want to adjust that pte value to propagate dirty and
> > young page table flags if any page table entries within the range contain
> > those page table flags, having called ptep_modify_prot_start() on all of
> > them?
> >
> > This is quite a bit to a. put in a header like this and b. not
> > comment/explain.
>
> This style is copied from get_and_clear_full_ptes(), which has this comment,
> which explains all this complexity. My vote would be to have a simple comment
> for this function:
>
> /**
>  * get_and_clear_full_ptes - Clear present PTEs that map consecutive pages of
>  *			     the same folio, collecting dirty/accessed bits.
>  * @mm: Address space the pages are mapped into.
>  * @addr: Address the first page is mapped at.
>  * @ptep: Page table pointer for the first entry.
>  * @nr: Number of entries to clear.
>  * @full: Whether we are clearing a full mm.
>  *
>  * May be overridden by the architecture; otherwise, implemented as a simple
>  * loop over ptep_get_and_clear_full(), merging dirty/accessed bits into the
>  * returned PTE.
>  *
>  * Note that PTE bits in the PTE range besides the PFN can differ. For example,
>  * some PTEs might be write-protected.
>  *
>  * Context: The caller holds the page table lock.  The PTEs map consecutive
>  * pages that belong to the same folio.  The PTEs are all in the same PMD.
>  */
>

OK I think the key bit here is 'consecutive pages of the same folio'.

I'd like at least a paragraph about implementation, yes the original
function doesn't have that (and should imo), something like:

	We perform the operation on the first PTE, then if any others
	follow, we invoke the ptep_modify_prot_start() for each and
	aggregate A/D bits.

Something like this.

Point taken on consistency though!

> >
> > So maybe something like:
> >
> > pte = ptep_modify_prot_start(vma, addr, ptep);
> >
> > /* Iterate through large folio tail PTEs. */
> > for (pg = 1; pg < nr; pg++) {
> > 	pte_t inner_pte;
> >
> > 	ptep++;
> > 	addr += PAGE_SIZE;
> >
> > 	inner_pte = ptep_modify_prot_start(vma, addr, ptep);
> >
> > 	/* We must propagate A/D state from tail PTEs. */
> > 	if (pte_dirty(inner_pte))
> > 		pte = pte_mkdirty(pte);
> > 	if (pte_young(inner_pte))
> > 		pte = pte_mkyoung(pte);
> > }
> >
> > Would work better?
> >
> >
> >
> >> +		ptep++;
> >> +		addr += PAGE_SIZE;
> >> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);
> >
> >
> >
> >> +		if (pte_dirty(tmp_pte))
> >> +			pte = pte_mkdirty(pte);
> >> +		if (pte_young(tmp_pte))
> >> +			pte = pte_mkyoung(pte);
> >
> > Why are you propagating these?
> >
> >> +	}
> >> +	return pte;
> >> +}
> >> +#endif
> >> +
> >> +/* See the comment for ptep_modify_prot_commit */
> >
> > Same comments as above, needs more meat on the bones!
> >
> >> +#ifndef modify_prot_commit_ptes
> >> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,
> >
> > Again need to reference large folio, batched or something relevant here,
> > 'ptes' is super vague.
> >
> >> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)
> >
> > Nit, but you put 'p' suffix on ptep but not on 'old_pte'?
> >
> > I'm even more concerned about the 'nr' API here now.
> >
> > So this is now a user-calculated:
> >
> > min3(large_folio_pages, number of pte entries left in ptep,
> > 	number of pte entries left in old_pte)
> >
> > It really feels like something that should be calculated here, or at least
> > be broken out more clearly.
> >
> > You definitely _at the very least_ need to document it in a comment.
> >
> >> +{
> >> +	for (;;) {
> >> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> >> +		if (--nr == 0)
> >> +			break;
> >
> > Why are you doing an infinite loop here with a break like this? Again feels
> > needlessly confusing.
>
> I agree it's not pretty to look at. But apparently it's the most efficient. This
> is Willy's commit that started it all: Commit bcc6cc832573 ("mm: add default
> definition of set_ptes()").
>
> For the record, I think all your comments make good sense, Lorenzo. But there is
> an established style, and personally I think at this point is it more confusing
> to break from that style.

This isn't _quite_ style, I'd say it's implementation, we're kind of
crossing over into something a little more I'd say :) but I mean I get your
point, sure.

I mean, fine, if (I presume you're referring _only_ to the for (;;) case
above) you are absolutely certain it is more performant in practice I
wouldn't want to stand in the way of that.

I would at least like a comment in the commit message about propagating an
odd loop for performance though to explain the 'qualities'... :)

>
> Thanks,
> Ryan
>
>
> >
> > I think it's ok to duplicate this single line for the sake of clarity,
> > also.
> >
> > Which gives us:
> >
> > unsigned long pg;
> >
> > ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> > for (pg = 1; pg < nr; pg++) {
> > 	ptep++;
> > 	addr += PAGE_SIZE;
> > 	old_pte = pte_next_pfn(old_pte);
> > 	pte = pte_next_pfn(pte);
> >
> > 	ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> > }
> >
> > There are alternative approaches, but I think doing an infinite loop that
> > breaks and especially the confusing 'if (--foo) break;' stuff is much
> > harder to parse than a super simple ranged loop.
> >
> >> +		ptep++;
> >> +		addr += PAGE_SIZE;
> >> +		old_pte = pte_next_pfn(old_pte);
> >> +		pte = pte_next_pfn(pte);
> >> +	}
> >> +}
> >> +#endif
> >> +
> >>  /*
> >>   * On some architectures hardware does not set page access bit when accessing
> >>   * memory page, it is responsibility of software setting this bit. It brings
> >> --
> >> 2.30.2
> >>
>