lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <8780e63d-22c1-4133-a800-dec50fd1b5fa@lucifer.local>
Date: Tue, 29 Apr 2025 14:52:41 +0100
From: Lorenzo Stoakes <lorenzo.stoakes@...cle.com>
To: Dev Jain <dev.jain@....com>
Cc: akpm@...ux-foundation.org, ryan.roberts@....com, david@...hat.com,
        willy@...radead.org, linux-mm@...ck.org, linux-kernel@...r.kernel.org,
        catalin.marinas@....com, will@...nel.org, Liam.Howlett@...cle.com,
        vbabka@...e.cz, jannh@...gle.com, anshuman.khandual@....com,
        peterx@...hat.com, joey.gouly@....com, ioworker0@...il.com,
        baohua@...nel.org, kevin.brodsky@....com, quic_zhenhuah@...cinc.com,
        christophe.leroy@...roup.eu, yangyicong@...ilicon.com,
        linux-arm-kernel@...ts.infradead.org, namit@...are.com,
        hughd@...gle.com, yang@...amperecomputing.com, ziy@...dia.com
Subject: Re: [PATCH v2 3/7] mm: Add batched versions of
 ptep_modify_prot_start/commit

On Tue, Apr 29, 2025 at 10:53:32AM +0530, Dev Jain wrote:
> Batch ptep_modify_prot_start/commit in preparation for optimizing mprotect.
> Architecture can override these helpers.
>
> Signed-off-by: Dev Jain <dev.jain@....com>
> ---
>  include/linux/pgtable.h | 38 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 38 insertions(+)
>
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index b50447ef1c92..ed287289335f 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -891,6 +891,44 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
>  }
>  #endif
>
> +/* See the comment for ptep_modify_prot_start */

I feel like you really should add a little more here, perhaps point out
that it's batched etc.

> +#ifndef modify_prot_start_ptes
> +static inline pte_t modify_prot_start_ptes(struct vm_area_struct *vma,
> +		unsigned long addr, pte_t *ptep, unsigned int nr)

This name is a bit confusing, it's not any ptes, it's those pte entries
belonging to a large folio capped to the PTE table right that you are
batching right?

Perhaps modify_prot_start_large_folio() ? Or something with 'batched' in
the name?

We definitely need to mention in comment or name or _somewhere_ the intent
and motivation for this.

> +{
> +	pte_t pte, tmp_pte;
> +

are we not validating what 'nr' is? Even with debug asserts? I'm not sure I
love this interface, where you require the user to know the number of
remaining PTE entries in a PTE table.

> +	pte = ptep_modify_prot_start(vma, addr, ptep);
> +	while (--nr) {

This loop is a bit horrible. It seems needlessly confusing and you're in
_dire_ need of comments to explain what's going on.

So my understanding is, you have the user figure out:

nr = min(nr_pte_entries_in_pte, nr_pgs_in_folio)

Then, you want to return the pte entry belonging to the start of the large
folio batch, but you want to adjust that pte value to propagate dirty and
young page table flags if any page table entries within the range contain
those page table flags, having called ptep_modify_prot_start() on all of
them?

This is quite a bit to a. put in a header like this and b. not
comment/explain.

So maybe something like:

pte = ptep_modify_prot_start(vma, addr, ptep);

/* Iterate through large folio tail PTEs. */
for (pg = 1; pg < nr; pg++) {
	pte_t inner_pte;

	ptep++;
	addr += PAGE_SIZE;

	inner_pte = ptep_modify_prot_start(vma, addr, ptep);

	/* We must propagate A/D state from tail PTEs. */
	if (pte_dirty(inner_pte))
		pte = pte_mkdirty(pte);
	if (pte_young(inner_pte))
		pte = pte_mkyoung(pte);
}

Would work better?



> +		ptep++;
> +		addr += PAGE_SIZE;
> +		tmp_pte = ptep_modify_prot_start(vma, addr, ptep);



> +		if (pte_dirty(tmp_pte))
> +			pte = pte_mkdirty(pte);
> +		if (pte_young(tmp_pte))
> +			pte = pte_mkyoung(pte);

Why are you propagating these?

> +	}
> +	return pte;
> +}
> +#endif
> +
> +/* See the comment for ptep_modify_prot_commit */

Same comments as above, needs more meat on the bones!

> +#ifndef modify_prot_commit_ptes
> +static inline void modify_prot_commit_ptes(struct vm_area_struct *vma, unsigned long addr,

Again need to reference large folio, batched or something relevant here,
'ptes' is super vague.

> +		pte_t *ptep, pte_t old_pte, pte_t pte, unsigned int nr)

Nit, but you put 'p' suffix on ptep but not on 'old_pte'?

I'm even more concerned about the 'nr' API here now.

So this is now a user-calculated:

min3(large_folio_pages, number of pte entries left in ptep,
	number of pte entries left in old_pte)

It really feels like something that should be calculated here, or at least
be broken out more clearly.

You definitely _at the very least_ need to document it in a comment.

> +{
> +	for (;;) {
> +		ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
> +		if (--nr == 0)
> +			break;

Why are you doing an infinite loop here with a break like this? Again feels
needlessly confusing.

I think it's ok to duplicate this single line for the sake of clarity,
also.

Which gives us:

unsigned long pg;

ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
for (pg = 1; pg < nr; pg++) {
	ptep++;
	addr += PAGE_SIZE;
	old_pte = pte_next_pfn(old_pte);
	pte = pte_next_pfn(pte);

	ptep_modify_prot_commit(vma, addr, ptep, old_pte, pte);
}

There are alternative approaches, but I think doing an infinite loop that
breaks and especially the confusing 'if (--foo) break;' stuff is much
harder to parse than a super simple ranged loop.

> +		ptep++;
> +		addr += PAGE_SIZE;
> +		old_pte = pte_next_pfn(old_pte);
> +		pte = pte_next_pfn(pte);
> +	}
> +}
> +#endif
> +
>  /*
>   * On some architectures hardware does not set page access bit when accessing
>   * memory page, it is responsibility of software setting this bit. It brings
> --
> 2.30.2
>

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ