linux-kernel - Re: [PATCH v5 22/25] mm: Add pte_batch_hint() to reduce scanning in folio_pte

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6d452a1a-1edc-4e97-8b39-99dc48315bb8@redhat.com>
Date: Mon, 12 Feb 2024 14:43:35 +0100
From: David Hildenbrand <david@...hat.com>
To: Ryan Roberts <ryan.roberts@....com>,
 Catalin Marinas <catalin.marinas@....com>, Will Deacon <will@...nel.org>,
 Ard Biesheuvel <ardb@...nel.org>, Marc Zyngier <maz@...nel.org>,
 James Morse <james.morse@....com>, Andrey Ryabinin <ryabinin.a.a@...il.com>,
 Andrew Morton <akpm@...ux-foundation.org>,
 Matthew Wilcox <willy@...radead.org>, Mark Rutland <mark.rutland@....com>,
 Kefeng Wang <wangkefeng.wang@...wei.com>, John Hubbard
 <jhubbard@...dia.com>, Zi Yan <ziy@...dia.com>,
 Barry Song <21cnbao@...il.com>, Alistair Popple <apopple@...dia.com>,
 Yang Shi <shy828301@...il.com>, Nicholas Piggin <npiggin@...il.com>,
 Christophe Leroy <christophe.leroy@...roup.eu>,
 "Aneesh Kumar K.V" <aneesh.kumar@...nel.org>,
 "Naveen N. Rao" <naveen.n.rao@...ux.ibm.com>,
 Thomas Gleixner <tglx@...utronix.de>, Ingo Molnar <mingo@...hat.com>,
 Borislav Petkov <bp@...en8.de>, Dave Hansen <dave.hansen@...ux.intel.com>,
 "H. Peter Anvin" <hpa@...or.com>
Cc: linux-arm-kernel@...ts.infradead.org, x86@...nel.org,
 linuxppc-dev@...ts.ozlabs.org, linux-mm@...ck.org,
 linux-kernel@...r.kernel.org
Subject: Re: [PATCH v5 22/25] mm: Add pte_batch_hint() to reduce scanning in
 folio_pte_batch()

On 02.02.24 09:07, Ryan Roberts wrote:
> Some architectures (e.g. arm64) can tell from looking at a pte, if some
> follow-on ptes also map contiguous physical memory with the same pgprot.
> (for arm64, these are contpte mappings).
> 
> Take advantage of this knowledge to optimize folio_pte_batch() so that
> it can skip these ptes when scanning to create a batch. By default, if
> an arch does not opt-in, folio_pte_batch() returns a compile-time 1, so
> the changes are optimized out and the behaviour is as before.
> 
> arm64 will opt-in to providing this hint in the next patch, which will
> greatly reduce the cost of ptep_get() when scanning a range of contptes.
> 
> Tested-by: John Hubbard <jhubbard@...dia.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@....com>
> ---
>   include/linux/pgtable.h | 18 ++++++++++++++++++
>   mm/memory.c             | 20 +++++++++++++-------
>   2 files changed, 31 insertions(+), 7 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index 50f32cccbd92..cba31f177d27 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -212,6 +212,24 @@ static inline int pmd_dirty(pmd_t pmd)
>   #define arch_flush_lazy_mmu_mode()	do {} while (0)
>   #endif
>   
> +#ifndef pte_batch_hint
> +/**
> + * pte_batch_hint - Number of pages that can be added to batch without scanning.
> + * @ptep: Page table pointer for the entry.
> + * @pte: Page table entry.
> + *
> + * Some architectures know that a set of contiguous ptes all map the same
> + * contiguous memory with the same permissions. In this case, it can provide a
> + * hint to aid pte batching without the core code needing to scan every pte.

I think we might want to document here the expectation regarding
dirty/accessed bits. folio_pte_batch() will ignore dirty bits only with
FPB_IGNORE_DIRTY. But especially for arm64, it makes sense to ignore them
always when batching, because the dirty bit may target any pte part of the
cont-pte group either way.

Maybe something like:

"
An architecture implementation may only ignore the PTE accessed and dirty bits.
Further, it may only ignore the dirty bit if that bit is already not
maintained with precision per PTE inside the hinted batch, and ptep_get()
would already have to collect it from various PTEs.
"

I think there are some more details to it, but I'm hoping something along
the lines above is sufficient.


> +
>   #ifndef pte_advance_pfn
>   static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
>   {
> diff --git a/mm/memory.c b/mm/memory.c
> index 65fbe4f886c1..902665b27702 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -988,16 +988,21 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>   {
>   	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>   	const pte_t *end_ptep = start_ptep + max_nr;
> -	pte_t expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, 1), flags);
> -	pte_t *ptep = start_ptep + 1;
> +	pte_t expected_pte = __pte_batch_clear_ignored(pte, flags);
> +	pte_t *ptep = start_ptep;
>   	bool writable;
> +	int nr;
>   
>   	if (any_writable)
>   		*any_writable = false;
>   
>   	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>   
> -	while (ptep != end_ptep) {
> +	nr = pte_batch_hint(ptep, pte);
> +	expected_pte = pte_advance_pfn(expected_pte, nr);
> +	ptep += nr;
> +

*Maybe* it's easier to get when initializing expected_pte+ptep only once.

Like:

[...]
pte_t expected_pte, *ptep;
[...]

nr = pte_batch_hint(start_ptep, pte);
expected_pte = __pte_batch_clear_ignored(pte_advance_pfn(pte, nr), flags);
ptep = start_ptep + nr;

> +	while (ptep < end_ptep) {
>   		pte = ptep_get(ptep);
>   		if (any_writable)
>   			writable = !!pte_write(pte);
> @@ -1011,17 +1016,18 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>   		 * corner cases the next PFN might fall into a different
>   		 * folio.
>   		 */
> -		if (pte_pfn(pte) == folio_end_pfn)
> +		if (pte_pfn(pte) >= folio_end_pfn)
>   			break;
>   
>   		if (any_writable)
>   			*any_writable |= writable;
>   
> -		expected_pte = pte_advance_pfn(expected_pte, 1);
> -		ptep++;
> +		nr = pte_batch_hint(ptep, pte);
> +		expected_pte = pte_advance_pfn(expected_pte, nr);
> +		ptep += nr;
>   	}
>   
> -	return ptep - start_ptep;
> +	return min(ptep - start_ptep, max_nr);
>   }

Acked-by: David Hildenbrand <david@...hat.com>

-- 
Cheers,

David / dhildenb