Message-ID: <276b70aa-9853-40cc-8e7d-e790166706b5@redhat.com>
Date: Mon, 27 Oct 2025 20:24:40 +0100
From: David Hildenbrand <david@...hat.com>
To: Zhang Qilong <zhangqilong3@...wei.com>, akpm@...ux-foundation.org,
lorenzo.stoakes@...cle.com, Liam.Howlett@...cle.com, vbabka@...e.cz,
rppt@...nel.org, surenb@...gle.com, mhocko@...e.com, jannh@...gle.com,
pfalcato@...e.de
Cc: linux-mm@...ck.org, linux-kernel@...r.kernel.org,
wangkefeng.wang@...wei.com, sunnanyong@...wei.com
Subject: Re: [RFC PATCH 1/3] mm: Introduce can_pte_batch_count() for PTEs
batch optimization.
On 27.10.25 15:03, Zhang Qilong wrote:
> Currently, PTE batching requires folio access, with the batch size
> limited to the PFNs contained within the folio. However, in certain
> cases (such as mremap_folio_pte_batch and mincore_pte_range),
> accessing the folio is unnecessary and expensive.
>
> For scenarios that do not require folio access, this patch introduces
> can_pte_batch_count(). Given contiguous physical addresses and identical
> PTE attribute bits, we can now process more page table entries at once,
> in a batch that is no longer limited to the entries mapped within a
> single folio, while also avoiding the folio access.
>
> Signed-off-by: Zhang Qilong <zhangqilong3@...wei.com>
> ---
> mm/internal.h | 76 +++++++++++++++++++++++++++++++++++++++------------
> 1 file changed, 58 insertions(+), 18 deletions(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 1561fc2ff5b8..92034ca9092d 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -233,61 +233,62 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> pte = pte_wrprotect(pte);
> return pte_mkold(pte);
> }
>
> /**
> - * folio_pte_batch_flags - detect a PTE batch for a large folio
> - * @folio: The large folio to detect a PTE batch for.
> + * can_pte_batch_count - detect a PTE batch in the range [ptep, ptep + max_nr)
I really don't like the name.
Maybe it's just pte_batch().
> * @vma: The VMA. Only relevant with FPB_MERGE_WRITE, otherwise can be NULL.
> * @ptep: Page table pointer for the first entry.
> * @ptentp: Pointer to a COPY of the first page table entry whose flags this
> * function updates based on @flags if appropriate.
> * @max_nr: The maximum number of table entries to consider.
> * @flags: Flags to modify the PTE batch semantics.
> *
> - * Detect a PTE batch: consecutive (present) PTEs that map consecutive
> - * pages of the same large folio in a single VMA and a single page table.
> + * This interface is designed for cases that do not require folio access.
> + * If the folio must be considered, call folio_pte_batch_flags() instead.
> + *
> + * Detect a PTE batch: consecutive (present) PTEs that map consecutive pages
> + * in a single VMA and a single page table.
> *
> * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN,
> * the accessed bit, writable bit, dirty bit (unless FPB_RESPECT_DIRTY is set)
> * and soft-dirty bit (unless FPB_RESPECT_SOFT_DIRTY is set).
> *
> - * @ptep must map any page of the folio. max_nr must be at least one and
> + * @ptep points to the first entry in the range. max_nr must be at least one and
> * must be limited by the caller so scanning cannot exceed a single VMA and
> * a single page table.
> *
> * Depending on the FPB_MERGE_* flags, the pte stored at @ptentp will
> * be updated: it's crucial that a pointer to a COPY of the first
> * page table entry, obtained through ptep_get(), is provided as @ptentp.
> *
> - * This function will be inlined to optimize based on the input parameters;
> - * consider using folio_pte_batch() instead if applicable.
> + * The folio_pte_batch_flags() below deals with PTEs that are mapped in a
> + * single folio, whereas can_pte_batch_count() can handle PTEs mapped in
> + * consecutive folios. If no flags are set, the accessed, writable and
> + * dirty bits are ignored. If a respect flag is set, the respected bit(s)
> + * are compared by pte_same(): when that bit of the PTE advanced via
> + * pte_batch_hint() differs, pte_same() returns false and the batch ends.
> + * This ensures correctness when handling PTEs of multiple folios.
> + *
> + * This function will be inlined to optimize based on the input parameters.
> *
> * Return: the number of table entries in the batch.
> */
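For reference, the core of such a helper would presumably look roughly
like the sketch below (illustrative only, reconstructed from the quoted
kerneldoc; the actual patch body is not quoted here, and only existing
helpers from mm/internal.h and include/linux/pgtable.h are assumed):

static inline unsigned int can_pte_batch_count(struct vm_area_struct *vma,
		pte_t *ptep, pte_t *ptentp, unsigned int max_nr, fpb_t flags)
{
	/* Strip the ignored bits (accessed/writable/dirty, per @flags). */
	pte_t expected_pte = __pte_batch_clear_ignored(*ptentp, flags);
	pte_t pte;
	unsigned int nr = 1;

	/*
	 * @vma is only relevant with FPB_MERGE_WRITE; the FPB_MERGE_*
	 * updating of *ptentp is omitted in this sketch for brevity.
	 */
	for (; nr < max_nr; nr++) {
		/* Advance the expected PFN; note: no folio bound is consulted. */
		expected_pte = pte_advance_pfn(expected_pte, 1);
		pte = __pte_batch_clear_ignored(ptep_get(ptep + nr), flags);

		/* A mismatch in the PFN or any respected bit ends the batch. */
		if (!pte_same(pte, expected_pte))
			break;
	}
	return nr;
}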
I recall trouble if we try batching across folios:
commit 7b08b74f3d99f6b801250683c751d391128799ec (tag: mm-hotfixes-stable-2025-05-10-14-23)
Author: Petr Vaněk <arkamar@...as.cz>
Date: Fri May 2 23:50:19 2025 +0200
mm: fix folio_pte_batch() on XEN PV
On XEN PV, folio_pte_batch() can incorrectly batch beyond the end of a
folio due to a corner case in pte_advance_pfn(). Specifically, when the
PFN following the folio maps to an invalidated MFN,
expected_pte = pte_advance_pfn(expected_pte, nr);
produces a pte_none(). If the actual next PTE in memory is also
pte_none(), the pte_same() succeeds,
if (!pte_same(pte, expected_pte))
break;
the loop is not broken, and batching continues into unrelated memory.
...
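Condensed, the failure mode there was (values illustrative):

	/* XEN PV: the PFN just past the folio maps to an invalidated MFN */
	expected_pte = pte_advance_pfn(expected_pte, nr);  /* yields pte_none() */
	pte = ptep_get(ptep + nr);                         /* also pte_none() */

	if (!pte_same(pte, expected_pte))  /* none == none, so no break */
		break;

A folio-independent batch has no folio boundary to fall back on, so this
case would have to be ruled out explicitly (e.g., by never matching a
none expected PTE).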
--
Cheers
David / dhildenb