Message-Id: <20251215184413.19589400a74c2aadb42a2eca@linux-foundation.org>
Date: Mon, 15 Dec 2025 18:44:13 -0800
From: Andrew Morton <akpm@...ux-foundation.org>
To: Ankur Arora <ankur.a.arora@...cle.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org,
 david@...nel.org, bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com,
 mingo@...hat.com, mjguzik@...il.com, luto@...nel.org, peterz@...radead.org,
 tglx@...utronix.de, willy@...radead.org, raghavendra.kt@....com,
 chleroy@...nel.org, ioworker0@...il.com, boris.ostrovsky@...cle.com,
 konrad.wilk@...cle.com
Subject: Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page
 ranges

On Mon, 15 Dec 2025 12:49:21 -0800 Ankur Arora <ankur.a.arora@...cle.com> wrote:

> Clear contiguous page ranges in folio_zero_user() instead of clearing
> a single page at a time. Exposing larger ranges enables extent based
> processor optimizations.
> 
> However, because the underlying clearing primitives do not, or might
> not be able to, call cond_resched() to check if preemption is
> required, limit the worst case preemption latency by doing the
> clearing in units of no more than PROCESS_PAGES_NON_PREEMPT_BATCH.
> 
> For architectures that define clear_pages(), we assume that the
> clearing is fast and define PROCESS_PAGES_NON_PREEMPT_BATCH as 8MB
> worth of pages. This should be large enough to allow the processor
> to optimize the operation, yet small enough to keep preemption
> latency reasonable when this optimization is not possible
> (e.g. slow microarchitectures, memory bandwidth saturation).
> 
> Architectures that don't define clear_pages() will continue to use
> the base value (a single page). Preemptible models don't need
> explicit invocations of cond_resched(), so the batch size is
> irrelevant to them.
> 
> The resultant performance depends on the kinds of optimizations
> available to the CPU for the region size being cleared. Two classes
> of optimizations:
> 
>   - clearing iteration costs are amortized over a range larger
>     than a single page.
>   - cacheline allocation elision (seen on AMD Zen models).

8MB is a big chunk of memory.
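To make the cost concrete, the batching scheme the quoted text describes can be sketched in userspace C as below. clear_pages(), cond_resched(), and the PROCESS_PAGES_NON_PREEMPT_BATCH value are stand-ins modeled on the patch description; the loop structure is my assumption, not the patch's actual code:

```c
#include <assert.h>

#define PAGE_SIZE 4096UL
/* 8MB worth of pages, per the patch description (assumed value). */
#define PROCESS_PAGES_NON_PREEMPT_BATCH ((8UL << 20) / PAGE_SIZE)

static unsigned long resched_calls;          /* counts simulated yields */

static void cond_resched(void) { resched_calls++; }

static void clear_pages(unsigned long addr, unsigned long npages)
{
    (void)addr; (void)npages;                /* no-op stand-in */
}

/* Clear npages starting at addr, yielding between batches so the
 * worst-case preemption latency is bounded by one batch. */
static void clear_pages_batched(unsigned long addr, unsigned long npages)
{
    while (npages) {
        unsigned long n = npages < PROCESS_PAGES_NON_PREEMPT_BATCH ?
                          npages : PROCESS_PAGES_NON_PREEMPT_BATCH;
        clear_pages(addr, n);
        cond_resched();                      /* bounded-latency point */
        addr += n * PAGE_SIZE;
        npages -= n;
    }
}
```

For a 64GB region that works out to 64GB / 8MB = 8192 batches, one cond_resched() per batch.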

> Testing a demand fault workload shows an improved baseline from the
> first optimization and a larger improvement when the region being
> cleared is large enough for the second optimization.
> 
> AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):

So we break out of the copy to run cond_resched() 8192 times?  This sounds
like a minor cost.

>   $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
> 
>                     page-at-a-time     contiguous clearing      change
> 
>                   (GB/s  +- %stdev)     (GB/s  +- %stdev)
> 
>    pg-sz=2MB       12.92  +- 2.55%        17.03  +-  0.70%       + 31.8%	preempt=*
> 
>    pg-sz=1GB       17.14  +- 2.27%        18.04  +-  1.05%       +  5.2%	preempt=none|voluntary
>    pg-sz=1GB       17.26  +- 1.24%        42.17  +-  4.21% [#]   +144.3%	preempt=full|lazy

And yet those 8192 cond_resched()'s have a huge impact on the
performance!  I find this result very surprising.  Is it explainable?

>  [#] Notice that we perform much better with preempt=full|lazy. As
>   mentioned above, preemptible models not needing explicit invocations
>   of cond_resched() allow clearing of the full extent (1GB) as a
>   single unit.
>   In comparison, the maximum extent used for preempt=none|voluntary
>   is PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
> 
>   The larger extent allows the processor to elide cacheline
>   allocation (on Milan the threshold is LLC-size=32MB.)

Is it this?
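If so, the extent selection could be pictured roughly as below: a preemptible model hands the clearing primitive the whole folio (1GB here, well past Milan's 32MB LLC threshold), while none|voluntary is capped at the batch size. This is a hypothetical userspace sketch; preemptible_model and clear_extent() are invented names standing in for the kernel's preemption-model check and the patch's extent logic:

```c
#include <assert.h>

#define PAGE_SIZE 4096UL
#define PROCESS_PAGES_NON_PREEMPT_BATCH ((8UL << 20) / PAGE_SIZE)

/* Stand-in for the kernel's preemption-model check:
 * nonzero for preempt=full|lazy, zero for preempt=none|voluntary. */
static int preemptible_model;

/* Largest extent (in pages) handed to the clearing primitive at once. */
static unsigned long clear_extent(unsigned long folio_pages)
{
    if (preemptible_model)                   /* no cond_resched() needed */
        return folio_pages;
    return folio_pages < PROCESS_PAGES_NON_PREEMPT_BATCH ?
           folio_pages : PROCESS_PAGES_NON_PREEMPT_BATCH;
}
```

With a 1GB folio that is 262144 pages in one extent under full|lazy, versus 2048-page (8MB) extents under none|voluntary -- below the 32MB point at which Milan elides cacheline allocation.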

> Also, as mentioned earlier, the baseline improvement is not specific
> to AMD Zen platforms. Intel Icelakex (pg-sz=2MB|1GB) sees an
> improvement similar to the Milan pg-sz=2MB workload above (~30%).
> 

