Message-ID: <874ipqexai.fsf@oracle.com>
Date: Mon, 15 Dec 2025 22:49:25 -0800
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, x86@...nel.org, david@...nel.org, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
mjguzik@...il.com, luto@...nel.org, peterz@...radead.org,
tglx@...utronix.de, willy@...radead.org, raghavendra.kt@....com,
chleroy@...nel.org, ioworker0@...il.com, boris.ostrovsky@...cle.com,
konrad.wilk@...cle.com
Subject: Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges
Andrew Morton <akpm@...ux-foundation.org> writes:
> On Mon, 15 Dec 2025 12:49:21 -0800 Ankur Arora <ankur.a.arora@...cle.com> wrote:
>
>> Clear contiguous page ranges in folio_zero_user() instead of clearing
>> a single page at a time. Exposing larger ranges enables extent-based
>> processor optimizations.
>>
>> However, because the underlying clearing primitives do not, or might
>> not be able to, call cond_resched() to check if preemption is
>> required, limit the worst-case preemption latency by doing the
>> clearing in batches of no more than PROCESS_PAGES_NON_PREEMPT_BATCH
>> pages.
>>
>> For architectures that define clear_pages(), we assume that the
>> clearing is fast and define PROCESS_PAGES_NON_PREEMPT_BATCH as 8MB
>> worth of pages. This should be large enough to allow the processor
>> to optimize the operation and yet small enough that we see reasonable
>> preemption latency when this optimization is not possible
>> (e.g. slow microarchitectures, memory bandwidth saturation).
>>
>> Architectures that don't define clear_pages() will continue to use
>> the base value (single page). And, preemptible models don't need
>> invocations of cond_resched(), so they don't care about the batch size.
>>
>> The resultant performance depends on the kinds of optimizations
>> available to the CPU for the region size being cleared. Two classes
>> of optimizations:
>>
>> - clearing iteration costs are amortized over a range larger
>> than a single page.
>> - cacheline allocation elision (seen on AMD Zen models).
>
> 8MB is a big chunk of memory.
>
>> Testing a demand fault workload shows an improved baseline from the
>> first optimization and a larger improvement when the region being
>> cleared is large enough for the second optimization.
>>
>> AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):
>
> So we break out of the copy to run cond_resched() 8192 times? This sounds
> like a minor cost.
>
>> $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
>>
>>                 page-at-a-time      contiguous clearing              change
>>               (GB/s +- %stdev)       (GB/s +- %stdev)
>>
>> pg-sz=2MB       12.92 +- 2.55%         17.03 +- 0.70%      + 31.8%   preempt=*
>>
>> pg-sz=1GB       17.14 +- 2.27%         18.04 +- 1.05%      +  5.2%   preempt=none|voluntary
>> pg-sz=1GB       17.26 +- 1.24%         42.17 +- 4.21% [#]  +144.3%   preempt=full|lazy
>
> And yet those 8192 cond_resched()'s have a huge impact on the
> performance! I find this result very surprising. Is it explainable?
I agree about this being surprising. On the 2MB extent, I still find the
~30% quite high, but I think a decent portion of it is (sketch below):
- on x86, the CPU is executing a single microcoded insn: REP; STOSB. And,
because it's doing it for a 2MB extent instead of a bunch of 4K extents,
it saves the microcoding costs (and I suspect it allows it to do some
range operation, which also helps.)
- the second reason (from Ingo) was again the per-iteration cost, which,
given all of the mitigations on x86, is quite substantial.
On the AMD systems I tested on, I think there's at least the cost
of RET misprediction in there.
(https://lore.kernel.org/lkml/Z_yzshvBmYiPrxU0@gmail.com/)
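To make the batching concrete, it looks roughly like this (a minimal
sketch, not the literal patch; clear_contig_resched() is a made-up name,
and I'm assuming a clear_pages(addr, npages) primitive that clears
npages contiguous pages):

  /*
   * Sketch only: clear a contiguous extent, breaking out every
   * PROCESS_PAGES_NON_PREEMPT_BATCH pages to call cond_resched()
   * under preempt=none|voluntary.
   */
  static void clear_contig_resched(void *addr, unsigned long npages)
  {
          while (npages) {
                  unsigned long batch = npages;

                  /*
                   * Preemptible models (full|lazy) don't need explicit
                   * rescheduling points, so take the whole extent in
                   * one call.
                   */
                  if (!preempt_model_preemptible())
                          batch = min_t(unsigned long, batch,
                                        PROCESS_PAGES_NON_PREEMPT_BATCH);

                  clear_pages(addr, batch);
                  addr += batch * PAGE_SIZE;
                  npages -= batch;

                  cond_resched();
          }
  }

For a 64GB region with an 8MB batch that's the ~8192 cond_resched()
calls you mention; the cost isn't in cond_resched() itself but in the
smaller extent handed to the hardware each iteration.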
>> [#] Notice that we perform much better with preempt=full|lazy. As
>> mentioned above, preemptible models, not needing explicit invocations
>> of cond_resched(), allow clearing of the full extent (1GB) as a
>> single unit.
>> In comparison the maximum extent used for preempt=none|voluntary is
>> PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
>>
>> The larger extent allows the processor to elide cacheline
>> allocation (on Milan the threshold is LLC-size=32MB.)
>
> It is this?
Yeah, I think so. For size >= 32MB, the microcode can really just elide
cacheline allocation, and with the foreknowledge of the extent can perhaps
optimize cache coherence traffic (this last one is my speculation).
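The microcode gets that foreknowledge because the whole extent goes down
as one string operation. On x86, clear_pages() boils down to something
like the following (again a sketch, not the actual arch/x86 code from
the series):

  static inline void clear_pages(void *addr, unsigned long npages)
  {
          unsigned long len = npages * PAGE_SIZE;

          /*
           * A single REP STOSB over the whole extent: RCX carries the
           * full byte count, so the CPU sees the extent size up front
           * and can decide not to allocate cachelines for it.
           */
          asm volatile("rep stosb"
                       : "+c" (len), "+D" (addr)
                       : "a" (0)
                       : "memory");
  }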
On cacheline allocation elision, compare the L1-dcache-loads in the two
versions below:
pg-sz=1GB:
- 9,250,034,512 cycles # 2.418 GHz ( +- 0.43% ) (46.16%)
- 544,878,976 instructions # 0.06 insn per cycle
- 2,331,332,516 L1-dcache-loads # 609.471 M/sec ( +- 0.03% ) (46.16%)
- 1,075,122,960 L1-dcache-load-misses # 46.12% of all L1-dcache accesses ( +- 0.01% ) (46.15%)
+ 3,688,681,006 cycles # 2.420 GHz ( +- 3.48% ) (46.01%)
+ 10,979,121 instructions # 0.00 insn per cycle
+ 31,829,258 L1-dcache-loads # 20.881 M/sec ( +- 4.92% ) (46.34%)
+ 13,677,295 L1-dcache-load-misses # 42.97% of all L1-dcache accesses ( +- 6.15% ) (46.32%)
(From an earlier version of this series: https://lore.kernel.org/lkml/20250414034607.762653-5-ankur.a.arora@oracle.com/)
Maybe I should have kept it in this commit :).
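In case it helps with reproducing: those are the usual perf stat
counters over the bench run. The invocation would have been something
along these lines (reconstructed; the exact command isn't in this
thread):

  $ perf stat -r 5 -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses \
        perf bench mem mmap -p 1GB -f demand -s 64GB -l 5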
>> Also as mentioned earlier, the baseline improvement is not specific to
>> AMD Zen platforms. Intel Icelakex (pg-sz=2MB|1GB) sees a similar
>> improvement as the Milan pg-sz=2MB workload above (~30%).
>>
--
ankur