Message-ID: <874ipqexai.fsf@oracle.com>
Date: Mon, 15 Dec 2025 22:49:25 -0800
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, x86@...nel.org, david@...nel.org, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
mjguzik@...il.com, luto@...nel.org, peterz@...radead.org,
tglx@...utronix.de, willy@...radead.org, raghavendra.kt@....com,
chleroy@...nel.org, ioworker0@...il.com, boris.ostrovsky@...cle.com,
konrad.wilk@...cle.com
Subject: Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges
Andrew Morton <akpm@...ux-foundation.org> writes:
> On Mon, 15 Dec 2025 12:49:21 -0800 Ankur Arora <ankur.a.arora@...cle.com> wrote:
>
>> Clear contiguous page ranges in folio_zero_user() instead of clearing
>> a single page at a time. Exposing larger ranges enables extent-based
>> processor optimizations.
>>
>> However, because the underlying clearing primitives do not, or might
>> not be able to, call cond_resched() to check if preemption is
>> required, limit the worst-case preemption latency by doing the
>> clearing in batches of no more than PROCESS_PAGES_NON_PREEMPT_BATCH
>> pages.
>>
>> For architectures that define clear_pages(), we assume that the
>> clearing is fast and define PROCESS_PAGES_NON_PREEMPT_BATCH as 8MB
>> worth of pages. This should be large enough to allow the processor
>> to optimize the operation and yet small enough that we see reasonable
>> preemption latency when this optimization is not possible
>> (e.g. slow microarchitectures, memory bandwidth saturation).
>>
>> Architectures that don't define clear_pages() will continue to use
>> the base value (single page). And, preemptible models don't need
>> invocations of cond_resched(), so they don't care about the batch size.
>>
>> The resultant performance depends on the kinds of optimizations
>> available to the CPU for the region size being cleared. Two classes
>> of optimizations:
>>
>> - clearing iteration costs are amortized over a range larger
>> than a single page.
>> - cacheline allocation elision (seen on AMD Zen models).
>
> 8MB is a big chunk of memory.
>
>> Testing a demand fault workload shows an improved baseline from the
>> first optimization and a larger improvement when the region being
>> cleared is large enough for the second optimization.
>>
>> AMD Milan (EPYC 7J13, boost=0, region=64GB on the local NUMA node):
>
> So we break out of the copy to run cond_resched() 8192 times? This sounds
> like a minor cost.
>
>> $ perf bench mem mmap -p $pg-sz -f demand -s 64GB -l 5
>>
>>                 page-at-a-time      contiguous clearing              change
>>               (GB/s +- %stdev)       (GB/s +- %stdev)
>>
>> pg-sz=2MB       12.92 +- 2.55%         17.03 +- 0.70%      + 31.8%   preempt=*
>>
>> pg-sz=1GB       17.14 +- 2.27%         18.04 +- 1.05%      +  5.2%   preempt=none|voluntary
>> pg-sz=1GB       17.26 +- 1.24%         42.17 +- 4.21% [#]  +144.3%   preempt=full|lazy
>
> And yet those 8192 cond_resched()'s have a huge impact on the
> performance! I find this result very surprising. Is it explainable?
I agree about this being surprising. On the 2MB extent, I still find the
~30% quite high, but I think a decent portion of it is (sketch below):
- on x86, the CPU is executing a single microcoded insn: REP; STOSB. And,
because it's doing it for a 2MB extent instead of a bunch of 4K extents,
it saves the microcoding costs (and I suspect it allows it to do some
range operation, which also helps.)
- the second reason (from Ingo) was again the per-iteration cost, which,
given all of the mitigations on x86, is quite substantial.
On the AMD systems I tested on, I think there's at least the cost
of RET misprediction in there.
(https://lore.kernel.org/lkml/Z_yzshvBmYiPrxU0@gmail.com/)
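To make the batching concrete, it looks roughly like this (a minimal
sketch, not the literal patch; clear_contig_resched() is a made-up name,
and I'm assuming a clear_pages(addr, npages) primitive that clears
npages contiguous pages):

  /*
   * Sketch only: clear a contiguous extent, breaking out every
   * PROCESS_PAGES_NON_PREEMPT_BATCH pages to call cond_resched()
   * under preempt=none|voluntary.
   */
  static void clear_contig_resched(void *addr, unsigned long npages)
  {
          while (npages) {
                  unsigned long batch = npages;

                  /*
                   * Preemptible models (full|lazy) don't need explicit
                   * rescheduling points, so take the whole extent in
                   * one call.
                   */
                  if (!preempt_model_preemptible())
                          batch = min_t(unsigned long, batch,
                                        PROCESS_PAGES_NON_PREEMPT_BATCH);

                  clear_pages(addr, batch);
                  addr += batch * PAGE_SIZE;
                  npages -= batch;

                  cond_resched();
          }
  }

For a 64GB region with an 8MB batch that's the ~8192 cond_resched()
calls you mention; the cost isn't in cond_resched() itself but in the
smaller extent handed to the hardware each iteration.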
>> [#] Notice that we perform much better with preempt=full|lazy. As
>> mentioned above, preemptible models, not needing explicit invocations
>> of cond_resched(), allow clearing of the full extent (1GB) as a
>> single unit.
>> In comparison the maximum extent used for preempt=none|voluntary is
>> PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
>>
>> The larger extent allows the processor to elide cacheline
>> allocation (on Milan the threshold is LLC-size=32MB.)
>
> It is this?
Yeah, I think so. For size >= 32MB, the microcode can really just elide
cacheline allocation, and with the foreknowledge of the extent can perhaps
optimize cache coherence traffic (this last one is my speculation).
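The microcode gets that foreknowledge because the whole extent goes down
as one string operation. On x86, clear_pages() boils down to something
like the following (again a sketch, not the actual arch/x86 code from
the series):

  static inline void clear_pages(void *addr, unsigned long npages)
  {
          unsigned long len = npages * PAGE_SIZE;

          /*
           * A single REP STOSB over the whole extent: RCX carries the
           * full byte count, so the CPU sees the extent size up front
           * and can decide not to allocate cachelines for it.
           */
          asm volatile("rep stosb"
                       : "+c" (len), "+D" (addr)
                       : "a" (0)
                       : "memory");
  }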
On cacheline allocation elision, compare the L1-dcache-loads in the two
versions below:
pg-sz=1GB:
- 9,250,034,512 cycles # 2.418 GHz ( +- 0.43% ) (46.16%)
- 544,878,976 instructions # 0.06 insn per cycle
- 2,331,332,516 L1-dcache-loads # 609.471 M/sec ( +- 0.03% ) (46.16%)
- 1,075,122,960 L1-dcache-load-misses # 46.12% of all L1-dcache accesses ( +- 0.01% ) (46.15%)
+ 3,688,681,006 cycles # 2.420 GHz ( +- 3.48% ) (46.01%)
+ 10,979,121 instructions # 0.00 insn per cycle
+ 31,829,258 L1-dcache-loads # 20.881 M/sec ( +- 4.92% ) (46.34%)
+ 13,677,295 L1-dcache-load-misses # 42.97% of all L1-dcache accesses ( +- 6.15% ) (46.32%)
(From an earlier version of this series: https://lore.kernel.org/lkml/20250414034607.762653-5-ankur.a.arora@oracle.com/)
Maybe I should have kept it in this commit :).
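In case it helps with reproducing: those are the usual perf stat
counters over the bench run. The invocation would have been something
along these lines (reconstructed; the exact command isn't in this
thread):

  $ perf stat -r 5 -e cycles,instructions,L1-dcache-loads,L1-dcache-load-misses \
        perf bench mem mmap -p 1GB -f demand -s 64GB -l 5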
>> Also as mentioned earlier, the baseline improvement is not specific to
>> AMD Zen platforms. Intel Icelakex (pg-sz=2MB|1GB) sees a similar
>> improvement as the Milan pg-sz=2MB workload above (~30%).
>>
--
ankur