Message-ID: <87fr994hot.fsf@oracle.com>
Date: Wed, 17 Dec 2025 00:48:50 -0800
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, x86@...nel.org, david@...nel.org, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
mjguzik@...il.com, luto@...nel.org, peterz@...radead.org,
tglx@...utronix.de, willy@...radead.org, raghavendra.kt@....com,
chleroy@...nel.org, ioworker0@...il.com, boris.ostrovsky@...cle.com,
konrad.wilk@...cle.com
Subject: Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges
Andrew Morton <akpm@...ux-foundation.org> writes:
> On Mon, 15 Dec 2025 22:49:25 -0800 Ankur Arora <ankur.a.arora@...cle.com> wrote:
>
>> >> [#] Notice that we perform much better with preempt=full|lazy. As
>> >> mentioned above, preemptible models not needing explicit invocations
>> >> of cond_resched() allow clearing of the full extent (1GB) as a
>> >> single unit.
>> >> In comparison the maximum extent used for preempt=none|voluntary is
>> >> PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
>> >>
>> >> The larger extent allows the processor to elide cacheline
>> >> allocation (on Milan the threshold is LLC-size=32MB.)
>> >
>> > It is this?
>>
>> Yeah I think so. For size >= 32MB, the microcode can really just elide
>> cacheline allocation, and with the foreknowledge of the extent can perhaps
>> optimize cache coherence traffic (this last one is my speculation).
>>
>> On cacheline allocation elision, compare the L1-dcache-loads in the two
>> versions below:
>>
>> pg-sz=1GB:
>> - 9,250,034,512 cycles # 2.418 GHz ( +- 0.43% ) (46.16%)
>> - 544,878,976 instructions # 0.06 insn per cycle
>> - 2,331,332,516 L1-dcache-loads # 609.471 M/sec ( +- 0.03% ) (46.16%)
>> - 1,075,122,960 L1-dcache-load-misses # 46.12% of all L1-dcache accesses ( +- 0.01% ) (46.15%)
>>
>> + 3,688,681,006 cycles # 2.420 GHz ( +- 3.48% ) (46.01%)
>> + 10,979,121 instructions # 0.00 insn per cycle
>> + 31,829,258 L1-dcache-loads # 20.881 M/sec ( +- 4.92% ) (46.34%)
>> + 13,677,295 L1-dcache-load-misses # 42.97% of all L1-dcache accesses ( +- 6.15% ) (46.32%)
>>
>
> That says L1 d-cache loads went from 600 million/sec down to 20
> million/sec when using 32MB chunks?
Sorry, I should have mentioned that that run was with preempt=full/lazy,
for which the chunk size is the whole page (a GB page in that case).

The context for 32MB was that it is the LLC size for these systems, and
from observed behaviour the cacheline allocation elision optimization
only happens once the chunk size is larger than that.
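Roughly, the clearing loop has the following shape. (An illustrative
sketch only, not the actual patch code; preempt_model_preemptible(),
cond_resched(), memset() and PROCESS_PAGES_NON_PREEMPT_BATCH are the
only real names in it.)

    /*
     * Clear a region in batches: preemptible models (full/lazy) take
     * the whole extent as a single unit; none/voluntary are capped at
     * PROCESS_PAGES_NON_PREEMPT_BATCH (8MB), with an explicit
     * scheduling point after each batch.
     */
    static void clear_region(void *addr, unsigned long size)
    {
            unsigned long batch, done, len;

            batch = preempt_model_preemptible() ?
                            size : PROCESS_PAGES_NON_PREEMPT_BATCH;

            for (done = 0; done < size; done += len) {
                    len = min(batch, size - done);
                    memset(addr + done, 0, len);
                    cond_resched();
            }
    }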
> Do you know what happens to preemption latency if you increase that
> chunk size from 8MB to 32MB?
So, I gathered some numbers on a Zen4/Genoa system. The ones above are
from Zen3/Milan.
region-sz=64GB, loop-count=3 (total region-size=3*64GB):

                             Bandwidth     L1-dcache-loads
 pg-sz=2MB, batch-sz= 8MB    25.10 GB/s    6,745,859,855   # 2.00  L1-dcache-loads/64B
                                                           # (pg-sz=2MB for context)
 pg-sz=1GB, batch-sz= 8MB    26.88 GB/s    6,469,900,728   # 2.00  L1-dcache-loads/64B
 pg-sz=1GB, batch-sz=32MB    38.69 GB/s    2,559,249,546   # 0.79  L1-dcache-loads/64B
 pg-sz=1GB, batch-sz=64MB    46.91 GB/s      919,539,544   # 0.28  L1-dcache-loads/64B
 pg-sz=1GB, batch-sz= 1GB    58.68 GB/s       79,458,439   # 0.024 L1-dcache-loads/64B
All of these are for preempt=none, and with boost=0. (With boost=1 the
BW increases by ~25%.)
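Numbers of this kind can be gathered by timing something of the shape
below (an illustrative sketch, not the actual harness; the mmap flags
and sizes are assumptions on my part). A single store per 1GB page makes
the kernel fault in and zero the whole page via folio_zero_user():

    /*
     * Illustrative only. Needs 1GB hugepages reserved, e.g.
     * "hugepagesz=1G hugepages=64" on the kernel command line.
     */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <time.h>

    #ifndef MAP_HUGE_1GB
    #define MAP_HUGE_1GB    (30 << 26)      /* log2(1GB) << MAP_HUGE_SHIFT */
    #endif

    int main(void)
    {
            const unsigned long region_sz = 64UL << 30;     /* region-sz=64GB */
            struct timespec t0, t1;
            unsigned long off;
            double secs;

            char *p = mmap(NULL, region_sz, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
                           MAP_HUGE_1GB, -1, 0);
            if (p == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            clock_gettime(CLOCK_MONOTONIC, &t0);
            /* One store per page: each write fault zeroes a full 1GB page. */
            for (off = 0; off < region_sz; off += 1UL << 30)
                    p[off] = 1;
            clock_gettime(CLOCK_MONOTONIC, &t1);

            secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
            printf("%.2f GB/s\n", region_sz / (1UL << 30) / secs);
            return 0;
    }

(The L1-dcache-loads column would then be the corresponding count from
wrapping the run with perf stat -e L1-dcache-loads.)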
So, I wasn't quite right about LLC-size=32MB being the threshold for
this optimization. There is a change in behaviour at that point, but
bandwidth continues to improve well beyond it.
(Ideally this threshold would be discoverable via a processor MSR. That
way we could use it for 2MB pages as well. Oh well.)
> At 42GB/sec, 32MB will take less than a
> millisecond, yes? I'm not aware of us really having any latency
> targets in these preemption modes, but 1 millisecond sounds pretty
> good.
Agreed. The only complaint threshold I see is 100ms (the default value
of sysctl_resched_latency_warn_ms), which is pretty far from ~1ms.
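(Back of the envelope: 32MB / 42GB/s = 32 / (42 * 1024) s ~= 0.75ms, so
indeed comfortably under a millisecond.)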
And having a threshold of 32MB might benefit other applications since
we won't be discarding their cachelines in favour of filling up the
cache with zeroes.
I think the only problem cases might be slow uarchs, and workloads where
the memory bus is saturated, both of which could dilate the preemption
latency. But even if the operation then takes, say, ~20ms, that should
still leave us with a reasonably large margin.

(And any latency-sensitive users are probably not running with
preempt=none/voluntary.)
--
ankur