Message-Id: <20251216071250.e49ecf7490acf7f377dbfdc0@linux-foundation.org>
Date: Tue, 16 Dec 2025 07:12:50 -0800
From: Andrew Morton <akpm@...ux-foundation.org>
To: Ankur Arora <ankur.a.arora@...cle.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org,
david@...nel.org, bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com,
mingo@...hat.com, mjguzik@...il.com, luto@...nel.org, peterz@...radead.org,
tglx@...utronix.de, willy@...radead.org, raghavendra.kt@....com,
chleroy@...nel.org, ioworker0@...il.com, boris.ostrovsky@...cle.com,
konrad.wilk@...cle.com
Subject: Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page
ranges

On Mon, 15 Dec 2025 22:49:25 -0800 Ankur Arora <ankur.a.arora@...cle.com> wrote:
> >> [#] Notice that we perform much better with preempt=full|lazy. As
> >> mentioned above, preemptible models not needing explicit invocations
> >> of cond_resched() allow clearing of the full extent (1GB) as a
> >> single unit.
> >> In comparison the maximum extent used for preempt=none|voluntary is
> >> PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
> >>
> >> The larger extent allows the processor to elide cacheline
> >> allocation (on Milan the threshold is LLC-size=32MB.)
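
A rough sketch of the batching described above (not the code from the
series; clear_pages() and the page arithmetic are simplified stand-ins,
the preempt_model_preemptible() test is just one way to express
"full|lazy vs none|voluntary", and PROCESS_PAGES_NON_PREEMPT_BATCH is
assumed here to be in units of pages):

	static void clear_user_extent(struct page *page, unsigned int npages)
	{
		/*
		 * Fully preemptible models can take the whole extent in
		 * one go; the others are capped at the non-preempt batch.
		 */
		unsigned int batch = preempt_model_preemptible() ?
				     npages : PROCESS_PAGES_NON_PREEMPT_BATCH;

		while (npages) {
			unsigned int n = min(npages, batch);

			/* Clear n contiguous pages, e.g. via rep stosb. */
			clear_pages(page_address(page), n);
			page += n;
			npages -= n;

			/* Effectively a no-op under preempt=full|lazy. */
			cond_resched();
		}
	}
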
> >
> > It is this?
>
> Yeah I think so. For size >= 32MB, the microcode can really just elide
> cacheline allocation, and with foreknowledge of the extent can perhaps
> optimize cache coherence traffic (this last one is my speculation).
>
> On cacheline allocation elision, compare the L1-dcache-loads in the two
> versions below:
>
> pg-sz=1GB:
> - 9,250,034,512 cycles # 2.418 GHz ( +- 0.43% ) (46.16%)
> - 544,878,976 instructions # 0.06 insn per cycle
> - 2,331,332,516 L1-dcache-loads # 609.471 M/sec ( +- 0.03% ) (46.16%)
> - 1,075,122,960 L1-dcache-load-misses # 46.12% of all L1-dcache accesses ( +- 0.01% ) (46.15%)
>
> + 3,688,681,006 cycles # 2.420 GHz ( +- 3.48% ) (46.01%)
> + 10,979,121 instructions # 0.00 insn per cycle
> + 31,829,258 L1-dcache-loads # 20.881 M/sec ( +- 4.92% ) (46.34%)
> + 13,677,295 L1-dcache-load-misses # 42.97% of all L1-dcache accesses ( +- 6.15% ) (46.32%)
>

That says L1 d-cache loads went from 600 million/sec down to 20
million/sec when using 32MB chunks?

Do you know what happens to preemption latency if you increase that
chunk size from 8MB to 32MB? At 42GB/sec, 32MB will take less than a
millisecond, yes? I'm not aware of us really having any latency
targets in these preemption modes, but 1 millisecond sounds pretty
good.
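
As a back-of-the-envelope check (taking the 42GB/sec figure at face
value): 32MB / 42GB/sec is roughly 0.76ms, versus roughly 0.19ms for
the current 8MB batch, so the change would be about a 4x larger
worst-case gap between cond_resched() calls, still under a
millisecond.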