Message-Id: <20251216071250.e49ecf7490acf7f377dbfdc0@linux-foundation.org>
Date: Tue, 16 Dec 2025 07:12:50 -0800
From: Andrew Morton <akpm@...ux-foundation.org>
To: Ankur Arora <ankur.a.arora@...cle.com>
Cc: linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org,
david@...nel.org, bp@...en8.de, dave.hansen@...ux.intel.com, hpa@...or.com,
mingo@...hat.com, mjguzik@...il.com, luto@...nel.org, peterz@...radead.org,
tglx@...utronix.de, willy@...radead.org, raghavendra.kt@....com,
chleroy@...nel.org, ioworker0@...il.com, boris.ostrovsky@...cle.com,
konrad.wilk@...cle.com
Subject: Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page
ranges

On Mon, 15 Dec 2025 22:49:25 -0800 Ankur Arora <ankur.a.arora@...cle.com> wrote:
> >> [#] Notice that we perform much better with preempt=full|lazy. As
> >> mentioned above, preemptible models not needing explicit invocations
> >> of cond_resched() allow clearing of the full extent (1GB) as a
> >> single unit.
> >> In comparison the maximum extent used for preempt=none|voluntary is
> >> PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
> >>
> >> The larger extent allows the processor to elide cacheline
> >> allocation (on Milan the threshold is LLC-size=32MB.)
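
A rough sketch of the batching described above (not the code from the
series; clear_pages() and the page arithmetic are simplified stand-ins,
the preempt_model_preemptible() test is just one way to express
"full|lazy vs none|voluntary", and PROCESS_PAGES_NON_PREEMPT_BATCH is
assumed here to be in units of pages):

	static void clear_user_extent(struct page *page, unsigned int npages)
	{
		/*
		 * Fully preemptible models can take the whole extent in
		 * one go; the others are capped at the non-preempt batch.
		 */
		unsigned int batch = preempt_model_preemptible() ?
				     npages : PROCESS_PAGES_NON_PREEMPT_BATCH;

		while (npages) {
			unsigned int n = min(npages, batch);

			/* Clear n contiguous pages, e.g. via rep stosb. */
			clear_pages(page_address(page), n);
			page += n;
			npages -= n;

			/* Effectively a no-op under preempt=full|lazy. */
			cond_resched();
		}
	}
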
> >
> > It is this?
>
> Yeah I think so. For size >= 32MB, the microcode can really just elide
> cacheline allocation, and with foreknowledge of the extent can perhaps
> optimize cache coherence traffic (this last one is my speculation).
>
> On cacheline allocation elision, compare the L1-dcache-loads in the two
> versions below:
>
> pg-sz=1GB:
> - 9,250,034,512 cycles # 2.418 GHz ( +- 0.43% ) (46.16%)
> - 544,878,976 instructions # 0.06 insn per cycle
> - 2,331,332,516 L1-dcache-loads # 609.471 M/sec ( +- 0.03% ) (46.16%)
> - 1,075,122,960 L1-dcache-load-misses # 46.12% of all L1-dcache accesses ( +- 0.01% ) (46.15%)
>
> + 3,688,681,006 cycles # 2.420 GHz ( +- 3.48% ) (46.01%)
> + 10,979,121 instructions # 0.00 insn per cycle
> + 31,829,258 L1-dcache-loads # 20.881 M/sec ( +- 4.92% ) (46.34%)
> + 13,677,295 L1-dcache-load-misses # 42.97% of all L1-dcache accesses ( +- 6.15% ) (46.32%)
>

That says L1 d-cache loads went from 600 million/sec down to 20
million/sec when using 32MB chunks?

Do you know what happens to preemption latency if you increase that
chunk size from 8MB to 32MB? At 42GB/sec, 32MB will take less than a
millisecond, yes? I'm not aware of us really having any latency
targets in these preemption modes, but 1 millisecond sounds pretty
good.
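
As a back-of-the-envelope check (taking the 42GB/sec figure at face
value): 32MB / 42GB/sec is roughly 0.76ms, versus roughly 0.19ms for
the current 8MB batch, so the change would be about a 4x larger
worst-case gap between cond_resched() calls, still under a
millisecond.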