Open Source and information security mailing list archives
 
Message-ID: <87fr994hot.fsf@oracle.com>
Date: Wed, 17 Dec 2025 00:48:50 -0800
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, x86@...nel.org, david@...nel.org, bp@...en8.de,
        dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
        mjguzik@...il.com, luto@...nel.org, peterz@...radead.org,
        tglx@...utronix.de, willy@...radead.org, raghavendra.kt@....com,
        chleroy@...nel.org, ioworker0@...il.com, boris.ostrovsky@...cle.com,
        konrad.wilk@...cle.com
Subject: Re: [PATCH v10 7/8] mm, folio_zero_user: support clearing page ranges


Andrew Morton <akpm@...ux-foundation.org> writes:

> On Mon, 15 Dec 2025 22:49:25 -0800 Ankur Arora <ankur.a.arora@...cle.com> wrote:
>
>> >>  [#] Notice that we perform much better with preempt=full|lazy. As
>> >>   mentioned above, preemptible models not needing explicit invocations
>> >>   of cond_resched() allow clearing of the full extent (1GB) as a
>> >>   single unit.
>> >>   In comparison the maximum extent used for preempt=none|voluntary is
>> >>   PROCESS_PAGES_NON_PREEMPT_BATCH (8MB).
>> >>
>> >>   The larger extent allows the processor to elide cacheline
>> >>   allocation (on Milan the threshold is LLC-size=32MB.)
>> >
>> > It is this?
>>
>> Yeah I think so. For size >= 32MB, the microcode can really just elide
>> cacheline allocation, and with foreknowledge of the extent can perhaps
>> optimize cache coherence traffic (this last one is my speculation).
>>
>> On cacheline allocation elision, compare the L1-dcache-load in the two versions
>> below:
>>
>> pg-sz=1GB:
>>   -  9,250,034,512      cycles                           #    2.418 GHz                         ( +-  0.43% )  (46.16%)
>>   -    544,878,976      instructions                     #    0.06  insn per cycle
>>   -  2,331,332,516      L1-dcache-loads                  #  609.471 M/sec                       ( +-  0.03% )  (46.16%)
>>   -  1,075,122,960      L1-dcache-load-misses            #   46.12% of all L1-dcache accesses   ( +-  0.01% )  (46.15%)
>>
>>   +  3,688,681,006      cycles                           #    2.420 GHz                         ( +-  3.48% )  (46.01%)
>>   +     10,979,121      instructions                     #    0.00  insn per cycle
>>   +     31,829,258      L1-dcache-loads                  #   20.881 M/sec                       ( +-  4.92% )  (46.34%)
>>   +     13,677,295      L1-dcache-load-misses            #   42.97% of all L1-dcache accesses   ( +-  6.15% )  (46.32%)
>>
>
> That says L1 d-cache loads went from 600 million/sec down to 20
> million/sec when using 32MB chunks?

Sorry, I should have mentioned that that run was with preempt=full/lazy.
For those models the chunk size is the whole page (a 1GB page in that case).

The context for 32MB was that it's the LLC size on these systems. And,
from observed behaviour, the cacheline allocation elision only happens
when the chunk size used is larger than that.

> Do you know what happens to preemption latency if you increase that
> chunk size from 8MB to 32MB?

So, I gathered some numbers on a Zen4/Genoa system. The ones above are
from Zen3/Milan.

region-sz=64GB, loop-count=3 (total region-size=3*64GB):

                                Bandwidth    L1-dcache-loads

    pg-sz=2MB, batch-sz= 8MB   25.10 GB/s    6,745,859,855  # 2.00 L1-dcache-loads/64B
       # pg-sz=2MB for context

    pg-sz=1GB, batch-sz= 8MB   26.88 GB/s    6,469,900,728  # 2.00 L1-dcache-loads/64B
    pg-sz=1GB, batch-sz=32MB   38.69 GB/s    2,559,249,546  # 0.79 L1-dcache-loads/64B
    pg-sz=1GB, batch-sz=64MB   46.91 GB/s      919,539,544  # 0.28 L1-dcache-loads/64B

    pg-sz=1GB, batch-sz= 1GB   58.68 GB/s       79,458,439  # 0.024 L1-dcache-loads/64B

All of these are for preempt=none, and with boost=0. (With boost=1 the
BW increases by ~25%.)

So, I wasn't quite right about LLC-size=32MB being the threshold for
this optimization. There is a change in behaviour at that point, but the
elision keeps improving beyond it.
(Ideally this threshold would be exposed via a processor MSR. That way
we could use this for 2MB pages as well. Oh well.)

> At 42GB/sec, 32MB will take less than a
> millisecond, yes?  I'm not aware of us really having any latency
> targets in these preemption modes, but 1 millisecond sounds pretty
> good.

Agreed. The only complaint threshold I see is 100ms (default value of
sysctl_resched_latency_warn_ms) which is pretty far from ~1ms.

And having a threshold of 32MB might benefit other applications since
we won't be discarding their cachelines in favour of filling up the
cache with zeroes.

I think the only problem cases might be slow uarchs, and workloads where
the memory bus is saturated, either of which might dilate the preemption
latency.

And even if the operation takes, say, ~20ms, that should still leave us
with a reasonably large margin.
(And any latency-sensitive users are probably not running with
preempt=none/voluntary.)

--
ankur
