Message-ID: <87jz7cq9wh.fsf@oracle.com>
Date: Tue, 22 Apr 2025 12:22:06 -0700
From: Ankur Arora <ankur.a.arora@...cle.com>
To: Raghavendra K T <raghavendra.kt@....com>
Cc: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, x86@...nel.org, torvalds@...ux-foundation.org,
akpm@...ux-foundation.org, bp@...en8.de, dave.hansen@...ux.intel.com,
hpa@...or.com, mingo@...hat.com, luto@...nel.org, peterz@...radead.org,
paulmck@...nel.org, rostedt@...dmis.org, tglx@...utronix.de,
willy@...radead.org, jon.grimm@....com, bharata@....com,
boris.ostrovsky@...cle.com, konrad.wilk@...cle.com
Subject: Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing

Raghavendra K T <raghavendra.kt@....com> writes:

> On 4/14/2025 9:16 AM, Ankur Arora wrote:
>> This series adds multi-page clearing for hugepages. It is a rework
>> of [1], which took a detour through PREEMPT_LAZY [2].
>>
>> Why multi-page clearing? It improves upon the current
>> page-at-a-time approach by providing the processor with a hint as
>> to the real region size. A processor could use this hint to, for
>> instance, elide cacheline allocation when clearing a large region.
>>
>> This particular optimization is performed by REP; STOS on AMD Zen,
>> where regions larger than the L3 size are cleared with non-temporal
>> stores. This results in significantly better performance.
>>
>> We also see a performance improvement in cases where this
>> optimization is unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB on
>> Intel): REP; STOS is typically microcoded, so its setup cost can
>> now be amortized over larger regions, and the hint allows the
>> hardware prefetcher to do a better job.
>> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>>
>>              mm/folio_zero_user    x86/folio_zero_user     change
>>              (GB/s +- stddev)      (GB/s +- stddev)
>>  pg-sz=1GB   16.51 +- 0.54%        42.80 +- 3.48%         + 159.2%
>>  pg-sz=2MB   11.89 +- 0.78%        16.12 +- 0.12%         +  35.5%
>>
>> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>>
>>              mm/folio_zero_user    x86/folio_zero_user     change
>>              (GB/s +- stddev)      (GB/s +- stddev)
>>  pg-sz=1GB    8.01 +- 0.24%        11.26 +- 0.48%         + 40.57%
>>  pg-sz=2MB    7.95 +- 0.30%        10.90 +- 0.26%         + 37.10%
>>
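
To make the mechanism concrete, here is a minimal userspace sketch of
the idea above (illustrative only, not the actual patch: clear_pages()
and the fixed PAGE_SIZE are stand-ins, it is x86-64 only, and the real
kernel path additionally has to care about preemption etc.):

  #define PAGE_SIZE 4096UL

  /*
   * One REP; STOSB over the whole extent: the byte count in RCX tells
   * the CPU the real region size, so the microcode setup cost is paid
   * once and an implementation (e.g. AMD Zen) can switch to
   * non-temporal stores for regions larger than the L3.
   */
  static inline void clear_pages(void *addr, unsigned long npages)
  {
          unsigned long len = npages * PAGE_SIZE;

          asm volatile("rep stosb"
                       : "+D" (addr), "+c" (len)
                       : "a" (0)             /* AL = fill byte 0 */
                       : "memory");
  }

  /*
   * Page-at-a-time baseline: each 4KiB clear looks independent, so
   * the CPU never sees the full extent.
   */
  static void clear_pages_baseline(void *addr, unsigned long npages)
  {
          unsigned long i;

          for (i = 0; i < npages; i++)
                  clear_pages((char *)addr + i * PAGE_SIZE, 1);
  }

The only difference between the two is the length the CPU gets to
see, which is exactly the hint this series is about.
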
> [...]
>
> Hello Ankur,
>
> Thank you for the patches. I was able to test them briefly with lazy
> preempt mode.

Thanks for testing.

> (I do understand that there could be a lot of churn based on Ingo's,
> Mateusz's, and others' comments.) But here goes:
>
> SUT: AMD EPYC 9B24 (Genoa) preempt=lazy
>
> metric = time taken in seconds (lower is better); total SIZE=64GB
>
>             mm/folio_zero_user   x86/folio_zero_user    change
> pg-sz=1GB   2.47044  +- 0.38%    1.060877 +- 0.07%     -57.06%
> pg-sz=2MB   5.098403 +- 0.01%    2.52015  +- 0.36%     -50.57%

Just to translate it into the same units I was using above:

            mm/folio_zero_user     x86/folio_zero_user
pg-sz=1GB   25.91 GB/s +- 0.38%    60.37 GB/s +- 0.07%
pg-sz=2MB   12.57 GB/s +- 0.01%    25.39 GB/s +- 0.36%
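
(For reference, the conversion is GB/s = SIZE / elapsed time; e.g.,
the mm/folio_zero_user pg-sz=1GB case is 64 GB / 2.47044 s ≈ 25.91
GB/s.)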

That's a decent improvement over Milan. Btw, are you using boost=1?

Also, any idea why the huge delta between the mm/folio_zero_user
pg-sz=2MB and pg-sz=1GB cases? Both of these clear a single 4K page
at a time, so the huge delta is a little head-scratching. There's a
gap on Milan as well, but it is much smaller.
Thanks
Ankur
> More details (1G example run):
>
> base kernel = 6.14 (preempt = lazy)
>
> mm/folio_zero_user
>
> Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):
>
>        2,476.47 msec task-clock               # 1.002 CPUs utilized              ( +- 0.39% )
>               5      context-switches         # 2.025 /sec                       ( +- 29.70% )
>               2      cpu-migrations           # 0.810 /sec                       ( +- 21.15% )
>             202      page-faults              # 81.806 /sec                      ( +- 0.18% )
>   7,348,664,233      cycles                   # 2.976 GHz                        ( +- 0.38% )  (38.39%)
>     878,805,326      stalled-cycles-frontend  # 11.99% frontend cycles idle      ( +- 0.74% )  (38.43%)
>     339,023,729      instructions             # 0.05 insn per cycle
>                                               # 2.53 stalled cycles per insn     ( +- 0.08% )  (38.47%)
>      88,579,915      branches                 # 35.873 M/sec                     ( +- 0.06% )  (38.51%)
>      17,369,776      branch-misses            # 19.55% of all branches           ( +- 0.04% )  (38.55%)
>   2,261,339,695      L1-dcache-loads          # 915.795 M/sec                    ( +- 0.06% )  (38.56%)
>   1,073,880,164      L1-dcache-load-misses    # 47.48% of all L1-dcache accesses ( +- 0.05% )  (38.56%)
>     511,231,988      L1-icache-loads          # 207.038 M/sec                    ( +- 0.25% )  (38.52%)
>         128,533      L1-icache-load-misses    # 0.02% of all L1-icache accesses  ( +- 0.40% )  (38.48%)
>          38,134      dTLB-loads               # 15.443 K/sec                     ( +- 4.22% )  (38.44%)
>          33,992      dTLB-load-misses         # 114.39% of all dTLB cache accesses ( +- 9.42% )  (38.40%)
>             156      iTLB-loads               # 63.177 /sec                      ( +- 13.34% )  (38.36%)
>             156      iTLB-load-misses         # 102.50% of all iTLB cache accesses ( +- 25.98% )  (38.36%)
>
>        2.47044 +- 0.00949 seconds time elapsed  ( +- 0.38% )
>
> x86/folio_zero_user
>
>        1,056.72 msec task-clock               # 0.996 CPUs utilized              ( +- 0.07% )
>              10      context-switches         # 9.436 /sec                       ( +- 3.59% )
>               3      cpu-migrations           # 2.831 /sec                       ( +- 11.33% )
>             200      page-faults              # 188.718 /sec                     ( +- 0.15% )
>   3,146,571,264      cycles                   # 2.969 GHz                        ( +- 0.07% )  (38.35%)
>      17,226,261      stalled-cycles-frontend  # 0.55% frontend cycles idle       ( +- 4.12% )  (38.44%)
>      14,130,553      instructions             # 0.00 insn per cycle
>                                               # 1.39 stalled cycles per insn     ( +- 1.59% )  (38.53%)
>       3,578,614      branches                 # 3.377 M/sec                      ( +- 1.54% )  (38.62%)
>         415,807      branch-misses            # 12.45% of all branches           ( +- 1.17% )  (38.62%)
>      22,208,699      L1-dcache-loads          # 20.956 M/sec                     ( +- 5.27% )  (38.60%)
>       7,312,684      L1-dcache-load-misses    # 27.79% of all L1-dcache accesses ( +- 8.46% )  (38.51%)
>       4,032,315      L1-icache-loads          # 3.805 M/sec                      ( +- 1.29% )  (38.48%)
>          15,094      L1-icache-load-misses    # 0.38% of all L1-icache accesses  ( +- 1.14% )  (38.39%)
>          14,365      dTLB-loads               # 13.555 K/sec                     ( +- 7.23% )  (38.38%)
>           9,477      dTLB-load-misses         # 65.36% of all dTLB cache accesses ( +- 12.05% )  (38.38%)
>              18      iTLB-loads               # 16.985 /sec                      ( +- 34.84% )  (38.38%)
>              67      iTLB-load-misses         # 158.39% of all iTLB cache accesses ( +- 48.32% )  (38.32%)
>
>        1.060877 +- 0.000766 seconds time elapsed  ( +- 0.07% )
>
> Thanks and Regards
> - Raghu
--
ankur