Message-ID: <0d6ba41c-0c90-4130-896a-26eabbd5bd24@amd.com>
Date: Tue, 22 Apr 2025 11:53:06 +0530
From: Raghavendra K T <raghavendra.kt@....com>
To: Ankur Arora <ankur.a.arora@...cle.com>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, x86@...nel.org
Cc: torvalds@...ux-foundation.org, akpm@...ux-foundation.org, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
luto@...nel.org, peterz@...radead.org, paulmck@...nel.org,
rostedt@...dmis.org, tglx@...utronix.de, willy@...radead.org,
jon.grimm@....com, bharata@....com, boris.ostrovsky@...cle.com,
konrad.wilk@...cle.com
Subject: Re: [PATCH v3 0/4] mm/folio_zero_user: add multi-page clearing
On 4/14/2025 9:16 AM, Ankur Arora wrote:
> This series adds multi-page clearing for hugepages. It is a rework
> of [1] which took a detour through PREEMPT_LAZY [2].
>
> Why multi-page clearing? Multi-page clearing improves on the
> current page-at-a-time approach by giving the processor a hint
> about the real region size. The processor can use this hint to,
> for instance, elide cacheline allocation when clearing a large
> region.
>
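For illustration, a minimal sketch of the difference (helper names
below are hypothetical, not the actual API added by this series):

	/* Page-at-a-time: the CPU only ever sees PAGE_SIZE extents. */
	for (i = 0; i < nr_pages; i++)
		clear_user_highpage(page + i, addr + i * PAGE_SIZE);

	/* Multi-page: one call exposes the full extent, so the CPU
	 * can pick a better strategy (e.g. skip cacheline allocation).
	 * clear_pages() is an illustrative name only. */
	clear_pages(page_address(page), nr_pages);
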
> This particular optimization is done by REP; STOS on AMD Zen,
> where regions larger than the L3 cache are cleared with
> non-temporal stores.
>
> This results in significantly better performance.
>
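On x86 this primitive boils down to a single string store over the
whole extent; a minimal sketch of such a helper, assuming a plain
REP STOSB (not the actual implementation in the patches):

	static inline void clear_region(void *dst, unsigned long len)
	{
		/* REP STOSB: dst in %rdi, count in %rcx, fill byte
		 * in %al. Given the full extent up front, Zen can
		 * internally switch to non-temporal stores for
		 * regions larger than the L3. */
		asm volatile("rep stosb"
			     : "+D" (dst), "+c" (len)
			     : "a" (0)
			     : "memory");
	}
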
> We also see a performance improvement in cases where this
> optimization is unavailable (pg-sz=2MB on AMD, and pg-sz=2MB|1GB
> on Intel): REP; STOS is typically microcoded, so its setup cost
> can now be amortized over larger regions, and the hint allows the
> hardware prefetcher to do a better job.
>
> Milan (EPYC 7J13, boost=0, preempt=full|lazy):
>
> mm/folio_zero_user x86/folio_zero_user change
> (GB/s +- stddev) (GB/s +- stddev)
>
> pg-sz=1GB 16.51 +- 0.54% 42.80 +- 3.48% + 159.2%
> pg-sz=2MB 11.89 +- 0.78% 16.12 +- 0.12% + 35.5%
>
> Icelakex (Platinum 8358, no_turbo=1, preempt=full|lazy):
>
> mm/folio_zero_user x86/folio_zero_user change
> (GB/s +- stddev) (GB/s +- stddev)
>
> pg-sz=1GB 8.01 +- 0.24% 11.26 +- 0.48% + 40.57%
> pg-sz=2MB 7.95 +- 0.30% 10.90 +- 0.26% + 37.10%
>
[...]
Hello Ankur,

Thank you for the patches. I was able to test them briefly with
lazy preempt mode. (I do understand that there could be a lot of
churn based on Ingo's, Mateusz's, and others' comments.)

Here are the results:
SUT: AMD EPYC 9B24 (Genoa), preempt=lazy
Metric: time taken in seconds (lower is better), total SIZE=64GB

             mm/folio_zero_user    x86/folio_zero_user    change
             (sec +- stddev)       (sec +- stddev)        (% time reduction)

pg-sz=1GB    2.47044  +- 0.38%     1.060877 +- 0.07%      57.06
pg-sz=2MB    5.098403 +- 0.01%     2.52015  +- 0.36%      50.57
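
The map_hugetlb_1G binary itself is not included here; a minimal
reconstruction of what such a test presumably does (mmap hugetlb
memory and fault it in, so the kernel's folio_zero_user() path does
the clearing; sizes and flags below are assumptions):

	#include <stddef.h>
	#include <stdio.h>
	#include <sys/mman.h>

	#ifndef MAP_HUGE_1GB
	#define MAP_HUGE_1GB	(30 << 26)	/* 30 == log2(1GB), MAP_HUGE_SHIFT == 26 */
	#endif
	#define SIZE		(64UL << 30)	/* 64GB total */

	int main(void)
	{
		char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
			       MAP_HUGE_1GB, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		/* Touch one byte per hugepage: each fault zeroes a
		 * freshly allocated 1GB page. */
		for (size_t off = 0; off < SIZE; off += 1UL << 30)
			p[off] = 1;
		munmap(p, SIZE);
		return 0;
	}
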
More details (1G example run):
base kernel = 6.14 (preempt = lazy)
mm/folio_zero_user
 Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

          2,476.47 msec task-clock                #    1.002 CPUs utilized               ( +-  0.39% )
                 5      context-switches          #    2.025 /sec                        ( +- 29.70% )
                 2      cpu-migrations            #    0.810 /sec                        ( +- 21.15% )
               202      page-faults               #   81.806 /sec                        ( +-  0.18% )
     7,348,664,233      cycles                    #    2.976 GHz                         ( +-  0.38% )  (38.39%)
       878,805,326      stalled-cycles-frontend   #   11.99% frontend cycles idle        ( +-  0.74% )  (38.43%)
       339,023,729      instructions              #    0.05  insn per cycle
                                                  #    2.53  stalled cycles per insn     ( +-  0.08% )  (38.47%)
        88,579,915      branches                  #   35.873 M/sec                       ( +-  0.06% )  (38.51%)
        17,369,776      branch-misses             #   19.55% of all branches             ( +-  0.04% )  (38.55%)
     2,261,339,695      L1-dcache-loads           #  915.795 M/sec                       ( +-  0.06% )  (38.56%)
     1,073,880,164      L1-dcache-load-misses     #   47.48% of all L1-dcache accesses   ( +-  0.05% )  (38.56%)
       511,231,988      L1-icache-loads           #  207.038 M/sec                       ( +-  0.25% )  (38.52%)
           128,533      L1-icache-load-misses     #    0.02% of all L1-icache accesses   ( +-  0.40% )  (38.48%)
            38,134      dTLB-loads                #   15.443 K/sec                       ( +-  4.22% )  (38.44%)
            33,992      dTLB-load-misses          #  114.39% of all dTLB cache accesses  ( +-  9.42% )  (38.40%)
               156      iTLB-loads                #   63.177 /sec                        ( +- 13.34% )  (38.36%)
               156      iTLB-load-misses          #  102.50% of all iTLB cache accesses  ( +- 25.98% )  (38.36%)

           2.47044 +- 0.00949 seconds time elapsed  ( +-  0.38% )
x86/folio_zero_user
          1,056.72 msec task-clock                #    0.996 CPUs utilized               ( +-  0.07% )
                10      context-switches          #    9.436 /sec                        ( +-  3.59% )
                 3      cpu-migrations            #    2.831 /sec                        ( +- 11.33% )
               200      page-faults               #  188.718 /sec                        ( +-  0.15% )
     3,146,571,264      cycles                    #    2.969 GHz                         ( +-  0.07% )  (38.35%)
        17,226,261      stalled-cycles-frontend   #    0.55% frontend cycles idle        ( +-  4.12% )  (38.44%)
        14,130,553      instructions              #    0.00  insn per cycle
                                                  #    1.39  stalled cycles per insn     ( +-  1.59% )  (38.53%)
         3,578,614      branches                  #    3.377 M/sec                       ( +-  1.54% )  (38.62%)
           415,807      branch-misses             #   12.45% of all branches             ( +-  1.17% )  (38.62%)
        22,208,699      L1-dcache-loads           #   20.956 M/sec                       ( +-  5.27% )  (38.60%)
         7,312,684      L1-dcache-load-misses     #   27.79% of all L1-dcache accesses   ( +-  8.46% )  (38.51%)
         4,032,315      L1-icache-loads           #    3.805 M/sec                       ( +-  1.29% )  (38.48%)
            15,094      L1-icache-load-misses     #    0.38% of all L1-icache accesses   ( +-  1.14% )  (38.39%)
            14,365      dTLB-loads                #   13.555 K/sec                       ( +-  7.23% )  (38.38%)
             9,477      dTLB-load-misses          #   65.36% of all dTLB cache accesses  ( +- 12.05% )  (38.38%)
                18      iTLB-loads                #   16.985 /sec                        ( +- 34.84% )  (38.38%)
                67      iTLB-load-misses          #  158.39% of all iTLB cache accesses  ( +- 48.32% )  (38.32%)

          1.060877 +- 0.000766 seconds time elapsed  ( +-  0.07% )
Thanks and Regards
- Raghu