Date:   Fri, 8 Sep 2023 07:48:16 +0530
From:   Raghavendra K T <raghavendra.kt@....com>
To:     Mateusz Guzik <mjguzik@...il.com>,
        Ankur Arora <ankur.a.arora@...cle.com>
Cc:     linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org,
        akpm@...ux-foundation.org, luto@...nel.org, bp@...en8.de,
        dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
        juri.lelli@...hat.com, vincent.guittot@...aro.org,
        willy@...radead.org, mgorman@...e.de, peterz@...radead.org,
        rostedt@...dmis.org, tglx@...utronix.de, jon.grimm@....com,
        bharata@....com, boris.ostrovsky@...cle.com, konrad.wilk@...cle.com
Subject: Re: [PATCH v2 0/9] x86/clear_huge_page: multi-page clearing

On 9/3/2023 1:44 PM, Mateusz Guzik wrote:
> On Wed, Aug 30, 2023 at 11:49:49AM -0700, Ankur Arora wrote:
>> This series adds a multi-page clearing primitive, clear_pages(),
>> which enables more effective use of x86 string instructions by
>> advertising the real region-size to be cleared.
>>
>> Region-size can be used as a hint by uarchs to optimize the
>> clearing.
>>
>> Also add allow_resched() which marks a code-section as allowing
>> rescheduling in the irqentry_exit path. This allows clear_pages()
>> to get by without having to call cond_resched() periodically.
>> (preempt_model_full() already handles this via
>> irqentry_exit_cond_resched(), so we handle this similarly for
>> preempt_model_none() and preempt_model_voluntary().)
>>
>> Performance
>> ==
>>
>> With this demand fault performance gets a decent increase:
>>
>>    *Milan*     mm/clear_huge_page   x86/clear_huge_page   change
>>                            (GB/s)                (GB/s)
>>
>>    pg-sz=2MB                14.55                 19.29    +32.5%
>>    pg-sz=1GB                19.34                 49.60   +156.4%
>>
>> Milan (and some other AMD Zen uarchs tested) take advantage of the
>> hint to elide cacheline allocation for pg-sz=1GB. The cut-off for
>> this optimization seems to be at around region-size > LLC-size so
>> the pg-sz=2MB load still allocates cachelines.
>>
> 
> Have you benchmarked clzero? It is an AMD-specific instruction issuing
> non-temporal stores. It is definitely something to try out for 1G pages.
> 
> One would think rep stosq has to be at least not worse since the CPU is
> explicitly told what to do and is free to optimize it however it sees
> fit, but the rep prefix has a long history of underperforming.
> 
> I'm not saying it is going to be better, but that this should be tested,
> albeit one can easily argue this can be done at a later date.
> 
> I would do it myself but my access to AMD CPUs is limited.
> 

Hello Mateusz,

I plugged in CLZERO unconditionally (even for the coherent path, with
sfence) for my earlier experiments on top of this series.
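
For reference, the clearing loop is essentially the following (an
illustrative sketch, not the exact experimental diff; CLZERO zeroes
the 64-byte cache line at the address in rAX using non-temporal
stores, so an SFENCE is needed afterwards to order them):

static void clear_region_clzero(void *addr, unsigned long len)
{
	/* Assumes addr and len are cache-line (64-byte) aligned. */
	void *end = addr + len;

	/* CLZERO zeroes one 64-byte cache line per iteration. */
	for (; addr < end; addr += 64)
		asm volatile("clzero" : : "a" (addr) : "memory");

	/* Order the weakly-ordered stores before returning. */
	asm volatile("sfence" : : : "memory");
}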

Test: use mmap(MAP_HUGETLB) to demand-fault a 64GB region (on NUMA
node 0), for both base-hugepage-size=2MB and 1GB:

perf stat -r 10 -d -d  numactl -m 0 -N 0 <test>
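
The test itself is the usual map_hugetlb-style program; a minimal
sketch (flag spellings illustrative) that demand-faults the whole
region one huge page at a time:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define LENGTH	(64UL << 30)	/* 64GB region */
#define STEP	(1UL << 30)	/* 1GB; 2MB for the other run */

int main(void)
{
	/* Use MAP_HUGE_2MB instead of MAP_HUGE_1GB for the 2M run. */
	char *p = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
		       MAP_HUGE_1GB, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Touch one byte per huge page to demand-fault it. */
	for (unsigned long off = 0; off < LENGTH; off += STEP)
		p[off] = 1;

	munmap(p, LENGTH);
	return 0;
}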

SUT: AMD Bergamo, 2 sockets / 2 NUMA nodes, 128 cores per socket.

From that I see the time taken is:
for 2M:  1.092125 s
for 1G:  0.997661 s

So overall the 64GB experiment results look like this:
Time taken for the 64GB region, in seconds (lower = better)

  page-size  base (s)   patched (s)  (gain%)   patched-clzero (s)  (gain%)
  2M         5.0779     2.50623      (50.64)   1.092125            (78)
  1G         2.50623    1.012439     (59.60)   0.997661            (60)

In summary, I see further improvement even for the 2M base size (2.5x).

Overall, CLZERO clearing is promising, but we may need threshold
tuning and hint passing on top of the current series, as done in
Ankur's earlier series:
Link:
https://lore.kernel.org/lkml/20220606202109.1306034-1-ankur.a.arora@oracle.com/

I still need to experiment further with different chunk sizes as well
as base sizes (for both CLZERO and rep stos); a strawman for the
rep-stos side is sketched below.
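
Roughly (a sketch only, in kernel style; the series instead issues
one large REP STOS and relies on allow_resched() rather than
chunking with cond_resched()):

static void clear_region_stosb(void *addr, unsigned long len,
			       unsigned long chunk)
{
	while (len) {
		/* Clear up to 'chunk' bytes per REP STOSB. */
		unsigned long n = min(len, chunk);

		len -= n;
		/* REP STOSB advances RDI and counts RCX down to 0. */
		asm volatile("rep stosb"
			     : "+D" (addr), "+c" (n)
			     : "a" (0)
			     : "memory");
		cond_resched();
	}
}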

Thanks and Regards
- Raghu

Run Details:
 Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb_1G' (10 runs):

            996.34 msec task-clock                #    0.999 CPUs utilized              ( +-  0.02% )
                 2      context-switches          #    2.007 /sec                       ( +- 21.34% )
                 0      cpu-migrations            #    0.000 /sec
               212      page-faults               #  212.735 /sec                       ( +-  0.20% )
     3,116,497,471      cycles                    #    3.127 GHz                        ( +-  0.02% )  (35.66%)
           100,343      stalled-cycles-frontend   #    0.00% frontend cycles idle       ( +- 16.85% )  (35.75%)
         1,369,118      stalled-cycles-backend    #    0.04% backend cycles idle        ( +-  3.45% )  (35.86%)
     4,325,987,025      instructions              #    1.39  insn per cycle
                                                  #    0.00  stalled cycles per insn    ( +-  0.02% )  (35.87%)
     1,078,119,163      branches                  #    1.082 G/sec                      ( +-  0.01% )  (35.87%)
            87,907      branch-misses             #    0.01% of all branches            ( +-  5.22% )  (35.83%)
        12,337,100      L1-dcache-loads           #   12.380 M/sec                      ( +-  5.44% )  (35.74%)
           280,300      L1-dcache-load-misses     #    2.48% of all L1-dcache accesses  ( +-  5.74% )  (35.64%)
         1,464,549      L1-icache-loads           #    1.470 M/sec                      ( +-  1.61% )  (35.63%)
            30,659      L1-icache-load-misses     #    2.12% of all L1-icache accesses  ( +-  3.30% )  (35.62%)
            17,366      dTLB-loads                #   17.426 K/sec                      ( +-  5.52% )  (35.63%)
            11,774      dTLB-load-misses          #   81.79% of all dTLB cache accesses ( +-  7.94% )  (35.63%)
                 0      iTLB-loads                #    0.000 /sec                       (35.63%)
                 2      iTLB-load-misses          #    0.00% of all iTLB cache accesses ( +-342.39% )  (35.64%)

          0.997661 +- 0.000150 seconds time elapsed  ( +-  0.02% )


 Performance counter stats for 'numactl -m 0 -N 0 map_hugetlb' (10 runs):

          1,089.97 msec task-clock                #    0.998 CPUs utilized              ( +-  0.03% )
                 3      context-switches          #    2.750 /sec                       ( +- 15.11% )
                 0      cpu-migrations            #    0.000 /sec
            32,917      page-faults               #   30.172 K/sec                      ( +-  0.00% )
     3,408,713,422      cycles                    #    3.124 GHz                        ( +-  0.03% )  (35.60%)
           982,417      stalled-cycles-frontend   #    0.03% frontend cycles idle       ( +-  2.77% )  (35.60%)
         8,495,409      stalled-cycles-backend    #    0.25% backend cycles idle        ( +-  6.12% )  (35.59%)
     4,970,939,278      instructions              #    1.46  insn per cycle
                                                  #    0.00  stalled cycles per insn    ( +-  0.04% )  (35.64%)
     1,196,644,653      branches                  #    1.097 G/sec                      ( +-  0.03% )  (35.73%)
           196,584      branch-misses             #    0.02% of all branches            ( +-  2.79% )  (35.78%)
       226,254,284      L1-dcache-loads           #  207.388 M/sec                      ( +-  0.23% )  (35.78%)
         1,161,607      L1-dcache-load-misses     #    0.52% of all L1-dcache accesses  ( +-  3.27% )  (35.78%)
        21,757,775      L1-icache-loads           #   19.943 M/sec                      ( +-  0.66% )  (35.77%)
           165,503      L1-icache-load-misses     #    0.78% of all L1-icache accesses  ( +-  3.11% )  (35.78%)
         1,118,573      dTLB-loads                #    1.025 M/sec                      ( +-  1.38% )  (35.78%)
           415,943      dTLB-load-misses          #   37.10% of all dTLB cache accesses ( +-  1.12% )  (35.78%)
                36      iTLB-loads                #   32.998 /sec                       ( +- 18.47% )  (35.74%)
            49,785      iTLB-load-misses          # 270570.65% of all iTLB cache accesses  ( +-  0.34% )  (35.65%)

          1.092125 +- 0.000350 seconds time elapsed  ( +-  0.03% )
