Message-ID: <2b79ab3b-56e7-926f-49f0-4c2584f6a72b@amd.com>
Date: Tue, 5 Sep 2023 06:36:33 +0530
From: Raghavendra K T <raghavendra.kt@....com>
To: Ankur Arora <ankur.a.arora@...cle.com>,
linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org
Cc: akpm@...ux-foundation.org, luto@...nel.org, bp@...en8.de,
dave.hansen@...ux.intel.com, hpa@...or.com, mingo@...hat.com,
juri.lelli@...hat.com, vincent.guittot@...aro.org,
willy@...radead.org, mgorman@...e.de, peterz@...radead.org,
rostedt@...dmis.org, tglx@...utronix.de, jon.grimm@....com,
bharata@....com, boris.ostrovsky@...cle.com, konrad.wilk@...cle.com
Subject: Re: [PATCH v2 0/9] x86/clear_huge_page: multi-page clearing
On 8/31/2023 12:19 AM, Ankur Arora wrote:
> This series adds a multi-page clearing primitive, clear_pages(),
> which enables more effective use of x86 string instructions by
> advertising the real region-size to be cleared.
>
> Region-size can be used as a hint by uarchs to optimize the
> clearing.
>
> Also add allow_resched() which marks a code-section as allowing
> rescheduling in the irqentry_exit path. This allows clear_pages()
> to get by without having to call cond_sched() periodically.
> (preempt_model_full() already handles this via
> irqentry_exit_cond_resched(), so we handle this similarly for
> preempt_model_none() and preempt_model_voluntary().)
>
>
Hello Ankur,
Thanks for the patches.
I tried the patches; the improvements look similar to V1 (even without
the circuitous chunk optimizations). We still see a similar 50-60%
improvement for the 1G and 2M page sizes.
SUT: Bergamo
CPU family: 25
Model: 160
Thread(s) per core: 2
Core(s) per socket: 128
Socket(s): 2
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-127,256-383
NUMA node1 CPU(s): 128-255,384-511
Test: use mmap(MAP_HUGETLB) to demand-fault a 64GB region (on NUMA
node0), for both base-hugepage-size=2M and 1GB.
The current result is with thp=always, but madvise also did not make
much difference.
perf stat -r 10 -d -d numactl -m 0 -N 0 <test>

Time elapsed in seconds (average of 10 runs; lower is better).
Result:
base:    mm/clear_huge_page
patched: x86/clear_huge_page

page-size    base       patched      improvement
2M           5.0779     2.50623      50.64%
1G           2.51890    1.012439     59.81%
More details:
Performance counter stats for 'mm/map_hugetlb' (10 runs):

          5,058.71 msec task-clock                #     0.996 CPUs utilized             ( +-  0.26% )
                 8      context-switches          #     1.576 /sec                      ( +-  7.23% )
                 0      cpu-migrations            #     0.000 /sec
            32,917      page-faults               #     6.484 K/sec                    ( +-  0.00% )
    15,797,804,067      cycles                    #     3.112 GHz                      ( +-  0.26% )  (35.70%)
         2,073,754      stalled-cycles-frontend   #     0.01% frontend cycles idle     ( +-  1.25% )  (35.71%)
        27,508,977      stalled-cycles-backend    #     0.17% backend cycles idle      ( +-  9.48% )  (35.74%)
     1,143,710,651      instructions              #     0.07 insn per cycle
                                                  #     0.03 stalled cycles per insn   ( +-  0.15% )  (35.76%)
       243,817,330      branches                  #    48.028 M/sec                    ( +-  0.12% )  (35.78%)
           357,760      branch-misses             #     0.15% of all branches          ( +-  1.52% )  (35.75%)
     2,540,733,497      L1-dcache-loads           #   500.483 M/sec                    ( +-  0.04% )  (35.74%)
     1,093,660,557      L1-dcache-load-misses     #    42.98% of all L1-dcache accesses  ( +-  0.03% )  (35.71%)
        73,335,478      L1-icache-loads           #    14.446 M/sec                    ( +-  0.08% )  (35.70%)
           878,378      L1-icache-load-misses     #     1.19% of all L1-icache accesses  ( +-  2.65% )  (35.68%)
         1,025,714      dTLB-loads                #   202.049 K/sec                    ( +-  2.70% )  (35.69%)
           405,407      dTLB-load-misses          #    37.35% of all dTLB cache accesses  ( +-  1.59% )  (35.68%)
                 2      iTLB-loads                #     0.394 /sec                     ( +- 41.63% )  (35.68%)
            40,356      iTLB-load-misses          # 1552153.85% of all iTLB cache accesses  ( +-  7.18% )  (35.68%)

            5.0779 +- 0.0132 seconds time elapsed  ( +- 0.26% )
Performance counter stats for 'numactl -m 0 -N 0 x86/map_hugetlb' (10 runs):

          2,538.40 msec task-clock                #     1.013 CPUs utilized             ( +-  0.27% )
                 4      context-switches          #     1.597 /sec                      ( +-  6.51% )
                 1      cpu-migrations            #     0.399 /sec
            32,916      page-faults               #    13.140 K/sec                    ( +-  0.00% )
     7,901,830,782      cycles                    #     3.154 GHz                      ( +-  0.27% )  (35.67%)
         6,590,473      stalled-cycles-frontend   #     0.08% frontend cycles idle     ( +- 10.31% )  (35.71%)
       329,970,288      stalled-cycles-backend    #     4.23% backend cycles idle      ( +- 13.65% )  (35.74%)
       725,811,962      instructions              #     0.09 insn per cycle
                                                  #     0.80 stalled cycles per insn   ( +-  0.37% )  (35.78%)
       132,182,704      branches                  #    52.767 M/sec                    ( +-  0.26% )  (35.82%)
           254,163      branch-misses             #     0.19% of all branches          ( +-  2.47% )  (35.81%)
     2,382,927,453      L1-dcache-loads           #   951.262 M/sec                    ( +-  0.04% )  (35.77%)
     1,082,022,067      L1-dcache-load-misses     #    45.41% of all L1-dcache accesses  ( +-  0.02% )  (35.74%)
        47,164,491      L1-icache-loads           #    18.828 M/sec                    ( +-  0.37% )  (35.70%)
           474,535      L1-icache-load-misses     #     0.99% of all L1-icache accesses  ( +-  2.93% )  (35.66%)
         1,477,334      dTLB-loads                #   589.750 K/sec                    ( +-  5.12% )  (35.65%)
           624,125      dTLB-load-misses          #    56.24% of all dTLB cache accesses  ( +-  5.66% )  (35.65%)
                 0      iTLB-loads                #     0.000 /sec                     (35.65%)
             1,626      iTLB-load-misses          #  7069.57% of all iTLB cache accesses  ( +-283.51% )  (35.65%)

           2.50623 +- 0.00691 seconds time elapsed  ( +- 0.28% )
Performance counter stats for 'numactl -m 0 -N 0 mm/map_hugetlb_1G' (10 runs):

          2,506.50 msec task-clock                #     0.995 CPUs utilized             ( +-  0.17% )
                 4      context-switches          #     1.589 /sec                      ( +-  9.28% )
                 0      cpu-migrations            #     0.000 /sec
               214      page-faults               #    84.997 /sec                     ( +-  0.13% )
     7,821,519,053      cycles                    #     3.107 GHz                      ( +-  0.17% )  (35.72%)
         2,037,744      stalled-cycles-frontend   #     0.03% frontend cycles idle     ( +- 25.62% )  (35.73%)
         6,578,899      stalled-cycles-backend    #     0.08% backend cycles idle      ( +-  2.65% )  (35.73%)
       468,648,780      instructions              #     0.06 insn per cycle
                                                  #     0.01 stalled cycles per insn   ( +-  0.10% )  (35.73%)
       116,267,370      branches                  #    46.179 M/sec                    ( +-  0.08% )  (35.73%)
           111,966      branch-misses             #     0.10% of all branches          ( +-  2.98% )  (35.72%)
     2,294,727,165      L1-dcache-loads           #   911.424 M/sec                    ( +-  0.02% )  (35.71%)
     1,076,156,463      L1-dcache-load-misses     #    46.88% of all L1-dcache accesses  ( +-  0.01% )  (35.70%)
        26,093,151      L1-icache-loads           #    10.364 M/sec                    ( +-  0.21% )  (35.71%)
           132,944      L1-icache-load-misses     #     0.51% of all L1-icache accesses  ( +-  0.55% )  (35.70%)
            30,925      dTLB-loads                #    12.283 K/sec                    ( +-  5.70% )  (35.71%)
            27,437      dTLB-load-misses          #    86.22% of all dTLB cache accesses  ( +-  1.98% )  (35.70%)
                 0      iTLB-loads                #     0.000 /sec                     (35.71%)
                11      iTLB-load-misses          #    62.50% of all iTLB cache accesses  ( +-140.21% )  (35.70%)

           2.51890 +- 0.00433 seconds time elapsed  ( +- 0.17% )
Performance counter stats for 'numactl -m 0 -N 0 x86/map_hugetlb_1G' (10 runs):

          1,013.59 msec task-clock                #     1.001 CPUs utilized             ( +-  0.07% )
                 2      context-switches          #     1.978 /sec                      ( +- 12.91% )
                 1      cpu-migrations            #     0.989 /sec
               213      page-faults               #   210.634 /sec                     ( +-  0.17% )
     3,169,391,694      cycles                    #     3.134 GHz                      ( +-  0.07% )  (35.53%)
           109,925      stalled-cycles-frontend   #     0.00% frontend cycles idle     ( +-  5.56% )  (35.63%)
       950,638,913      stalled-cycles-backend    #    30.06% backend cycles idle      ( +-  5.06% )  (35.73%)
        51,189,571      instructions              #     0.02 insn per cycle
                                                  #    21.03 stalled cycles per insn   ( +-  1.22% )  (35.82%)
         9,545,941      branches                  #     9.440 M/sec                    ( +-  1.50% )  (35.92%)
            86,836      branch-misses             #     0.88% of all branches          ( +-  3.74% )  (36.00%)
        46,109,587      L1-dcache-loads           #    45.597 M/sec                    ( +-  3.92% )  (35.96%)
        13,796,172      L1-dcache-load-misses     #    41.77% of all L1-dcache accesses  ( +-  4.81% )  (35.85%)
         1,179,166      L1-icache-loads           #     1.166 M/sec                    ( +-  1.22% )  (35.77%)
            21,528      L1-icache-load-misses     #     1.90% of all L1-icache accesses  ( +-  1.85% )  (35.66%)
            14,529      dTLB-loads                #    14.368 K/sec                    ( +-  4.65% )  (35.57%)
             8,505      dTLB-load-misses          #    67.88% of all dTLB cache accesses  ( +-  5.61% )  (35.52%)
                 0      iTLB-loads                #     0.000 /sec                     (35.52%)
                 8      iTLB-load-misses          #     0.00% of all iTLB cache accesses  ( +-267.99% )  (35.52%)

          1.012439 +- 0.000723 seconds time elapsed  ( +- 0.07% )
Please feel free to carry:
Tested-by: Raghavendra K T <raghavendra.kt@....com>
for any minor changes.
Thanks and Regards
- Raghu