linux-kernel - [PATCH 6/9] mm/clear_huge_page: use multi-page clearing

Message-Id: <20230403052233.1880567-7-ankur.a.arora@oracle.com>
Date:   Sun,  2 Apr 2023 22:22:30 -0700
From:   Ankur Arora <ankur.a.arora@...cle.com>
To:     linux-kernel@...r.kernel.org, linux-mm@...ck.org, x86@...nel.org
Cc:     torvalds@...ux-foundation.org, akpm@...ux-foundation.org,
        luto@...nel.org, bp@...en8.de, dave.hansen@...ux.intel.com,
        hpa@...or.com, mingo@...hat.com, juri.lelli@...hat.com,
        willy@...radead.org, mgorman@...e.de, peterz@...radead.org,
        rostedt@...dmis.org, tglx@...utronix.de,
        vincent.guittot@...aro.org, jon.grimm@....com, bharata@....com,
        boris.ostrovsky@...cle.com, konrad.wilk@...cle.com,
        ankur.a.arora@...cle.com
Subject: [PATCH 6/9] mm/clear_huge_page: use multi-page clearing

clear_pages_rep() and clear_pages_erms() use string instructions
internally. Unlike a MOV loop, these let us explicitly advertise the
region size to the processor. Clearing in multi-page chunks therefore
means we can specify the real region size (or close to it), which is
good for two reasons:

 - the region size can serve as a hint to current (some AMD Zen models)
   and possibly future uarchs, which can use it to avoid polluting one
   or more levels of the dcache.

 - string instructions are typically microcoded, so their fixed cost is
   cheaper when amortized across larger regions. We also execute fewer
   loop iterations (e.g. one cond_resched() check per page), though
   those instructions are likely close to free.
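
As a rough userspace illustration of the first point (this is not the
kernel's clear_pages_rep()/clear_pages_erms()), the sketch below
contrasts page-at-a-time clearing with a single REP STOSB over the whole
extent. The helper names and SKETCH_PAGE_SIZE are hypothetical, and
non-x86 builds fall back to memset():

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed base page size for this sketch */

/*
 * Zero len bytes at dst with one string instruction. The full length is
 * handed to the CPU in the count register before the store starts.
 */
static void zero_region(void *dst, size_t len)
{
#if defined(__x86_64__) || defined(__i386__)
	__asm__ volatile("rep stosb"
			 : "+D"(dst), "+c"(len)
			 : "a"(0)
			 : "memory");
#else
	memset(dst, 0, len);	/* fallback: no string instruction here */
#endif
}

/* Page-at-a-time: the CPU only ever sees SKETCH_PAGE_SIZE extents. */
static void zero_per_page(void *dst, size_t npages)
{
	char *p = dst;
	size_t i;

	assert(dst != NULL);
	for (i = 0; i < npages; i++)
		zero_region(p + i * SKETCH_PAGE_SIZE, SKETCH_PAGE_SIZE);
}

/* Multi-page: a single call advertises the whole extent. */
static void zero_multi_page(void *dst, size_t npages)
{
	assert(dst != NULL);
	zero_region(dst, npages * SKETCH_PAGE_SIZE);
}
```

Both variants leave the buffer in the same state; the only difference is
the extent the processor is told about per instruction.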

clear_huge_page() now clears in three sections: the local neighbourhood
of the faulting address (the faulting page and two pages on either
side), and the regions to its left and right.

The local neighbourhood is cleared last to keep its cachelines hot.
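
The split can be sketched in isolation as below. This is a userspace
restatement of the index arithmetic only; struct span and split_spans()
are hypothetical names used for illustration:

```c
#include <assert.h>

/* Inclusive page-index range; the span is empty when end < start. */
struct span {
	int start, end;
};

/*
 * Split page indices [0, npages - 1] around the faulting index pgidx.
 * out[0] is the left region, out[1] the right region and out[2] the
 * neighbourhood (pgidx +/- width, clamped), which is cleared last.
 */
static void split_spans(int npages, int pgidx, int width, struct span out[3])
{
	const int first_pg = 0, last_pg = npages - 1;

	assert(npages > 0 && pgidx >= first_pg && pgidx <= last_pg);

	out[2].start = (pgidx - width) < first_pg ? first_pg : pgidx - width;
	out[2].end   = (pgidx + width) > last_pg  ? last_pg  : pgidx + width;

	out[0].start = first_pg;
	out[0].end   = out[2].start - 1;

	out[1].start = out[2].end + 1;
	out[1].end   = last_pg;
}
```

For a 2MB huge page (512 base pages) faulting at page index 100 with
width 2, this gives a left region of 0..97, a right region of 103..511
and a neighbourhood of 98..102. A fault at index 0 leaves the left
region empty.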

Performance
===========

Use mmap(MAP_HUGETLB) to demand fault a 64GB region (on the local
NUMA node):

Icelakex (Platinum 8358, ucode=0xd0002c1, no_turbo=1):

              mm/clear_huge_page   x86/clear_huge_page   change   
                          (GB/s)                (GB/s)            
                                                                  
  pg-sz=2MB                 8.76                 11.82   +34.93%  
  pg-sz=1GB                 8.99                 12.18   +35.48%  

On Icelakex we continue to allocate cachelines:

pg-sz=2MB:
    -   701,951,397      L1-dcache-loads           #   47.985 M/sec                       ( +- 19.22% )  (69.23%)
    - 3,239,403,770      L1-dcache-load-misses     #  691.17% of all L1-dcache accesses   ( +- 19.25% )  (69.24%)
    +   194,318,641      L1-dcache-loads           #   17.905 M/sec                       ( +- 19.07% )  (69.25%)
    + 3,238,878,229      L1-dcache-load-misses     # 2480.93% of all L1-dcache accesses   ( +- 19.25% )  (69.26%)

pg-sz=1GB:
    -   532,232,051      L1-dcache-loads           #   37.378 M/sec                       ( +- 19.25% )  (69.23%)
    - 3,224,574,249      L1-dcache-load-misses     #  909.02% of all L1-dcache accesses   ( +- 19.25% )  (69.24%)
    +    22,587,703      L1-dcache-loads           #    2.150 M/sec                       ( +- 19.38% )  (69.25%)
    + 3,223,143,697      L1-dcache-load-misses     # 21478.37% of all L1-dcache accesses  ( +- 19.25% )  (69.25%)


Milan (EPYC 7J13, ucode=0xa0011a9, boost=0):

              mm/clear_huge_page   x86/clear_huge_page   change    
                          (GB/s)                (GB/s)             
                                                                   
  pg-sz=2MB                12.24                 17.54    +43.30%  
  pg-sz=1GB                17.98                 37.24   +107.11%  

Milan uses a threshold of ~32MB for eliding cacheline allocation. A 2MB
extent stays below it, so we only see a drop-off in cacheline
allocations for pg-sz=1GB:

pg-sz=2MB:
    - 2,495,566,569      L1-dcache-loads           #  476.417 M/sec                      ( +-  0.04% )  (33.38%)
    - 1,079,711,798      L1-dcache-load-misses     #   43.28% of all L1-dcache accesses  ( +-  0.01% )  (33.37%)
    + 2,235,310,058      L1-dcache-loads           #  610.770 M/sec                      ( +-  0.02% )  (33.37%)
    + 1,089,602,355      L1-dcache-load-misses     #   48.73% of all L1-dcache accesses  ( +-  0.01% )  (33.37%)

pg-sz=1GB:
    - 2,417,846,489      L1-dcache-loads           #  679.753 M/sec                      ( +-  0.01% )  (33.38%)
    - 1,075,531,869      L1-dcache-load-misses     #   44.49% of all L1-dcache accesses  ( +-  0.01% )  (33.35%)
    +    31,159,378      L1-dcache-loads           #   18.119 M/sec                      ( +-  3.27% )  (33.46%)
    +    14,692,358      L1-dcache-load-misses     #   48.21% of all L1-dcache accesses  ( +-  3.12% )  (33.46%)
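
As a rough sanity check on the Milan counters, assuming 64-byte
cachelines and that the 64GB region is 64 GiB (neither is stated above),
the region spans 2^30 = 1,073,741,824 cachelines. That is close to the
~1.08 billion L1-dcache-load-misses seen wherever cachelines are still
allocated. The arithmetic, with a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Number of cachelines needed to cover region_bytes, rounding up. */
static uint64_t cachelines_in_region(uint64_t region_bytes, uint64_t line_bytes)
{
	assert(line_bytes > 0);
	return (region_bytes + line_bytes - 1) / line_bytes;
}
```

Calling cachelines_in_region(64ULL << 30, 64) yields 1ULL << 30.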

Signed-off-by: Ankur Arora <ankur.a.arora@...cle.com>
---

Fuller perf stats for context:

# Icelakex, baseline (mm/clear_huge_page), region-sz=64g, pg-sz=2mb

 Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 1' (3 runs):

         21,945.59 msec task-clock                       #    2.999 CPUs utilized               ( +- 19.25% )
                34      context-switches                 #    2.324 /sec                        ( +- 20.38% )
                 3      cpu-migrations                   #    0.205 /sec                        ( +- 19.25% )
           198,152      page-faults                      #   13.546 K/sec                       ( +- 19.29% )
    56,513,364,885      cycles                           #    3.863 GHz                         ( +- 19.25% )  (38.44%)
     2,583,719,806      instructions                     #    0.07  insn per cycle              ( +- 19.24% )  (46.14%)
       585,212,952      branches                         #   40.005 M/sec                       ( +- 19.23% )  (53.83%)
           562,164      branch-misses                    #    0.14% of all branches             ( +- 19.23% )  (61.53%)
   282,621,312,162      slots                            #   19.320 G/sec                       ( +- 19.25% )  (69.22%)
    11,048,627,225      topdown-retiring                 #      3.8% Retiring                   ( +- 19.22% )  (69.22%)
    34,358,400,894      topdown-bad-spec                 #     11.5% Bad Speculation            ( +- 19.57% )  (69.22%)
     2,231,092,499      topdown-fe-bound                 #      0.8% Frontend Bound             ( +- 19.25% )  (69.22%)
   246,679,210,776      topdown-be-bound                 #     84.0% Backend Bound              ( +- 19.21% )  (69.22%)
       701,951,397      L1-dcache-loads                  #   47.985 M/sec                       ( +- 19.22% )  (69.23%)
     3,239,403,770      L1-dcache-load-misses            #  691.17% of all L1-dcache accesses   ( +- 19.25% )  (69.24%)
        11,475,685      LLC-loads                        #  784.475 K/sec                       ( +- 19.23% )  (69.25%)
           793,272      LLC-load-misses                  #   10.36% of all LL-cache accesses    ( +- 19.23% )  (69.25%)
        17,821,045      L1-icache-load-misses            #    0.00% of all L1-icache accesses   ( +- 19.51% )  (30.77%)
       693,339,354      dTLB-loads                       #   47.397 M/sec                       ( +- 19.33% )  (30.76%)
           637,811      dTLB-load-misses                 #    0.14% of all dTLB cache accesses  ( +- 19.09% )  (30.75%)
           131,922      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +- 19.59% )  (30.75%)

           7.31681 +- 0.00177 seconds time elapsed  ( +-  0.02% )


# Icelakex, multi-page (x86/clear_huge_page), region-sz=64g, pg-sz=2mb

 Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 1' (3 runs):

         16,276.28 msec task-clock                       #    2.999 CPUs utilized               ( +- 19.24% )
                27      context-switches                 #    2.488 /sec                        ( +- 19.25% )
                 3      cpu-migrations                   #    0.276 /sec                        ( +- 19.25% )
           196,935      page-faults                      #   18.146 K/sec                       ( +- 19.25% )
    41,906,597,608      cycles                           #    3.861 GHz                         ( +- 19.24% )  (38.44%)
       729,479,932      instructions                     #    0.03  insn per cycle              ( +- 19.38% )  (46.14%)
       133,969,095      branches                         #   12.344 M/sec                       ( +- 19.35% )  (53.84%)
           412,818      branch-misses                    #    0.46% of all branches             ( +- 18.97% )  (61.54%)
   209,574,316,961      slots                            #   19.311 G/sec                       ( +- 19.24% )  (69.24%)
     4,933,512,982      topdown-retiring                 #      2.3% Retiring                   ( +- 19.24% )  (69.24%)
    20,272,641,267      topdown-bad-spec                 #      9.4% Bad Speculation            ( +- 19.51% )  (69.24%)
       837,421,487      topdown-fe-bound                 #      0.4% Frontend Bound             ( +- 19.24% )  (69.24%)
   190,089,232,476      topdown-be-bound                 #     88.0% Backend Bound              ( +- 19.19% )  (69.24%)
       194,318,641      L1-dcache-loads                  #   17.905 M/sec                       ( +- 19.07% )  (69.25%)
     3,238,878,229      L1-dcache-load-misses            # 2480.93% of all L1-dcache accesses   ( +- 19.25% )  (69.26%)
        10,560,508      LLC-loads                        #  973.081 K/sec                       ( +- 19.23% )  (69.26%)
           724,884      LLC-load-misses                  #   10.28% of all LL-cache accesses    ( +- 17.15% )  (69.26%)
        14,378,070      L1-icache-load-misses            #    0.00% of all L1-icache accesses   ( +- 19.13% )  (30.75%)
       185,562,230      dTLB-loads                       #   17.098 M/sec                       ( +- 19.74% )  (30.74%)
           617,978      dTLB-load-misses                 #    0.51% of all dTLB cache accesses  ( +- 18.72% )  (30.74%)
           112,509      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +- 19.76% )  (30.74%)

           5.42697 +- 0.00152 seconds time elapsed  ( +-  0.03% )


# Icelakex, baseline (mm/clear_huge_page), region-sz=64g, pg-sz=1gb

 Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64 --huge 2' (3 runs):

         21,361.22 msec task-clock                       #    2.999 CPUs utilized               ( +- 19.25% )
                23      context-switches                 #    1.615 /sec                        ( +- 18.95% )
                 3      cpu-migrations                   #    0.211 /sec                        ( +- 19.25% )
               701      page-faults                      #   49.230 /sec                        ( +- 19.27% )
    54,981,958,487      cycles                           #    3.861 GHz                         ( +- 19.25% )  (38.44%)
     2,012,625,953      instructions                     #    0.05  insn per cycle              ( +- 19.25% )  (46.14%)
       470,264,509      branches                         #   33.026 M/sec                       ( +- 19.25% )  (53.83%)
           194,801      branch-misses                    #    0.06% of all branches             ( +- 18.88% )  (61.53%)
   274,966,507,627      slots                            #   19.311 G/sec                       ( +- 19.25% )  (69.22%)
    10,555,137,650      topdown-retiring                 #      3.8% Retiring                   ( +- 19.04% )  (69.22%)
    21,206,785,918      topdown-bad-spec                 #      7.8% Bad Speculation            ( +- 18.13% )  (69.22%)
     1,094,597,329      topdown-fe-bound                 #      0.4% Frontend Bound             ( +- 19.25% )  (69.22%)
   244,462,123,545      topdown-be-bound                 #     88.0% Backend Bound              ( +- 19.33% )  (69.22%)
       532,232,051      L1-dcache-loads                  #   37.378 M/sec                       ( +- 19.25% )  (69.23%)
     3,224,574,249      L1-dcache-load-misses            #  909.02% of all L1-dcache accesses   ( +- 19.25% )  (69.24%)
         2,318,195      LLC-loads                        #  162.804 K/sec                       ( +- 19.35% )  (69.25%)
           206,737      LLC-load-misses                  #   13.44% of all LL-cache accesses    ( +- 18.30% )  (69.25%)
         4,950,866      L1-icache-load-misses            #    0.00% of all L1-icache accesses   ( +- 19.26% )  (30.77%)
       531,299,560      dTLB-loads                       #   37.313 M/sec                       ( +- 19.24% )  (30.76%)
             2,811      dTLB-load-misses                 #    0.00% of all dTLB cache accesses  ( +- 17.25% )  (30.75%)
            26,355      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +- 19.58% )  (30.75%)

           7.12187 +- 0.00190 seconds time elapsed  ( +-  0.03% )


# Icelakex, multi-page (x86/clear_huge_page), region-sz=64g, pg-sz=1gb

 Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64 --huge 2' (3 runs):

         15,764.52 msec task-clock                       #    2.999 CPUs utilized               ( +- 19.25% )
                17      context-switches                 #    1.618 /sec                        ( +- 20.47% )
                 3      cpu-migrations                   #    0.285 /sec                        ( +- 19.25% )
               700      page-faults                      #   66.614 /sec                        ( +- 19.22% )
    40,560,984,582      cycles                           #    3.860 GHz                         ( +- 19.25% )  (38.45%)
        79,578,792      instructions                     #    0.00  insn per cycle              ( +- 19.24% )  (46.15%)
        13,872,134      branches                         #    1.320 M/sec                       ( +- 19.23% )  (53.85%)
           119,492      branch-misses                    #    1.29% of all branches             ( +- 18.80% )  (61.55%)
   202,854,573,160      slots                            #   19.304 G/sec                       ( +- 19.25% )  (69.25%)
     3,982,417,725      topdown-retiring                 #      2.0% Retiring                   ( +- 19.25% )  (69.25%)
    13,523,424,635      topdown-bad-spec                 #      6.8% Bad Speculation            ( +- 18.69% )  (69.25%)
        18,661,431      topdown-fe-bound                 #      0.0% Frontend Bound             ( +- 19.28% )  (69.25%)
   185,884,147,789      topdown-be-bound                 #     91.3% Backend Bound              ( +- 19.28% )  (69.25%)
        22,587,703      L1-dcache-loads                  #    2.150 M/sec                       ( +- 19.38% )  (69.25%)
     3,223,143,697      L1-dcache-load-misses            # 21478.37% of all L1-dcache accesses  ( +- 19.25% )  (69.25%)
         1,777,675      LLC-loads                        #  169.169 K/sec                       ( +- 19.60% )  (69.25%)
           126,583      LLC-load-misses                  #   10.77% of all LL-cache accesses    ( +- 19.82% )  (69.25%)
         3,333,729      L1-icache-load-misses            #    0.00% of all L1-icache accesses   ( +- 19.49% )  (30.75%)
        19,999,517      dTLB-loads                       #    1.903 M/sec                       ( +- 19.38% )  (30.75%)
             1,833      dTLB-load-misses                 #    0.01% of all dTLB cache accesses  ( +- 17.72% )  (30.75%)
            34,066      iTLB-load-misses                 #    0.00% of all iTLB cache accesses  ( +- 19.09% )  (30.75%)

           5.25624 +- 0.00176 seconds time elapsed  ( +-  0.03% )


# Milan, baseline (mm/clear_huge_page), region-sz=64g, pg-sz=2mb

 Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 1' (3 runs):

          5,241.76 msec task-clock                #    1.000 CPUs utilized            ( +-  0.08% )
                10      context-switches          #    1.909 /sec                     ( +-  8.82% )
                 1      cpu-migrations            #    0.191 /sec                   
            65,636      page-faults               #   12.530 K/sec                    ( +-  0.00% )
    12,730,694,768      cycles                    #    2.430 GHz                      ( +-  0.08% )  (33.31%)
        36,709,243      stalled-cycles-frontend   #    0.29% frontend cycles idle     ( +- 24.07% )  (33.32%)
        37,520,225      stalled-cycles-backend    #    0.29% backend cycles idle      ( +-  9.87% )  (33.34%)
       874,896,010      instructions              #    0.07  insn per cycle         
                                                  #    0.05  stalled cycles per insn  ( +-  1.23% )  (33.36%)
       199,308,386      branches                  #   38.049 M/sec                    ( +-  0.84% )  (33.38%)
           441,428      branch-misses             #    0.22% of all branches          ( +-  4.68% )  (33.38%)
     2,495,566,569      L1-dcache-loads           #  476.417 M/sec                    ( +-  0.04% )  (33.38%)
     1,079,711,798      L1-dcache-load-misses     #   43.28% of all L1-dcache accesses  ( +-  0.01% )  (33.37%)
        50,936,391      L1-icache-loads           #    9.724 M/sec                    ( +-  1.29% )  (33.35%)
           284,407      L1-icache-load-misses     #    0.56% of all L1-icache accesses  ( +-  4.60% )  (33.33%)
           546,596      dTLB-loads                #  104.348 K/sec                    ( +-  6.14% )  (33.31%)
           231,897      dTLB-load-misses          #   42.08% of all dTLB cache accesses  ( +-  4.27% )  (33.29%)
                 6      iTLB-loads                #    1.145 /sec                     ( +- 72.65% )  (33.29%)
            34,065      iTLB-load-misses          # 262038.46% of all iTLB cache accesses  ( +- 44.88% )  (33.29%)
        18,237,487      L1-dcache-prefetches      #    3.482 M/sec                    ( +- 12.84% )  (33.29%)

           5.23915 +- 0.00421 seconds time elapsed  ( +-  0.08% )

# Milan, multi-page (x86/clear_huge_page), region-sz=64g, pg-sz=2mb

 Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 1' (3 runs):

          3,655.71 msec task-clock                #    0.999 CPUs utilized            ( +-  0.13% )
                 7      context-switches          #    1.913 /sec                     ( +-  8.25% )
                 1      cpu-migrations            #    0.273 /sec                   
            65,636      page-faults               #   17.934 K/sec                    ( +-  0.00% )
     8,879,727,514      cycles                    #    2.426 GHz                      ( +-  0.13% )  (33.26%)
         5,733,380      stalled-cycles-frontend   #    0.06% frontend cycles idle     ( +-170.04% )  (33.28%)
        42,012,302      stalled-cycles-backend    #    0.47% backend cycles idle      ( +- 24.51% )  (33.31%)
       214,672,610      instructions              #    0.02  insn per cycle         
                                                  #    0.28  stalled cycles per insn  ( +-  1.71% )  (33.34%)
        42,298,268      branches                  #   11.557 M/sec                    ( +-  1.28% )  (33.36%)
           267,936      branch-misses             #    0.62% of all branches          ( +-  7.80% )  (33.37%)
     2,235,310,058      L1-dcache-loads           #  610.770 M/sec                    ( +-  0.02% )  (33.37%)
     1,089,602,355      L1-dcache-load-misses     #   48.73% of all L1-dcache accesses  ( +-  0.01% )  (33.37%)
        48,725,812      L1-icache-loads           #   13.314 M/sec                    ( +-  0.25% )  (33.37%)
           231,227      L1-icache-load-misses     #    0.47% of all L1-icache accesses  ( +- 13.20% )  (33.37%)
           280,655      dTLB-loads                #   76.685 K/sec                    ( +- 13.33% )  (33.37%)
           151,028      dTLB-load-misses          #   44.02% of all dTLB cache accesses  ( +-  6.64% )  (33.35%)
                15      iTLB-loads                #    4.099 /sec                     ( +-  6.67% )  (33.32%)
           121,208      iTLB-load-misses          # 865771.43% of all iTLB cache accesses  ( +-  2.74% )  (33.29%)
        18,702,209      L1-dcache-prefetches      #    5.110 M/sec                    ( +- 12.51% )  (33.27%)

           3.66065 +- 0.00461 seconds time elapsed  ( +-  0.13% )


# Milan, baseline (mm/clear_huge_page), region-sz=64g, pg-sz=1gb

 Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 2' (3 runs):

          3,544.20 msec task-clock                #    0.996 CPUs utilized            ( +-  0.21% )
                 5      context-switches          #    1.406 /sec                     ( +-  6.67% )
                 1      cpu-migrations            #    0.281 /sec                   
               227      page-faults               #   63.819 /sec                     ( +-  0.15% )
     8,609,810,964      cycles                    #    2.421 GHz                      ( +-  0.21% )  (33.30%)
        77,420,424      stalled-cycles-frontend   #    0.90% frontend cycles idle     ( +- 20.55% )  (33.33%)
        25,197,541      stalled-cycles-backend    #    0.29% backend cycles idle      ( +-  1.09% )  (33.35%)
       658,146,061      instructions              #    0.08  insn per cycle         
                                                  #    0.16  stalled cycles per insn  ( +-  0.04% )  (33.38%)
       154,867,131      branches                  #   43.539 M/sec                    ( +-  0.04% )  (33.41%)
           167,531      branch-misses             #    0.11% of all branches          ( +-  5.19% )  (33.41%)
     2,417,846,489      L1-dcache-loads           #  679.753 M/sec                    ( +-  0.01% )  (33.38%)
     1,075,531,869      L1-dcache-load-misses     #   44.49% of all L1-dcache accesses  ( +-  0.01% )  (33.35%)
        12,835,321      L1-icache-loads           #    3.609 M/sec                    ( +-  0.41% )  (33.33%)
            55,282      L1-icache-load-misses     #    0.43% of all L1-icache accesses  ( +-  1.98% )  (33.30%)
            23,287      dTLB-loads                #    6.547 K/sec                    ( +- 15.61% )  (33.29%)
             1,333      dTLB-load-misses          #    4.48% of all dTLB cache accesses  ( +-  1.26% )  (33.29%)
                 3      iTLB-loads                #    0.843 /sec                     ( +- 33.33% )  (33.29%)
               231      iTLB-load-misses          # 11550.00% of all iTLB cache accesses  ( +-  6.14% )  (33.29%)
       170,608,062      L1-dcache-prefetches      #   47.965 M/sec                    ( +-  0.84% )  (33.29%)

           3.55776 +- 0.00738 seconds time elapsed  ( +-  0.21% )


# Milan, multi-page (x86/clear_huge_page), region-sz=64g, pg-sz=1gb

 Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 2' (3 runs):

          1,718.27 msec task-clock                #    0.999 CPUs utilized            ( +-  0.08% )
                 6      context-switches          #    3.489 /sec                     ( +- 14.70% )
                 1      cpu-migrations            #    0.581 /sec                   
               227      page-faults               #  132.000 /sec                     ( +-  0.15% )
     4,176,107,493      cycles                    #    2.428 GHz                      ( +-  0.08% )  (33.19%)
         2,675,797      stalled-cycles-frontend   #    0.06% frontend cycles idle     ( +-  0.34% )  (33.25%)
       147,394,527      stalled-cycles-backend    #    3.53% backend cycles idle      ( +-  8.80% )  (33.31%)
        12,779,784      instructions              #    0.00  insn per cycle         
                                                  #   13.14  stalled cycles per insn  ( +-  0.09% )  (33.37%)
         2,428,829      branches                  #    1.412 M/sec                    ( +-  0.08% )  (33.42%)
            63,460      branch-misses             #    2.61% of all branches          ( +-  3.48% )  (33.46%)
        31,159,378      L1-dcache-loads           #   18.119 M/sec                    ( +-  3.27% )  (33.46%)
        14,692,358      L1-dcache-load-misses     #   48.21% of all L1-dcache accesses  ( +-  3.12% )  (33.46%)
         2,556,688      L1-icache-loads           #    1.487 M/sec                    ( +-  0.89% )  (33.46%)
            21,148      L1-icache-load-misses     #    0.84% of all L1-icache accesses  ( +-  0.25% )  (33.41%)
             6,114      dTLB-loads                #    3.555 K/sec                    ( +- 12.76% )  (33.35%)
             1,742      dTLB-load-misses          #   33.73% of all dTLB cache accesses  ( +- 21.79% )  (33.29%)
                45      iTLB-loads                #   26.167 /sec                     ( +-  7.52% )  (33.23%)
                90      iTLB-load-misses          #  210.94% of all iTLB cache accesses  ( +- 21.20% )  (33.17%)
           257,942      L1-dcache-prefetches      #  149.993 K/sec                    ( +-  9.84% )  (33.17%)

           1.72042 +- 0.00139 seconds time elapsed  ( +-  0.08% )
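
The source of bench/pf-test is not part of this posting. A minimal
sketch of a demand-fault loop of the kind described above, assuming an
anonymous private mapping and one write per page, could look like the
following. fault_region() and its parameters are hypothetical; passing
MAP_HUGETLB in extra_flags needs hugepages reserved beforehand:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory and write one byte every step bytes
 * so that each page is demand-faulted (and so cleared by the kernel).
 * extra_flags is OR-ed into the mmap() flags, e.g. MAP_HUGETLB; step
 * should match the page size in use.
 * Returns the number of locations touched, or -1 if mmap() fails.
 */
static long fault_region(size_t len, size_t step, int extra_flags)
{
	volatile char *p;
	size_t off;
	long touched = 0;
	void *m;

	assert(step > 0);

	m = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | extra_flags, -1, 0);
	if (m == MAP_FAILED)
		return -1;

	p = m;
	for (off = 0; off < len; off += step) {
		p[off] = 1;	/* write fault: the page is cleared first */
		touched++;
	}

	munmap(m, len);
	return touched;
}
```

Timing a call such as fault_region(len, page_size, MAP_HUGETLB) and
dividing len by the elapsed time gives a GB/s figure comparable in
spirit to the tables above; NUMA placement and CPU pinning (taskset in
the runs above) are left to the caller.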

---
 arch/x86/mm/hugetlbpage.c | 49 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 5804bbae4f01..4294b77c4f18 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -148,6 +148,55 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 		return hugetlb_get_unmapped_area_topdown(file, addr, len,
 				pgoff, flags);
 }
+
+/*
+ * This is used on all !CONFIG_HIGHMEM configurations.
+ *
+ * CONFIG_HIGHMEM configurations fall back to the __weak version.
+ */
+#ifndef CONFIG_HIGHMEM
+static void clear_contig_region(struct page *page, unsigned long vaddr,
+				unsigned int npages)
+{
+	clear_user_pages(page_address(page), vaddr, page, npages);
+}
+
+void clear_huge_page(struct page *page,
+		     unsigned long addr_hint, unsigned int pages_per_huge_page)
+{
+	unsigned long addr = addr_hint &
+		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
+	const long pgidx = (addr_hint - addr) / PAGE_SIZE;
+	const int first_pg = 0, last_pg = pages_per_huge_page - 1;
+	const int width = 2; /* pages cleared last on either side */
+	int sidx[3], eidx[3];
+	int i, n;
+
+	if (pages_per_huge_page > MAX_ORDER_NR_PAGES)
+		return clear_contig_region(page, addr, pages_per_huge_page);
+
+	/*
+	 * Neighbourhood of the fault. Cleared at the end to ensure
+	 * it sticks around in the cache.
+	 */
+	n = 2;
+	sidx[n] = (pgidx - width) < first_pg ? first_pg : (pgidx - width);
+	eidx[n] = (pgidx + width) > last_pg  ? last_pg  : (pgidx + width);
+
+	sidx[0] = first_pg;	/* Region to the left of the fault */
+	eidx[0] = sidx[n] - 1;
+
+	sidx[1] = eidx[n] + 1;	/* Region to the right of the fault */
+	eidx[1] = last_pg;
+
+	for (i = 0; i <= 2; i++) {
+		if (eidx[i] >= sidx[i])
+			clear_contig_region(page + sidx[i],
+					    addr + sidx[i] * PAGE_SIZE,
+					    eidx[i] - sidx[i] + 1);
+	}
+}
+#endif /* CONFIG_HIGHMEM */
 #endif /* CONFIG_HUGETLB_PAGE */
 
 #ifdef CONFIG_X86_64
-- 
2.31.1