Message-ID: <20200304041948.GA125774@t490s>
Date: Tue, 3 Mar 2020 23:19:48 -0500
From: Rafael Aquini <aquini@...hat.com>
To: Andrea Arcangeli <aarcange@...hat.com>
Cc: Will Deacon <will@...nel.org>,
Catalin Marinas <catalin.marinas@....com>,
Mark Salter <msalter@...hat.com>,
Jon Masters <jcm@...masters.org>, linux-kernel@...r.kernel.org,
linux-mm@...ck.org, linux-arm-kernel@...ts.infradead.org,
Michal Hocko <mhocko@...nel.org>, QI Fuli <qi.fuli@...itsu.com>
Subject: Re: [PATCH 3/3] arm64: tlb: skip tlbi broadcast
On Mon, Mar 02, 2020 at 10:24:51AM -0500, Rafael Aquini wrote:
[...]
> I'm testing these changes against RHEL integration + regression
> tests, and I'll also run them against a specially crafted test
> to measure the impact on task-switching, if any. (I'll report back,
> soon)
>
As promised, I ran the patches through a full round of kernel
integration/regression tests (the same suite we run for RHEL kernels).
Unfortunately, there is no easy way to share these internal results,
but apart from a couple of warnings -- which were not related to the test
build -- everything went very smoothly with the patches on top of
a RHEL-8 test kernel.
I also grabbed some perf numbers, using a kernel build as the benchmark.
The test system is a 1-socket, 32-core, 3.3GHz ARMv8 Ampere eMAG with 256GB of RAM.
rpmbuild spawns the build with make -j32 and, besides the stock kernel RPM,
also builds the -debug flavor RPM and all the sub-RPMs for tools and
extra modules.
* stock RHEL-8 build:
Performance counter stats for 'rpmbuild --rebuild kernel-4.18.0-184.el8.aatlb.src.rpm':
27,817,487.14 msec task-clock # 15.015 CPUs utilized
1,318,586 context-switches # 0.047 K/sec
515,342 cpu-migrations # 0.019 K/sec
68,513,346 page-faults # 0.002 M/sec
91,713,983,302,759 cycles # 3.297 GHz
49,871,167,452,081 instructions # 0.54 insn per cycle
23,801,939,026,338 cache-references # 855.647 M/sec
568,847,557,178 cache-misses # 2.390 % of all cache refs
145,305,286,469 dTLB-loads # 5.224 M/sec
752,451,698 dTLB-load-misses # 0.52% of all dTLB cache hits
1852.656905157 seconds time elapsed
26866.849105000 seconds user
965.756120000 seconds sys
* RHEL-8 kernel + Andrea's patches:
Performance counter stats for 'rpmbuild --rebuild kernel-4.18.0-184.el8.aatlb.src.rpm':
27,713,883.25 msec task-clock # 15.137 CPUs utilized
1,310,196 context-switches # 0.047 K/sec
511,909 cpu-migrations # 0.018 K/sec
68,535,178 page-faults # 0.002 M/sec
91,412,320,904,990 cycles # 3.298 GHz
49,844,016,063,738 instructions # 0.55 insn per cycle
23,795,774,331,203 cache-references # 858.623 M/sec
568,445,037,308 cache-misses # 2.389 % of all cache refs
135,868,301,669 dTLB-loads # 4.903 M/sec
746,267,581 dTLB-load-misses # 0.55% of all dTLB cache hits
1830.813507976 seconds time elapsed
26785.529337000 seconds user
943.251641000 seconds sys
I also wanted to measure the impact of the increased amount of code in
the task-switching path (in order to decide which TLB invalidation
scheme to pick), and to figure out the worst-case scenario for single
threads of execution constrained to one core and yielding the CPU to
each other. I wrote a small test (attached) and grabbed some numbers
on the same box for the sake of comparison (a simplified sketch of the
test follows the numbers below):
* stock RHEL-8 build:
Performance counter stats for './tlb-test' (1000 runs):
122.67 msec task-clock # 0.997 CPUs utilized ( +- 0.04% )
32,297 context-switches # 0.263 M/sec ( +- 0.00% )
0 cpu-migrations # 0.000 K/sec
325 page-faults # 0.003 M/sec ( +- 0.01% )
404,648,928 cycles # 3.299 GHz ( +- 0.04% )
239,856,265 instructions # 0.59 insn per cycle ( +- 0.00% )
121,189,080 cache-references # 987.964 M/sec ( +- 0.00% )
3,414,396 cache-misses # 2.817 % of all cache refs ( +- 0.05% )
2,820,754 dTLB-loads # 22.996 M/sec ( +- 0.04% )
34,545 dTLB-load-misses # 1.22% of all dTLB cache hits ( +- 6.16% )
0.1230361 +- 0.0000435 seconds time elapsed ( +- 0.04% )
* RHEL-8 kernel + Andrea's patches:
Performance counter stats for './tlb-test' (1000 runs):
125.57 msec task-clock # 0.997 CPUs utilized ( +- 0.05% )
32,244 context-switches # 0.257 M/sec ( +- 0.01% )
0 cpu-migrations # 0.000 K/sec
325 page-faults # 0.003 M/sec ( +- 0.01% )
413,692,492 cycles # 3.295 GHz ( +- 0.04% )
241,017,764 instructions # 0.58 insn per cycle ( +- 0.00% )
121,155,050 cache-references # 964.856 M/sec ( +- 0.00% )
3,512,035 cache-misses # 2.899 % of all cache refs ( +- 0.04% )
2,912,475 dTLB-loads # 23.194 M/sec ( +- 0.02% )
45,340 dTLB-load-misses # 1.56% of all dTLB cache hits ( +- 5.07% )
0.1259462 +- 0.0000634 seconds time elapsed ( +- 0.05% )
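For reference, the general idea behind the test is to pin two tasks to the
same CPU and have them hand the CPU back and forth as fast as possible, so
that the run is dominated by task switches on that single core. The sketch
below is only illustrative -- the actual code is in the attached tlb-test.c,
and the pipe-based ping-pong, the iteration count and the CPU number here are
my own assumptions rather than details taken from the attachment:

/*
 * Illustrative sketch only -- the actual test is the attached tlb-test.c.
 * Two processes are pinned to the same CPU and ping-pong one byte over a
 * pair of pipes, so every iteration forces a task switch on that core.
 * ITERATIONS and TEST_CPU are arbitrary assumptions.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define ITERATIONS 16000
#define TEST_CPU   0

static void pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	if (sched_setaffinity(0, sizeof(set), &set) < 0) {
		perror("sched_setaffinity");
		exit(1);
	}
}

int main(void)
{
	int ping[2], pong[2];
	char byte = 0;
	pid_t pid;

	if (pipe(ping) < 0 || pipe(pong) < 0) {
		perror("pipe");
		exit(1);
	}

	pid = fork();
	if (pid < 0) {
		perror("fork");
		exit(1);
	}

	if (pid == 0) {
		/* child: pinned to the same core, echoes every byte back */
		pin_to_cpu(TEST_CPU);
		for (int i = 0; i < ITERATIONS; i++) {
			if (read(ping[0], &byte, 1) != 1 ||
			    write(pong[1], &byte, 1) != 1)
				break;
		}
		_exit(0);
	}

	/* parent: shares the core, so each round trip is two task switches */
	pin_to_cpu(TEST_CPU);
	for (int i = 0; i < ITERATIONS; i++) {
		if (write(ping[1], &byte, 1) != 1 ||
		    read(pong[0], &byte, 1) != 1)
			break;
	}
	wait(NULL);
	return 0;
}

Something along these lines, run under "perf stat -r 1000 ./tlb-test" (which
is what produces the "(1000 runs)" output format above), gives numbers of the
same shape as the ones I posted, collected with the real attached test.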
AFAICS, the benchmark numbers above suggest that the changes impose
virtually no impact -- or at most a very small and non-detrimental one --
on ordinary workloads, while Andrea's benchmarks suggest that a broad range
of particular workloads will benefit considerably from the changes.
With the numbers above, added to what I've seen in our (internal)
integration tests, I'm confident in the stability of the changes.
-- Rafael
View attachment "tlb-test.c" of type "text/plain" (5593 bytes)