Message-ID: <0cc7c04b-825d-2c5a-2afe-c52c90554223@intel.com>
Date: Mon, 29 Aug 2022 16:08:10 +0800
From: kernel test robot <yujie.liu@...el.com>
To: Nadav Amit <namit@...are.com>
CC: <lkp@...ts.01.org>, <lkp@...el.com>,
Dave Hansen <dave.hansen@...ux.intel.com>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>,
Andy Lutomirski <luto@...nel.org>,
<linux-kernel@...r.kernel.org>, <ying.huang@...el.com>,
<feng.tang@...el.com>, <zhengjun.xing@...ux.intel.com>,
<fengwei.yin@...el.com>
Subject: [x86/mm/tlb] aa44284960: will-it-scale.per_thread_ops 12.8% improvement
Greetings,
FYI, we noticed a 12.8% improvement of will-it-scale.per_thread_ops due to commit:
commit: aa44284960d550eb4d8614afdffebc68a432a9b4 ("x86/mm/tlb: Avoid reading mm_tlb_gen when possible")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
in testcase: will-it-scale
on test machine: 144 threads 4 sockets Intel(R) Xeon(R) Gold 5318H CPU @ 2.50GHz (Cooper Lake) with 128G memory
with following parameters:
nr_task: 50%
mode: thread
test: tlb_flush3
cpufreq_governor: performance
ucode: 0x7002501
test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase/ucode:
gcc-11/performance/x86_64-rhel-8.3/thread/50%/debian-11.1-x86_64-20220510.cgz/lkp-cpl-4sp1/tlb_flush3/will-it-scale/0x7002501
commit:
e19d11267f ("x86/mm: Use PAGE_ALIGNED(x) instead of IS_ALIGNED(x, PAGE_SIZE)")
aa44284960 ("x86/mm/tlb: Avoid reading mm_tlb_gen when possible")
e19d11267f0e6c8a aa44284960d550eb4d8614afdff
---------------- ---------------------------
base ±%stddev %change patched ±%stddev
511972 +12.8% 577452 will-it-scale.72.threads
7110 +12.8% 8019 will-it-scale.per_thread_ops
511972 +12.8% 577452 will-it-scale.workload
29.88 ± 23% +8.2 38.07 mpstat.cpu.all.sys%
0.57 ± 22% +0.2 0.78 ± 5% mpstat.cpu.all.usr%
76693 -0.8% 76064 proc-vmstat.nr_slab_unreclaimable
1.693e+08 +12.2% 1.9e+08 proc-vmstat.pgfault
-967489 +88.9% -1827960 sched_debug.cfs_rq:/.spread0.min
10581 ± 15% -36.2% 6751 ± 9% sched_debug.cpu.clock_task.stddev
5957 ± 17% +16.7% 6952 vmstat.system.cs
5247717 ± 22% +15.1% 6039997 vmstat.system.in
1443105 ± 2% +325.4% 6139401 ±147% turbostat.C1
2.359e+10 +12.1% 2.645e+10 turbostat.IRQ
12.94 ± 15% +10.6% 14.31 turbostat.RAMWatt
9.42 ± 4% -8.5 0.94 ± 36% perf-profile.calltrace.cycles-pp.flush_tlb_func.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function
8.85 ± 4% -4.5 4.32 ± 9% perf-profile.calltrace.cycles-pp.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range
8.76 ± 4% -4.5 4.24 ± 9% perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond
8.76 ± 4% -4.5 4.26 ± 9% perf-profile.calltrace.cycles-pp.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask
4.63 ± 4% -2.3 2.29 ± 9% perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.zap_pte_range
4.60 ± 4% -2.3 2.33 ± 10% perf-profile.calltrace.cycles-pp.asm_sysvec_call_function.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu
10.06 ± 6% -2.2 7.87 ± 10% perf-profile.calltrace.cycles-pp.__flush_smp_call_function_queue.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.llist_add_batch
10.08 ± 5% -2.2 7.91 ± 10% perf-profile.calltrace.cycles-pp.__sysvec_call_function.sysvec_call_function.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond
10.18 ± 5% -2.1 8.04 ± 9% perf-profile.calltrace.cycles-pp.sysvec_call_function.asm_sysvec_call_function.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask
15.87 ± 7% +3.0 18.91 ± 10% perf-profile.calltrace.cycles-pp.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.tlb_finish_mmu
17.09 ± 7% +3.8 20.93 ± 10% perf-profile.calltrace.cycles-pp.llist_add_batch.smp_call_function_many_cond.on_each_cpu_cond_mask.flush_tlb_mm_range.zap_pte_range
12.41 ± 4% -10.3 2.14 ± 10% perf-profile.children.cycles-pp.flush_tlb_func
23.01 ± 4% -8.8 14.19 ± 9% perf-profile.children.cycles-pp.__flush_smp_call_function_queue
22.79 ± 4% -8.7 14.08 ± 9% perf-profile.children.cycles-pp.sysvec_call_function
22.56 ± 4% -8.7 13.85 ± 9% perf-profile.children.cycles-pp.__sysvec_call_function
24.02 ± 4% -8.6 15.41 ± 9% perf-profile.children.cycles-pp.asm_sysvec_call_function
1.29 ± 6% -1.0 0.31 ± 8% perf-profile.children.cycles-pp.native_flush_tlb_local
0.82 ± 13% -0.5 0.28 ± 16% perf-profile.children.cycles-pp.sync_mm_rss
0.54 ± 10% -0.1 0.44 ± 10% perf-profile.children.cycles-pp._find_next_bit
0.12 ± 42% -0.1 0.04 ± 88% perf-profile.children.cycles-pp.cpumask_any_but
0.21 ± 7% +0.1 0.29 ± 14% perf-profile.children.cycles-pp.tlb_gather_mmu
33.53 ± 7% +7.1 40.62 ± 10% perf-profile.children.cycles-pp.llist_add_batch
11.12 ± 5% -9.3 1.79 ± 10% perf-profile.self.cycles-pp.flush_tlb_func
1.26 ± 5% -1.0 0.30 ± 9% perf-profile.self.cycles-pp.native_flush_tlb_local
0.87 ± 4% -0.6 0.27 ± 18% perf-profile.self.cycles-pp.flush_tlb_mm_range
0.52 ± 13% -0.3 0.22 ± 16% perf-profile.self.cycles-pp.sync_mm_rss
0.14 ± 10% -0.0 0.10 ± 5% perf-profile.self.cycles-pp.zap_pte_range
0.20 ± 9% +0.1 0.26 ± 9% perf-profile.self.cycles-pp.unmap_page_range
0.24 ± 9% +0.1 0.32 ± 10% perf-profile.self.cycles-pp.down_read
0.13 ± 7% +0.1 0.22 ± 15% perf-profile.self.cycles-pp.tlb_gather_mmu
0.35 ± 10% +0.1 0.45 ± 9% perf-profile.self.cycles-pp.up_read
0.26 ± 9% +0.1 0.37 ± 12% perf-profile.self.cycles-pp.down_read_trylock
0.33 ± 8% +0.1 0.46 ± 12% perf-profile.self.cycles-pp.__handle_mm_fault
22.59 ± 8% +9.0 31.63 ± 10% perf-profile.self.cycles-pp.llist_add_batch
4.618e+09 ± 22% +15.6% 5.338e+09 perf-stat.i.branch-instructions
74833717 ± 15% +18.9% 88940037 perf-stat.i.branch-misses
2.701e+08 ± 23% +21.6% 3.283e+08 perf-stat.i.cache-misses
6.188e+08 ± 19% +20.8% 7.473e+08 perf-stat.i.cache-references
5907 ± 17% +17.0% 6911 perf-stat.i.context-switches
949.85 ± 36% -24.2% 720.07 perf-stat.i.cycles-between-cache-misses
3624697 ± 8% +43.7% 5209027 ± 23% perf-stat.i.dTLB-load-misses
5.649e+09 ± 22% +19.8% 6.77e+09 perf-stat.i.dTLB-loads
1837493 ± 18% +16.8% 2145673 perf-stat.i.dTLB-store-misses
2.675e+09 ± 22% +17.8% 3.151e+09 perf-stat.i.dTLB-stores
83.96 ± 10% +7.4 91.35 perf-stat.i.iTLB-load-miss-rate%
17251959 ± 21% +66.1% 28646920 ± 2% perf-stat.i.iTLB-load-misses
2371482 ± 9% +12.0% 2655934 ± 2% perf-stat.i.iTLB-loads
2.082e+10 ± 22% +17.4% 2.444e+10 perf-stat.i.instructions
1222 ± 4% -29.1% 866.58 ± 2% perf-stat.i.instructions-per-iTLB-miss
95.27 ± 22% +18.2% 112.59 perf-stat.i.metric.M/sec
491703 ± 23% +24.4% 611915 perf-stat.i.minor-faults
1.716e+08 ± 23% +22.8% 2.107e+08 perf-stat.i.node-load-misses
52413373 ± 23% +32.1% 69220689 perf-stat.i.node-store-misses
516138 ± 22% +22.5% 632466 perf-stat.i.node-stores
498600 ± 23% +24.2% 619485 perf-stat.i.page-faults
10.30 -6.2% 9.66 perf-stat.overall.cpi
798.02 ± 2% -9.8% 719.66 perf-stat.overall.cycles-between-cache-misses
87.48 ± 2% +4.0 91.50 perf-stat.overall.iTLB-load-miss-rate%
1203 -29.0% 854.77 ± 2% perf-stat.overall.instructions-per-iTLB-miss
0.10 +6.7% 0.10 perf-stat.overall.ipc
13671646 ± 2% -6.3% 12814760 perf-stat.overall.path-length
4.608e+09 ± 22% +15.5% 5.321e+09 perf-stat.ps.branch-instructions
74674366 ± 15% +18.8% 88686188 perf-stat.ps.branch-misses
2.694e+08 ± 23% +21.4% 3.272e+08 perf-stat.ps.cache-misses
6.173e+08 ± 19% +20.7% 7.448e+08 perf-stat.ps.cache-references
5889 ± 17% +16.9% 6883 perf-stat.ps.context-switches
3615042 ± 8% +43.7% 5193424 ± 23% perf-stat.ps.dTLB-load-misses
5.637e+09 ± 22% +19.7% 6.748e+09 perf-stat.ps.dTLB-loads
1833221 ± 17% +16.7% 2138714 perf-stat.ps.dTLB-store-misses
2.669e+09 ± 22% +17.7% 3.141e+09 perf-stat.ps.dTLB-stores
17219686 ± 21% +65.7% 28530220 ± 2% perf-stat.ps.iTLB-load-misses
2364495 ± 8% +11.9% 2646506 ± 2% perf-stat.ps.iTLB-loads
2.077e+10 ± 22% +17.3% 2.437e+10 perf-stat.ps.instructions
490514 ± 23% +24.3% 609737 perf-stat.ps.minor-faults
1.712e+08 ± 23% +22.6% 2.1e+08 perf-stat.ps.node-load-misses
52293687 ± 23% +31.9% 68974375 perf-stat.ps.node-store-misses
515271 ± 21% +22.4% 630779 perf-stat.ps.node-stores
497546 ± 23% +24.1% 617507 perf-stat.ps.page-faults
6.998e+12 +5.7% 7.4e+12 perf-stat.total.instructions
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file
# if you come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://01.org/lkp
View attachment "config-5.18.0-01257-gaa44284960d5" of type "text/plain" (169430 bytes)
View attachment "job-script" of type "text/plain" (8365 bytes)
View attachment "job.yaml" of type "text/plain" (5728 bytes)
View attachment "reproduce" of type "text/plain" (361 bytes)