Date:	Wed, 3 Feb 2016 14:56:28 +0000
From:	Mel Gorman <mgorman@...hsingularity.net>
To:	Ingo Molnar <mingo@...nel.org>
Cc:	Peter Zijlstra <peterz@...radead.org>,
	Matt Fleming <matt@...eblueprint.co.uk>,
	Mike Galbraith <mgalbraith@...e.de>,
	Srikar Dronamraju <srikar@...ux.vnet.ibm.com>,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH 1/1] sched: Make schedstats a runtime tunable that is
 disabled by default v4

On Wed, Feb 03, 2016 at 01:32:46PM +0000, Mel Gorman wrote:
> > Yes, but the question is, are there true cross-CPU cache-misses? I.e. are there 
> > any 'global' (or per node) counters that we keep touching and which keep 
> > generating cache-misses?
> > 
> 
> I haven't specifically identified them as I consider the calculations for
> some of them to be expensive in their own right even without accounting for
> cache misses. Moving to per-cpu counters would not eliminate all cache misses
> as a stat updated on one CPU for a task that is woken on a separate CPU is
> still going to trigger a cache miss. Even if such counters were identified
> and moved to separate cache lines, the calculation overhead would remain.
> 

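To make that concrete, here is a minimal sketch of what a per-cpu
version of one counter could look like; the counter and helper names
are hypothetical, not from any patch:

#include <linux/percpu.h>

/* Hypothetical per-cpu schedstat counter, for illustration only. */
DEFINE_PER_CPU(unsigned long, sched_wakeup_stat);

static inline void inc_wakeup_stat(int target_cpu)
{
	/*
	 * Cheap when target_cpu is the local CPU, but a wakeup
	 * accounted against a task placed on another CPU still
	 * dirties a remote cache line, so per-cpu counters do
	 * not eliminate all cross-CPU cache misses.
	 */
	per_cpu(sched_wakeup_stat, target_cpu)++;
}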
I looked closer with perf stat to see if there was a good case for reducing
cache misses using per-cpu counters.

The workload was hackbench with pipes and twice as many processes as
there are CPUs, to generate a reasonable amount of scheduler activity.

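For reference, the figures below were gathered with something along the
following lines; the exact invocation is an assumption, though "(5 runs)"
and the cache events shown correspond to perf stat's -r and -d options:

  perf stat -d -r 5 ./hackbench -pipe 96 process 1000
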
Kernel 4.5-rc2 vanilla
 Performance counter stats for './hackbench -pipe 96 process 1000' (5 runs):

      54355.194747      task-clock (msec)         #   35.825 CPUs utilized            ( +-  0.72% )  (100.00%)
         6,654,707      context-switches          #    0.122 M/sec                    ( +-  1.56% )  (100.00%)
           376,624      cpu-migrations            #    0.007 M/sec                    ( +-  3.43% )  (100.00%)
           128,533      page-faults               #    0.002 M/sec                    ( +-  1.80% )  (100.00%)
   111,173,775,559      cycles                    #    2.045 GHz                      ( +-  0.76% )  (52.55%)
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
    87,243,428,243      instructions              #    0.78  insns per cycle          ( +-  0.38% )  (63.74%)
    17,067,078,003      branches                  #  313.992 M/sec                    ( +-  0.39% )  (61.79%)
        65,864,607      branch-misses             #    0.39% of all branches          ( +-  2.10% )  (61.51%)
    26,873,984,605      L1-dcache-loads           #  494.414 M/sec                    ( +-  0.45% )  (33.08%)
     1,531,628,468      L1-dcache-load-misses     #    5.70% of all L1-dcache hits    ( +-  1.14% )  (31.65%)
       410,990,209      LLC-loads                 #    7.561 M/sec                    ( +-  1.08% )  (31.38%)
        38,279,473      LLC-load-misses           #    9.31% of all LL-cache hits     ( +-  6.82% )  (42.35%)

       1.517251315 seconds time elapsed                                          ( +-  1.55% )

Note that the actual cache miss ratio is quite low (LLC loads are
roughly 1.5% of L1-dcache loads, and only 9.31% of those miss), which
indicates that there is potentially little to gain from using per-cpu
counters.

Kernel 4.5-rc2 plus patch that disables schedstats by default

 Performance counter stats for './hackbench -pipe 96 process 1000' (5 runs):

      51904.139186      task-clock (msec)         #   35.322 CPUs utilized            ( +-  2.07% )  (100.00%)
         5,958,009      context-switches          #    0.115 M/sec                    ( +-  5.90% )  (100.00%)
           327,235      cpu-migrations            #    0.006 M/sec                    ( +-  8.24% )  (100.00%)
           130,063      page-faults               #    0.003 M/sec                    ( +-  1.10% )  (100.00%)
   104,926,877,727      cycles                    #    2.022 GHz                      ( +-  2.12% )  (52.08%)
   <not supported>      stalled-cycles-frontend  
   <not supported>      stalled-cycles-backend   
    83,768,167,895      instructions              #    0.80  insns per cycle          ( +-  1.25% )  (63.49%)
    16,379,438,730      branches                  #  315.571 M/sec                    ( +-  1.47% )  (61.99%)
        59,841,332      branch-misses             #    0.37% of all branches          ( +-  4.60% )  (61.68%)
    25,749,569,276      L1-dcache-loads           #  496.099 M/sec                    ( +-  1.37% )  (34.08%)
     1,385,090,233      L1-dcache-load-misses     #    5.38% of all L1-dcache hits    ( +-  3.40% )  (31.88%)
       358,531,172      LLC-loads                 #    6.908 M/sec                    ( +-  4.65% )  (31.04%)
        33,476,691      LLC-load-misses           #    9.34% of all LL-cache hits     ( +-  4.95% )  (41.71%)

       1.469447783 seconds time elapsed                                          ( +-  2.23% )

Note that there is a reduction in cache misses with the patch, but it
is not a major percentage, and the miss ratio drops only slightly in
comparison to having stats enabled (L1-dcache misses fall from 5.70%
to 5.38%).

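With the patch applied the stats are not gone for good; they can be
switched back on when the data is needed. Assuming the tunable names
introduced by this series, something like:

  # enable schedstats at runtime (sysctl name assumed from this series)
  sysctl kernel.sched_schedstats=1

  # or enable from boot with a kernel command-line parameter
  schedstats=enable
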
A perf report shows a drop in cache references in functions like
ttwu_stat and [en|de]queue_entity, but it is a small percentage
overall. The same is true for the cycle count: the overall percentage
is small, but the patch eliminates that overhead entirely.
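
A per-function breakdown like that can be gathered along these lines;
the event selection here is an assumption:

  perf record -e cache-misses ./hackbench -pipe 96 process 1000
  perf report --sort=symbol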

Based on the low level of cache misses, I see no value in using
per-cpu counters as an alternative.

-- 
Mel Gorman
SUSE Labs
