Message-ID: <54D96EEF.40302@codeaurora.org>
Date: Mon, 09 Feb 2015 18:37:35 -0800
From: Stephen Boyd <sboyd@...eaurora.org>
To: Daniel Thompson <daniel.thompson@...aro.org>,
Will Deacon <will.deacon@....com>
CC: Thomas Gleixner <tglx@...utronix.de>,
John Stultz <john.stultz@...aro.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"patches@...aro.org" <patches@...aro.org>,
"linaro-kernel@...ts.linaro.org" <linaro-kernel@...ts.linaro.org>,
Sumit Semwal <sumit.semwal@...aro.org>,
Steven Rostedt <rostedt@...dmis.org>,
Russell King <linux@....linux.org.uk>,
Catalin Marinas <Catalin.Marinas@....com>
Subject: Re: [PATCH v4 2/5] sched_clock: Optimize cache line usage
On 02/09/15 01:47, Daniel Thompson wrote:
> On 09/02/15 09:28, Will Deacon wrote:
>> On Sun, Feb 08, 2015 at 12:02:37PM +0000, Daniel Thompson wrote:
>>> Currently sched_clock(), a very hot code path, is not optimized to
>>> minimise its cache profile. In particular:
>>>
>>> 1. cd is not ____cacheline_aligned,
>>>
>>> 2. struct clock_data does not distinguish between hotpath and
>>> coldpath data, reducing locality of reference in the hotpath,
>>>
>>> 3. Some hotpath data is missing from struct clock_data and is marked
>>> __read_mostly (which more or less guarantees it will not share a
>>> cache line with cd).
>>>
>>> This patch corrects these problems by extracting all hotpath data
>>> into a separate structure and using ____cacheline_aligned to ensure
>>> the hotpath uses a single (64 byte) cache line.
>> Have you got any performance figures for this change, or is this just a
>> theoretical optimisation? It would be interesting to see what effect this
>> has on systems with 32-byte cachelines and also scenarios where there's
>> contention on the sequence counter.
> Most of my testing has focused on proving the NMI safety parts of the
> patch work as advertised, so it's mostly theoretical.
>
> However, there are some numbers from simple tight-loop calls to
> sched_clock (Stephen Boyd's results are more interesting than mine
> because I observe pretty wild quantization effects that render the
> results hard to trust):
> http://thread.gmane.org/gmane.linux.kernel/1871157/focus=1879265
>
> Not sure what figures would be useful for a contended sequence
> counter. Firstly, the counter is taken for write at 7/8 of the wrap
> time, so even for the fastest timers the interval is likely to be >3s,
> and the write itself is of very short duration. Additionally, the NMI
> safety changes make it possible to read the timer whilst it is being
> updated, so it is only during the very short
> struct-copy/write/struct-copy/write update sequence that we will
> observe the extra cache line used for a read. Benchmarks that show the
> effect of the update are therefore non-trivial to construct.
>
Here are the raw numbers for the tight loop. I noticed that if I don't use
perf I get a larger number of calls per 10s, most likely because we
aren't doing anything else. These are very lightly loaded systems, i.e. a
busybox ramdisk with nothing going on. Kernel version is v3.19-rc4. The
CPU is Krait on msm8960 and msm8974, except that msm8974 has the ARM
architected timers backing sched_clock() vs. our own custom timer IP on
msm8960. The cache line size is 64 bytes. I also ran it on msm8660, which
is a Scorpion CPU with the same timer as msm8960 (custom timer IP) and a
cache line size of 32 bytes. Unfortunately nobody has ported Scorpion
over to perf events, so we don't get hardware events.
msm8960 (before patch)
----------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 14528449 calls in 10000000290 ns
Made 14528925 calls in 10000000142 ns
Made 14524549 calls in 10000000587 ns
Made 14528164 calls in 10000000734 ns
Made 14524468 calls in 10000000290 ns
Made 14527198 calls in 10000000438 ns
Made 14523508 calls in 10000000734 ns
Made 14527894 calls in 10000000290 ns
Made 14529609 calls in 10000000734 ns
Made 14523114 calls in 10000000142 ns
Performance counter stats for 'modprobe sched_clock_test' (10 runs):
10009.635016 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
7 context-switches # 0.001 K/sec ( +- 16.16% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
4003806350 cycles # 0.400 GHz ( +- 0.00% )
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
921924235 instructions # 0.23 insns per cycle ( +- 0.01% )
0 branches # 0.000 K/sec
58521151 branch-misses # 5.846 M/sec ( +- 0.01% )
10.011767657 seconds time elapsed ( +- 0.00% )
msm8960 (after patch)
---------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 19626366 calls in 10000000587 ns
Made 19623708 calls in 10000000142 ns
Made 19623282 calls in 10000000290 ns
Made 19625304 calls in 10000000290 ns
Made 19625151 calls in 10000000291 ns
Made 19624906 calls in 10000000290 ns
Made 19625383 calls in 10000000143 ns
Made 19625235 calls in 10000000290 ns
Made 19624969 calls in 10000000290 ns
Made 19625209 calls in 10000000438 ns
Performance counter stats for 'modprobe sched_clock_test' (10 runs):
10009.883401 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
7 context-switches # 0.001 K/sec ( +- 15.88% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
4003901511 cycles # 0.400 GHz ( +- 0.00% )
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
1164635790 instructions # 0.29 insns per cycle ( +- 0.00% )
0 branches # 0.000 K/sec
20039814 branch-misses # 2.002 M/sec ( +- 0.00% )
10.012092383 seconds time elapsed ( +- 0.00% )
msm8974 (before patch)
----------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 21289694 calls in 10000000083 ns
Made 21289072 calls in 10000000082 ns
Made 21289550 calls in 10000000395 ns
Made 21288892 calls in 10000000291 ns
Made 21288987 calls in 10000000135 ns
Made 21289140 calls in 10000000395 ns
Made 21289161 calls in 10000000395 ns
Made 21288911 calls in 10000000239 ns
Made 21289204 calls in 10000000135 ns
Made 21288738 calls in 10000000135 ns
Performance counter stats for 'modprobe sched_clock_test' (10 runs):
10003.839348 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
4 context-switches # 0.000 K/sec ( +- 3.70% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
6146323757 cycles # 0.614 GHz ( +- 0.00% )
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
1155527762 instructions # 0.19 insns per cycle ( +- 0.00% )
107186099 branches # 10.714 M/sec ( +- 0.00% )
35548359 branch-misses # 33.17% of all branches ( +- 0.00% )
10.004769053 seconds time elapsed ( +- 0.00% )
msm8974 (after patch)
---------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 21289357 calls in 10000000239 ns
Made 21384961 calls in 10000000396 ns
Made 22105925 calls in 10000000238 ns
Made 27384126 calls in 10000000239 ns
Made 22107737 calls in 10000000134 ns
Made 21368867 calls in 10000000239 ns
Made 22106065 calls in 10000000395 ns
Made 27384196 calls in 10000000083 ns
Made 22107334 calls in 10000000291 ns
Made 21365426 calls in 10000000291 ns
Performance counter stats for 'modprobe sched_clock_test' (10 runs):
10003.753333 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
7 context-switches # 0.001 K/sec ( +- 18.18% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
6837664600 cycles # 0.684 GHz ( +- 6.74% )
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
1148993903 instructions # 0.17 insns per cycle ( +- 3.32% )
115049358 branches # 11.501 M/sec ( +- 3.31% )
42520513 branch-misses # 36.96% of all branches ( +- 5.00% )
10.004769533 seconds time elapsed ( +- 0.00% )
msm8660 (before patch)
----------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 14099029 calls in 10000000586 ns
Made 14099227 calls in 10000000735 ns
Made 14098763 calls in 10000000439 ns
Made 14099042 calls in 10000000291 ns
Made 14099273 calls in 10000000290 ns
Made 14100377 calls in 10000000586 ns
Made 14100183 calls in 10000000586 ns
Made 14099220 calls in 10000000586 ns
Made 14098853 calls in 10000000587 ns
Made 14099368 calls in 10000000142 ns
Performance counter stats for 'modprobe sched_clock_test' (10 runs):
10006.700528 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
11 context-switches # 0.001 K/sec ( +- 10.38% )
0 cpu-migrations # 0.000 K/sec
56 page-faults # 0.006 K/sec
0 cycles # 0.000 GHz
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
0 instructions
0 branches # 0.000 K/sec
0 branch-misses # 0.000 K/sec
10.008796161 seconds time elapsed ( +- 0.00% )
msm8660 (after patch)
---------------------
# perf stat -r 10 --post "rmmod sched_clock_test" modprobe sched_clock_test
Made 20555901 calls in 10000000438 ns
Made 15510019 calls in 10000000142 ns
Made 15510371 calls in 10000000587 ns
Made 15509184 calls in 10000000439 ns
Made 15509068 calls in 10000000291 ns
Made 15510719 calls in 10000000439 ns
Made 15508899 calls in 10000000291 ns
Made 15509206 calls in 10000000587 ns
Made 15509057 calls in 10000000290 ns
Made 15509178 calls in 10000000735 ns
Performance counter stats for 'modprobe sched_clock_test' (10 runs):
10009.491416 task-clock (msec) # 1.000 CPUs utilized ( +- 0.00% )
13 context-switches # 0.001 K/sec ( +- 10.82% )
0 cpu-migrations # 0.000 K/sec
58 page-faults # 0.006 K/sec
0 cycles # 0.000 GHz
0 stalled-cycles-frontend # 0.00% frontend cycles idle
0 stalled-cycles-backend # 0.00% backend cycles idle
0 instructions
0 branches # 0.000 K/sec
0 branch-misses # 0.000 K/sec
10.011834087 seconds time elapsed ( +- 0.00% )
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project