Date:   Wed, 27 Nov 2019 12:43:05 -0300
From:   Arnaldo Carvalho de Melo <arnaldo.melo@...il.com>
To:     Andi Kleen <andi@...stfloor.org>
Cc:     jolsa@...nel.org, linux-kernel@...r.kernel.org
Subject: Re: Optimize perf stat for large number of events/cpus

On Wed, Nov 27, 2019 at 12:16:57PM -0300, Arnaldo Carvalho de Melo wrote:
> On Wed, Nov 20, 2019 at 04:15:10PM -0800, Andi Kleen wrote:
> > [v8: Address review feedback. Only changes one patch.]
> > 
> > This patch kit optimizes perf stat for a large number of events 
> > on systems with many CPUs and PMUs.
> > 
> > Some profiling shows that most of the overhead comes from doing IPIs
> > to all the target CPUs. We can optimize this by using sched_setaffinity
> > to set the affinity to a target CPU once and then doing the perf
> > operation for all events on that CPU. This requires some restructuring,
> > but cuts the setup time quite a bit.
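
[ A minimal sketch of the affinity-batching pattern described above;
  the helper below and its layout are hypothetical, not the actual
  perf tool code: ]

    #define _GNU_SOURCE
    #include <sched.h>
    #include <unistd.h>
    #include <sys/syscall.h>
    #include <linux/perf_event.h>

    /*
     * Pin the current thread to each target CPU once, then open all
     * events for that CPU locally, so the kernel does not have to
     * IPI a remote CPU for every single perf_event_open().
     */
    static int open_events_batched(struct perf_event_attr *attrs, int nr_events,
                                   const int *cpus, int nr_cpus, int *fds)
    {
            cpu_set_t mask;
            int c, e;

            for (c = 0; c < nr_cpus; c++) {
                    CPU_ZERO(&mask);
                    CPU_SET(cpus[c], &mask);
                    /* one thread migration per CPU, not one IPI per event */
                    if (sched_setaffinity(0, sizeof(mask), &mask))
                            return -1;

                    for (e = 0; e < nr_events; e++)
                            fds[c * nr_events + e] =
                                    syscall(SYS_perf_event_open, &attrs[e],
                                            -1 /* pid: all */, cpus[c],
                                            -1 /* no group */, 0UL);
            }
            return 0;
    }
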
> > 
> > In theory we could go further by parallelizing these setups
> > too, but that would be much more complicated, and for now batching
> > per CPU seems to be sufficient. At some point, with many more cores,
> > parallelization or a better bulk perf setup API might be needed.
> > 
> > In addition, perf does a lot of redundant /sys accesses when there
> > are many PMUs, which can also be expensive. This is also optimized.
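
[ A hypothetical sketch of the caching idea: memoize per-PMU sysfs
  reads (here the PMU type from /sys/bus/event_source/devices/<pmu>/type)
  so only the first lookup touches /sys; the real perf code caches
  differently: ]

    #include <stdio.h>
    #include <string.h>

    #define MAX_PMUS 64

    /* Return a PMU's perf_event_attr type, reading sysfs only once. */
    static int pmu_type(const char *name)
    {
            static struct { char name[64]; int type; } cache[MAX_PMUS];
            static int nr_cached;
            char path[256];
            FILE *f;
            int i, type = -1;

            for (i = 0; i < nr_cached; i++)
                    if (!strcmp(cache[i].name, name))
                            return cache[i].type;  /* hit: no /sys access */

            snprintf(path, sizeof(path),
                     "/sys/bus/event_source/devices/%s/type", name);
            f = fopen(path, "r");
            if (!f)
                    return -1;
            if (fscanf(f, "%d", &type) != 1)
                    type = -1;
            fclose(f);

            if (nr_cached < MAX_PMUS) {
                    snprintf(cache[nr_cached].name,
                             sizeof(cache[nr_cached].name), "%s", name);
                    cache[nr_cached].type = type;
                    nr_cached++;
            }
            return type;
    }
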
> > 
> > On a large test case (>700 events with many weak groups) on a 94-CPU
> > system I go from
> > 
> > real	0m8.607s
> > user	0m0.550s
> > sys	0m8.041s
> > 
> > to 
> > 
> > real	0m3.269s
> > user	0m0.760s
> > sys	0m1.694s
> > 
> > so shaving off ~6 seconds of system time, at a slightly higher cost
> > in perf stat itself. On a 4-socket system the savings
> > are more dramatic:
> > 
> > real	0m15.641s
> > user	0m0.873s
> > sys	0m14.729s
> > 
> > to 
> > 
> > real	0m4.493s
> > user	0m1.578s
> > sys	0m2.444s
> > 
> > so an ~11s difference in the user-visible setup time.
> 
> Applied to my local perf/core branch, now undergoing test builds on all
> the containers.

So, have you tried running 'perf test' after each cset is applied and
built?

[root@...co ~]# perf test 49
49: Event times                                           : FAILED!

I did a bisect and it ended at:

[acme@...co perf]$ git bisect good
af39eb7d060751f7f3336e0ffa713575c6bea902 is the first bad commit
commit af39eb7d060751f7f3336e0ffa713575c6bea902
Author: Andi Kleen <ak@...ux.intel.com>
Date:   Wed Nov 20 16:15:19 2019 -0800

    perf stat: Use affinity for opening events

    Restructure the event opening in perf stat to cycle through the events
    by CPU after setting affinity to that CPU.

---------

This was a surprise to me until I saw that the patch doesn't touch
just 'perf stat', as the commit log seems to indicate.

Please check this, and consider splitting the patches to help with
bisection.

I'm keeping this in a separate local branch for now; I'll leave in the
first few patches, which seem OK to go now.

- Arnaldo
