linux-kernel - Re: Optimize perf stat for large number of events/cpus

Message-ID: <20191120151625.GG4007@krava>
Date:   Wed, 20 Nov 2019 16:16:25 +0100
From:   Jiri Olsa <jolsa@...hat.com>
To:     Andi Kleen <andi@...stfloor.org>
Cc:     acme@...nel.org, jolsa@...nel.org, linux-kernel@...r.kernel.org
Subject: Re: Optimize perf stat for large number of events/cpus

On Fri, Nov 15, 2019 at 09:52:17PM -0800, Andi Kleen wrote:
> [v7: Address review feedback. Fix python script problem
> reported by 0day. Drop merged patches.]
> 
> This patch kit optimizes perf stat for a large number of events 
> on systems with many CPUs and PMUs.
> 
> Some profiling shows that most of the overhead comes from doing IPIs
> to all the target CPUs. We can optimize this by using sched_setaffinity
> to set the affinity to a target CPU once and then doing
> the perf operation for all events on that CPU. This requires
> some restructuring, but cuts the set up time quite a bit.
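The loop reordering described above can be sketched roughly as follows. This is only an illustration in Python, not the code from the patches: the helper name for_each_event_per_cpu, the op callback and the injectable pin parameter are all hypothetical. The only real call assumed is os.sched_setaffinity (Linux only, pid 0 meaning the calling thread). The point is that the affinity is switched once per CPU instead of once per (event, CPU) pair, so the per-event operation is expected to run locally on the target CPU.

```python
import os


def for_each_event_per_cpu(events, cpus, op, pin=None):
    """Call op(event, cpu) for every pair, iterating CPU-major.

    The affinity is switched once per CPU, and all events are handled
    while pinned there. Returns the number of affinity switches made.
    `pin` is a hypothetical hook so the ordering can be exercised
    without touching real scheduler state.
    """
    events = list(events)
    saved = None
    if pin is None:
        # Linux-only default: pin the calling thread (pid 0) to one CPU.
        saved = os.sched_getaffinity(0)
        pin = lambda cpu: os.sched_setaffinity(0, {cpu})

    switches = 0
    try:
        # Sort and deduplicate so every CPU is visited exactly once.
        for cpu in sorted(set(cpus)):
            pin(cpu)
            switches += 1
            for event in events:
                op(event, cpu)
    finally:
        # Restore the original affinity mask if we changed it ourselves.
        if saved is not None:
            os.sched_setaffinity(0, saved)
    return switches
```

With E events and C CPUs this makes C affinity switches and E*C calls to op, in contrast to an event-major loop that reaches out to a remote CPU for each of the E*C pairs.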
> 
> In theory we could go further by parallelizing these setups
> too, but that would be much more complicated and for now just batching it
> per CPU seems to be sufficient. At some point with many more cores 
> parallelization or a better bulk perf setup API might be needed though.
> 
> In addition perf does a lot of redundant /sys accesses with
> many PMUs, which can also be expensive. This is also optimized.
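One generic way to avoid repeated reads of the same /sys file is to memoize the result per path. The sketch below shows that idea only; the cover letter does not say how the patches remove the redundant accesses, and read_once is a hypothetical helper name. It assumes the files read do not change during the run.

```python
from functools import lru_cache
from pathlib import Path


@lru_cache(maxsize=None)
def read_once(path: str) -> str:
    """Read a small text file once; later calls for the same path
    return the cached contents without touching the filesystem."""
    return Path(path).read_text().strip()
```

Calling read_once() repeatedly with the same path then costs one open/read/close in total, at the price of never noticing later changes to that file.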
> 
> On a large test case (>700 events with many weak groups) on a 94 CPU
> system I go from
> 
> real	0m8.607s
> user	0m0.550s
> sys	0m8.041s
> 
> to 
> 
> real	0m3.269s
> user	0m0.760s
> sys	0m1.694s
> 
> so shaving ~6 seconds of system time, at slightly more cost
> in perf stat itself. On a 4 socket system the savings
> are more dramatic:
> 
> real	0m15.641s
> user	0m0.873s
> sys	0m14.729s
> 
> to 
> 
> real	0m4.493s
> user	0m1.578s
> sys	0m2.444s
> 
> so 11s difference in the user visible set up time.
> 
> Also available in 
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/stat-scale-10
> 
> v1: Initial post.
> v2: Rebase. Fix some minor issues.
> v3: Rebase. Address review feedback. Fix one minor issue
> v4: Modified based on review feedback. Now it maintains
> all_cpus per evlist. There is still a need for cpu_index iteration
> to get the correct index for indexing the file descriptors.
> Fix bug with unsorted cpu maps, now they are always sorted.
> Some cleanups and refactoring.
> v5: Split patches. Redo loop iteration again. Fix cpu map
> merging for uncore. Remove duplicates from cpumaps. Add unit
> tests.
> v6: Address review feedback. Fix some bugs. Add more comments.
> Merge one invalid patch split.
> v7: Address review feedback. Fix python scripting (thanks 0day).
> Minor updates.

I posted another 2 comments, but other than that I think it's ok

I don't like it, but can't see a better way ;-) and the speedup
is really impressive

thanks,
jirka