Message-Id: <20191121001522.180827-1-andi@firstfloor.org>
Date: Wed, 20 Nov 2019 16:15:10 -0800
From: Andi Kleen <andi@...stfloor.org>
To: acme@...nel.org
Cc: jolsa@...nel.org, linux-kernel@...r.kernel.org
Subject: Optimize perf stat for large number of events/cpus
[v8: Address review feedback. Only changes one patch.]
This patch kit optimizes perf stat for a large number of events
on systems with many CPUs and PMUs.
Profiling shows that most of the overhead comes from the IPIs sent to
all the target CPUs. We can avoid most of them by using sched_setaffinity
to set the affinity to a target CPU once and then doing
the perf operations for all events on that CPU. This requires
some restructuring, but cuts the setup time quite a bit.
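
A minimal sketch of the per-CPU batching idea, not the code from the
patches. The names pin_to_cpu, for_each_cpu_batched, event_op_fn and
count_op are hypothetical; the callback stands in for whatever perf
operation (open, enable, read, close) is done per event. It assumes Linux
with glibc and fewer than CPU_SETSIZE CPUs; larger systems would need a
dynamically sized mask from CPU_ALLOC.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical callback: one perf operation for event `ev` on `cpu`. */
typedef int (*event_op_fn)(int cpu, int ev, void *ctx);

/* Restrict the calling thread to a single CPU. */
static int pin_to_cpu(int cpu)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	return sched_setaffinity(0, sizeof(set), &set);
}

/*
 * Outer loop over CPUs, inner loop over events: the affinity is
 * changed once per CPU instead of once per (event, CPU) pair, so the
 * per-event operations run locally on the target CPU.
 */
static int for_each_cpu_batched(const int *cpus, int nr_cpus,
				int nr_events, event_op_fn op, void *ctx)
{
	cpu_set_t saved;
	int ret = 0;

	assert(cpus != NULL && op != NULL);

	/* Remember the original mask so it can be restored afterwards. */
	if (sched_getaffinity(0, sizeof(saved), &saved) < 0)
		return -1;

	for (int i = 0; i < nr_cpus && ret == 0; i++) {
		if (pin_to_cpu(cpus[i]) < 0) {
			ret = -1;
			break;
		}
		for (int ev = 0; ev < nr_events; ev++) {
			ret = op(cpus[i], ev, ctx);
			if (ret)
				break;
		}
	}

	/* Best effort: put the original affinity back. */
	sched_setaffinity(0, sizeof(saved), &saved);
	return ret;
}

/* Example callback: only counts how often it was invoked. */
static int count_op(int cpu, int ev, void *ctx)
{
	(void)cpu;
	(void)ev;
	++*(int *)ctx;
	return 0;
}
```

A caller would pass the list of target CPUs and a callback that does
the real work, for example for_each_cpu_batched(cpus, nr_cpus,
nr_events, count_op, &calls).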
In theory we could go further by parallelizing these setups
too, but that would be much more complicated, and for now just batching
per CPU seems to be sufficient. At some point, with many more cores,
parallelization or a better bulk perf setup API might be needed though.
In addition, perf does a lot of redundant /sys accesses with
many PMUs, which can also be expensive. This is also optimized.
On a large test case (>700 events with many weak groups) on a 94 CPU
system I go from
real 0m8.607s
user 0m0.550s
sys 0m8.041s
to
real 0m3.269s
user 0m0.760s
sys 0m1.694s
so shaving ~6 seconds of system time, at slightly more cost
in perf stat itself. On a 4 socket system the savings
are more dramatic:
real 0m15.641s
user 0m0.873s
sys 0m14.729s
to
real 0m4.493s
user 0m1.578s
sys 0m2.444s
so an 11s difference in the user-visible setup time.
Also available in
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-misc perf/stat-scale-11
v1: Initial post.
v2: Rebase. Fix some minor issues.
v3: Rebase. Address review feedback. Fix one minor issue.
v4: Modified based on review feedback. Now it maintains
all_cpus per evlist. There is still a need for cpu_index iteration
to get the correct index for indexing the file descriptors.
Fix bug with unsorted cpu maps, now they are always sorted.
Some cleanups and refactoring.
v5: Split patches. Redo loop iteration again. Fix cpu map
merging for uncore. Remove duplicates from cpumaps. Add unit
tests.
v6: Address review feedback. Fix some bugs. Add more comments.
Merge one invalid patch split.
v7: Address review feedback. Fix python scripting (thanks 0day)
Minor updates.
v8: Address review feedback.
-Andi