Message-ID: <20191127151657.GE22719@kernel.org>
Date: Wed, 27 Nov 2019 12:16:57 -0300
From: Arnaldo Carvalho de Melo <arnaldo.melo@...il.com>
To: Andi Kleen <andi@...stfloor.org>
Cc: jolsa@...nel.org, linux-kernel@...r.kernel.org
Subject: Re: Optimize perf stat for large number of events/cpus
On Wed, Nov 20, 2019 at 04:15:10PM -0800, Andi Kleen wrote:
> [v8: Address review feedback. Only changes one patch.]
>
> This patch kit optimizes perf stat for a large number of events
> on systems with many CPUs and PMUs.
>
> Profiling shows that most of the overhead comes from sending IPIs to
> all the target CPUs. We can optimize this by using sched_setaffinity
> to set the affinity to a target CPU once and then doing
> the perf operation for all events on that CPU. This requires
> some restructuring, but cuts the setup time quite a bit.
>
> In theory we could go further by parallelizing these setups
> too, but that would be much more complicated and for now just batching it
> per CPU seems to be sufficient. At some point with many more cores
> parallelization or a better bulk perf setup API might be needed though.
>
> In addition, perf does a lot of redundant /sys accesses with
> many PMUs, which can also be expensive. This is optimized as well.
>
> On a large test case (>700 events with many weak groups) on a 94 CPU
> system I go from
>
> real 0m8.607s
> user 0m0.550s
> sys 0m8.041s
>
> to
>
> real 0m3.269s
> user 0m0.760s
> sys 0m1.694s
>
> so shaving ~6 seconds of system time, at slightly more cost
> in perf stat itself. On a 4 socket system the savings
> are more dramatic:
>
> real 0m15.641s
> user 0m0.873s
> sys 0m14.729s
>
> to
>
> real 0m4.493s
> user 0m1.578s
> sys 0m2.444s
>
> so 11s difference in the user-visible setup time.
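For readers following along, the per-CPU batching described in the cover letter can be sketched roughly as below. This is only an illustration of the loop order, not the code from the patch kit: `open_events_batched`, `open_event` and `pin` are hypothetical names, and `open_event(event, cpu)` stands in for whatever perf operation is done per event and CPU. The only real API used is Python's `os.sched_setaffinity` (Linux only), which wraps the same system call.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the calling process to a single CPU (Linux only)."""
    os.sched_setaffinity(0, {cpu})


def open_events_batched(cpus, events, open_event, pin=pin_to_cpu):
    """Run open_event for every (event, cpu) pair, CPU-major.

    The affinity is changed once per CPU, and all events for that CPU
    are handled while the process is pinned there, instead of touching
    every CPU again for each event.

    open_event and pin are hypothetical callbacks used for illustration.
    """
    handles = {}
    for cpu in cpus:
        pin(cpu)  # one affinity switch per CPU
        for event in events:
            handles[(event, cpu)] = open_event(event, cpu)
    return handles
```

With N CPUs and M events this does N affinity switches rather than one cross-CPU operation for each of the N*M pairs. A real caller would also save the original mask with `os.sched_getaffinity(0)` and restore it afterwards.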
Applied to my local perf/core branch, now undergoing test builds on all
the containers.
Thanks,
- Arnaldo