Message-ID: <20201109114813.GI2594@hirez.programming.kicks-ass.net>
Date:   Mon, 9 Nov 2020 12:48:13 +0100
From:   Peter Zijlstra <peterz@...radead.org>
To:     Andi Kleen <ak@...ux.intel.com>
Cc:     Jiri Olsa <jolsa@...hat.com>, mingo@...nel.org, acme@...nel.org,
        mark.rutland@....com, alexander.shishkin@...ux.intel.com,
        namhyung@...nel.org, linux-kernel@...r.kernel.org,
        eranian@...gle.com
Subject: Re: [PATCH v2 0/4] perf: Fix perf_event_attr::exclusive rotation

On Mon, Nov 02, 2020 at 06:41:43PM -0800, Andi Kleen wrote:
> On Mon, Nov 02, 2020 at 03:16:25PM +0100, Peter Zijlstra wrote:
> > On Sun, Nov 01, 2020 at 07:52:38PM -0800, Andi Kleen wrote:
> > > The main motivation is actually that the "multiple groups" algorithm
> > > in perf doesn't work all that great: it has quite a few cases where it
> > > starves groups or makes the wrong decisions. That is because it is a very
> > > difficult (likely NP-complete) problem and the kernel takes a lot
> > > of shortcuts to avoid spending too much time on it.
> > 
> > The event scheduling should be starvation free, except in the presence
> > of pinned events.
> > 
> > If you can show starvation without pinned events, it's a bug.
> > 
> > It will also always do equal or better than exclusive mode wrt PMU
> > utilization. Again, if it doesn't it's a bug.
> 
> Simple example (I think we've shown that one before):
> 
> (on skylake)
> $ cat /proc/sys/kernel/nmi_watchdog
> 0
> $ perf stat -e instructions,cycles,frontend_retired.latency_ge_2,frontend_retired.latency_ge_16 -a sleep 2
> 
>  Performance counter stats for 'system wide':
> 
>        654,514,990      instructions              #    0.34  insn per cycle           (50.67%)
>      1,924,297,028      cycles                                                        (74.28%)
>         21,708,935      frontend_retired.latency_ge_2                                     (75.01%)
>          1,769,952      frontend_retired.latency_ge_16                                     (24.99%)
> 
>        2.002426541 seconds time elapsed
> 
> Both frontend_retired events should be getting 50% and the fixed events should be getting
> 100%. So several events are starved.

*should* how? Also, nothing is 0% so nothing is getting starved.
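
The percentage in parentheses at the end of each perf stat line is the share of the measurement interval during which that event was actually on the PMU. As a minimal sketch of the check described above, the following parses the Skylake output quoted earlier and lists any event at 0%. The helper names `parse_running_pct` and `starved` are hypothetical and not part of perf, and treating exactly 0% as "starved" is the definition used in this reply, not a perf API.

```python
import re

# Sample lines copied from the Skylake `perf stat` output quoted above.
PERF_OUTPUT = """\
       654,514,990      instructions              #    0.34  insn per cycle           (50.67%)
     1,924,297,028      cycles                                                        (74.28%)
        21,708,935      frontend_retired.latency_ge_2                                     (75.01%)
         1,769,952      frontend_retired.latency_ge_16                                     (24.99%)
"""

# Count, event name, optional comment, then "(NN.NN%)" at the end of the line.
LINE_RE = re.compile(r"^\s*([\d,]+)\s+(\S+).*\((\d+(?:\.\d+)?)%\)\s*$")


def parse_running_pct(text):
    """Return {event name: percentage of time the event was scheduled}."""
    result = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            result[match.group(2)] = float(match.group(3))
    return result


def starved(pcts):
    """Return the events that never ran, i.e. those reported at 0%."""
    return [name for name, pct in pcts.items() if pct == 0.0]


pcts = parse_running_pct(PERF_OUTPUT)
for name, pct in pcts.items():
    print(f"{name:35s} {pct:6.2f}%")
print("starved:", starved(pcts))
```

By this definition the list comes back empty for the Skylake run: the events get unequal shares, but each of them is scheduled for part of the interval.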

> Another similar example is trying to schedule the topdown events on Icelake in parallel to other
> groups. It works with one extra group, but breaks with two.
> 
> (on icelake)
> $ cat /proc/sys/kernel/nmi_watchdog
> 0
> $ perf stat -e '{slots,topdown-bad-spec,topdown-be-bound,topdown-fe-bound,topdown-retiring},{branches,branches,branches,branches,branches,branches,branches,branches},{branches,branches,branches,branches,branches,branches,branches,branches}' -a sleep 1
> 
>  Performance counter stats for 'system wide':
> 
>         71,229,087      slots                                                         (60.65%)
>          5,066,320      topdown-bad-spec          #      7.1% bad speculation         (60.65%)
>         35,080,387      topdown-be-bound          #     49.2% backend bound           (60.65%)
>         22,769,750      topdown-fe-bound          #     32.0% frontend bound          (60.65%)
>          8,336,760      topdown-retiring          #     11.7% retiring                (60.65%)
>            424,584      branches                                                      (70.00%)
>            424,584      branches                                                      (70.00%)
>            424,584      branches                                                      (70.00%)
>            424,584      branches                                                      (70.00%)
>            424,584      branches                                                      (70.00%)
>            424,584      branches                                                      (70.00%)
>            424,584      branches                                                      (70.00%)
>            424,584      branches                                                      (70.00%)
>          3,634,075      branches                                                      (30.00%)
>          3,634,075      branches                                                      (30.00%)
>          3,634,075      branches                                                      (30.00%)
>          3,634,075      branches                                                      (30.00%)
>          3,634,075      branches                                                      (30.00%)
>          3,634,075      branches                                                      (30.00%)
>          3,634,075      branches                                                      (30.00%)
>          3,634,075      branches                                                      (30.00%)
> 
>        1.001312511 seconds time elapsed
> 
> A tool using exclusive hopefully will be able to do better than this.

I don't see how; exclusive will always result in equal or worse PMU
utilization, never better.
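
One way to read this claim against the Icelake numbers quoted above, as a rough sketch rather than a statement about the scheduler's internals: an exclusive group is the only group on the PMU while it is scheduled, so if all three groups were exclusive their running percentages could sum to at most 100%. The arithmetic below adds up the observed percentages. The group labels and the helper `total_running_pct` are hypothetical names used for illustration only.

```python
# Running percentages of the three groups, copied from the Icelake
# `perf stat` output quoted above (one value per group).
group_running_pct = {
    "topdown group": 60.65,
    "first branches group": 70.00,
    "second branches group": 30.00,
}


def total_running_pct(groups):
    """Sum of the per-group running percentages."""
    return sum(groups.values())


total = total_running_pct(group_running_pct)

# Assumption: with every group exclusive, only one group occupies the PMU
# at a time, so the total could not exceed 100%. Anything above that is
# time in which two groups were co-scheduled.
overlap = max(0.0, total - 100.0)

print(f"sum of running percentages: {total:.2f}%")
print(f"time with groups co-scheduled (at least): {overlap:.2f}%")
```

A sum above 100% means the non-exclusive scheduling ran groups side by side for part of the interval, which exclusive groups could not do.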