Message-ID: <20241220133359.GC17537@noisy.programming.kicks-ass.net>
Date: Fri, 20 Dec 2024 14:33:59 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: Yeoreum Yun <yeoreum.yun@....com>
Cc: mingo@...hat.com, acme@...nel.org, namhyung@...nel.org,
mark.rutland@....com, alexander.shishkin@...ux.intel.com,
jolsa@...nel.org, irogers@...gle.com, adrian.hunter@...el.com,
kan.liang@...ux.intel.com, linux-perf-users@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] events/core: fix account failure for event's
total_enable_time
On Fri, Dec 20, 2024 at 10:02:02AM +0000, Yeoreum Yun wrote:
> The perf core code fails to account for an event's total_enable_time
> when its state is inactive.
>
> Here is an error case showing the failure to account for total_enable_time
> for a core event:
>
> sudo ./perf stat -vvv -e armv8_pmuv3_0/event=0x08/ -e armv8_pmuv3_1/event=0x08/ -- stress-ng --pthread=2 -t 2s
> ...
> armv8_pmuv3_0/event=0x08/: 1138698008 2289429840 2174835740
> armv8_pmuv3_1/event=0x08/: 1826791390 1950025700 847648440
^^^^^
So *that* number (e1's enable time, 1950025700) is wrong and should have
been equal to the number above it (e0's 2289429840).
>
> Performance counter stats for 'stress-ng --pthread=2 -t 2s':
>
> 1,138,698,008 armv8_pmuv3_0/event=0x08/ (94.99%)
> 1,826,791,390 armv8_pmuv3_1/event=0x08/ (43.47%)
>
> Since the above two events belong to the same task context and are
> mutually exclusive per CPU (they cannot be active at the same time on the same CPU),
> their total_enable_time should be the same (the 2nd of the 3 numbers)
> and the ratios should sum to 100%.
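For reference, the percentage perf stat prints in parentheses is the
time_running/time_enabled ratio, and the count is scaled up by the inverse;
a minimal standalone sketch of that arithmetic (scale_count() here is a
hypothetical helper, not the actual tools/perf code):

	#include <stdint.h>

	/*
	 * Multiplexing scale-up, roughly as perf stat does it:
	 *   scaled count = raw count * time_enabled / time_running
	 *   "(xx.xx%)"   = time_running / time_enabled * 100
	 *
	 * With the numbers quoted above:
	 *   e0: 2174835740 / 2289429840 -> 94.99%
	 *   e1:  847648440 / 1950025700 -> 43.47%
	 */
	static uint64_t scale_count(uint64_t count, uint64_t ena, uint64_t run,
				    double *ratio)
	{
		if (!ena || !run) {
			*ratio = 0.0;
			return 0;
		}
		*ratio = (double)run / (double)ena * 100.0;
		return (uint64_t)((double)count * ena / run);
	}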
So the enable_time for the second event is too low, meaning the ratio
should have been ~15% rather than ~43%.
/me goes try and digest the wall of text below for more clues :/
> This accounting failure of total_enable_time for an event can happen in the sequence below.
>
> 1. two events are opened:
> - the first event (e0) is opened on pmu0 (p0), which can only be added on CPU0.
> - the second event (e1) is opened on pmu1 (p1), which can only be added on CPU1.
> - these two events belong to the same task_ctx.
>
> at this point:
> task_ctx->time = 0
> e0->total_enable_time = 0
> e0->total_running_time = 0
> e1->total_enable_time = 0
> e1->total_running_time = 0
>
> 2. the task_ctx is sched in on CPU0.
> - In ctx_sched_in(), task_ctx->time isn't updated.
> - In event_sched_in(), e0 is activated, so its state becomes ACTIVE.
> - In event_sched_in(), e1 is activated but soon becomes !ACTIVE,
> because pmu1 doesn't support CPU0 and so pmu1->add() fails.
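For context, the pmu->add() failure path in event_sched_in() that this step
refers to looks roughly like the following (paraphrased from memory, not a
verbatim copy of kernel/events/core.c):

	perf_event_set_state(event, PERF_EVENT_STATE_ACTIVE);
	...
	if (event->pmu->add(event, PERF_EF_START)) {
		/* add failed: flip the event straight back to INACTIVE */
		perf_event_set_state(event, PERF_EVENT_STATE_INACTIVE);
		event->oncpu = -1;
		ret = -EAGAIN;
	}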
This doesn't make sense; e1 should never reach event_sched_in() and it
should already be INACTIVE.
Notably, events are created INACTIVE when !attr->disabled.
Also, scheduling should not get beyond merge_sched_in()'s
event_filter_match(), which will find the CPU is a mismatch and stop
right there.
This also means the event (e1) does not get to go on flexible_active
(see below).
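To illustrate, merge_sched_in() bails out before either of those points,
roughly like this (paraphrased, not a verbatim copy of kernel/events/core.c):

	static int merge_sched_in(struct perf_event *event, void *data)
	{
		int *can_add_hw = data;

		if (event->state <= PERF_EVENT_STATE_OFF)
			return 0;

		/* CPU/PMU mismatch: stop before event_sched_in() and
		 * before the event is put on an active list */
		if (!event_filter_match(event))
			return 0;

		if (group_can_go_on(event, *can_add_hw)) {
			if (!group_sched_in(event, event->ctx))
				list_add_tail(&event->active_list,
					      get_event_list(event));
		}
		/* ... */
		return 0;
	}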
>
> at this point:
> task_ctx->time = 0
> e0->total_enable_time = 0
> e0->total_running_time = 0
> e1->total_enable_time = 0
> e1->total_running_time = 0
Both have the same state, surely that could've been compressed somewhat.
/me deletes giant wall of text
> 7. the task exits and the task_ctx is sched out; suppose the scheduled time is t3.
> - In ctx_sched_out(), task_ctx->time is updated to t1 + t2 + t3.
> - e0 was ACTIVE, so in event_sched_out(),
> e0->total_enable_time and e0->total_running_time are updated with t3.
> - e1 was !ACTIVE, so it doesn't update
> e1->total_enable_time and e1->total_running_time.
>
> at this point:
> task_ctx->time = t1 + t2 + t3
> e0->total_enable_time = t1 + t2 + t3
> e0->total_running_time = t1 + t3
> e1->total_enable_time = t1 + t2
> e1->total_running_time = t2
>
> In step 7, the t3 amount of enable time is missing from e1.
> So perf prints a wrong total_running_time/total_enable_time ratio
> for e1.
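The asymmetry described in step 7 comes from the guard at the top of
event_sched_out(), roughly (again paraphrased, not verbatim):

	static void event_sched_out(struct perf_event *event,
				    struct perf_event_context *ctx)
	{
		enum perf_event_state state = PERF_EVENT_STATE_INACTIVE;

		/* !ACTIVE events are not touched here at all */
		if (event->state != PERF_EVENT_STATE_ACTIVE)
			return;

		event->pmu->del(event, 0);
		event->oncpu = -1;
		/* ... */
		/* updates total_time_enabled/total_time_running via
		 * perf_event_update_time() */
		perf_event_set_state(event, state);
		/* ... */
	}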
>
> Here is a summary of how the times change, as a timeline:
>
> CPU0 CPU1 CPU0
> | in | out | | in | out | | in | out (exit) |
> ------------------------------------------------------------------------------
> ctx->time | 0 | t1 | | t1 | t1 + t2 | | t1 + t2 | t1 + t2 + t3 |
> ------------------------------------------------------------------------------
> e0->ena | 0 | t1 | | t1 | t1 *| | t1 + t2 | t1 + t2 + t3 |
> ------------------------------------------------------------------------------
> e0->run | 0 | t1 | | t1 | t1 | | t1 + t2 | t1 + t3 |
> ------------------------------------------------------------------------------
> e1->ena | 0 | 0 *| | t1 | t1 + t2 | | t1 + t2 | t1 + t2 XXX |
> ------------------------------------------------------------------------------
> e1->run | 0 | 0 | | 0 | t2 | | t1 + t2 | t2 |
> ------------------------------------------------------------------------------
So just the table is fine (better even than this giant wall of text),
but you have to explain what's what. Eg, there is no clear progression,
and having two columns both labeled CPU0 is highly confusing.
Also, a visual indication of where the error happens is useful. So the
cell marked with XXX is the failure case, right, that's missing '+ t3'.
Except then you should also mark those I stuck a * on.
> To address this, update total_enable_time in event_sched_out() when the event's
> state is PERF_EVENT_STATE_INACTIVE.
This is a giant jump that I'm not following. Notably ctx_sched_out()
will only iterate pmu_ctx->{pinned,flexible}_active and that list should
only include ACTIVE events.
So how does handling INACTIVE in event_sched_out() even begin to help?
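(For reference, __pmu_ctx_sched_out() does roughly the following, paraphrased:

	list_for_each_entry_safe(event, tmp, &pmu_ctx->pinned_active, active_list)
		group_sched_out(event, ctx);

	list_for_each_entry_safe(event, tmp, &pmu_ctx->flexible_active, active_list)
		group_sched_out(event, ctx);

so an event that never made it onto an active list is never visited on the
way out.)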
Additionally, where does perf-stat get those numbers from; because
afaict perf_event_read() will do perf_event_update_time() and you should
be getting the correct numbers, no?
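That is, perf_event_update_time() is roughly (paraphrased):

	static void perf_event_update_time(struct perf_event *event)
	{
		u64 now = perf_event_time(event);

		__perf_update_times(event, now, &event->total_time_enabled,
						&event->total_time_running);
		event->tstamp = now;
	}

and __perf_update_times() adds the delta since event->tstamp to
total_time_enabled for any state >= INACTIVE, but to total_time_running
only for ACTIVE, so a read of an INACTIVE event should already be credited
with enabled time.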
I'm utterly lost in the perf-stat code :-( Also it appears I can't even
build tools/perf anymore :-(
> After this patch, this problem is gone like:
Yes, but why? Something smells.
Looking at group_sched_in(), I'm thinking that that STATE_OFF thing is
totally wrong, should that not be something like:
	if (group_event->state <= PERF_EVENT_STATE_OFF)
		return -EBUSY;
or something !0 at any rate.
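(The existing check, for contrast, is roughly:

	if (group_event->state == PERF_EVENT_STATE_OFF)
		return 0;

so an OFF group leader makes group_sched_in() report success without
scheduling anything, which is presumably the "STATE_OFF thing" above.)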
So what path leads to event_sched_out() with !ACTIVE that's relevant
here?
It can't have been that group_sched_out(), because that wasn't in your
first patch.
How is this a fix and why is it the right fix?