linux-kernel - Re: [PATCH v2] events/core: fix acoount failure for event's total_enable

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z2V5s2JM4W7BRckR@e129823.arm.com>
Date: Fri, 20 Dec 2024 14:05:39 +0000
From: Yeoreum Yun <yeoreum.yun@....com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...hat.com, acme@...nel.org, namhyung@...nel.org,
	mark.rutland@....com, alexander.shishkin@...ux.intel.com,
	jolsa@...nel.org, irogers@...gle.com, adrian.hunter@...el.com,
	kan.liang@...ux.intel.com, linux-perf-users@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] events/core: fix acoount failure for event's
 total_enable_time

Hi, Peter.

On Fri, Dec 20, 2024 at 02:33:59PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 20, 2024 at 10:02:02AM +0000, Yeoreum Yun wrote:
> > The perf core code fails to account for total_enable_time of event
> > when its state is inactive.
> >
> > Here is the error case for failure to account for total_enable_time for
> > core event:
> >
> > sudo ./perf stat -vvv -e armv8_pmuv3_0/event=0x08/ -e armv8_pmuv3_1/event=0x08/ -- stress-ng --pthread=2 -t 2s
> > ...
> > armv8_pmuv3_0/event=0x08/: 1138698008 2289429840 2174835740
> > armv8_pmuv3_1/event=0x08/: 1826791390 1950025700 847648440
>
> So *that* number is wrong and should have been equal to the number above
> it.

Yes it is.

> >
> >  Performance counter stats for 'stress-ng --pthread=2 -t 2s':
> >
> >      1,138,698,008      armv8_pmuv3_0/event=0x08/                                               (94.99%)
> >      1,826,791,390      armv8_pmuv3_1/event=0x08/                                               (43.47%)
> >
> > Since above two events are belong to the same task context and
> > mutually-exclusive per cpu (they couldn't be active at the same time on the same cpu),
> > the total_enable_time should be the same (the 2nd number among 3 numbers)
> > and the summation of ratio should be 100%.
>
> So the enable_time for the second event is too low, meaning the ratio
> hsould have been ~15% rather than ~43%.
>
> /me goes try and congest the wall of text below for more clues :/
>
> > This account failure of total_enable_time for event could happen in below sequence.
> >
> > 1. two event opened with:
> >    - first event (e0) is opened with pmu0(p0) which could be added on CPU0.
> >    - second event (e1) is opened with pmu1(p1) which could be added on CPU1.
> >    - these two event belongs to the same task_ctx.
> >
> >    at this point:
> >      task_ctx->time = 0
> >      e0->total_enable_time = 0
> >      e0->total_running_time = 0
> >      e1->total_enable_time = 0
> >      e1->total_running_time = 0
> >
> > 2. the task_ctx is sched in CPU0.
> >    - In ctx_sched_in(), the task_ctx->time doesn't updated.
> >    - In event_sched_in() e0 is activated so, its state becomes ACTIVE.
> >    - In event_sched_in() e1 is activated, but soon becomes !ACTIVE
> >      because pmu1 doesn't support CPU1 so it failed in pmu1->add().
>
> This doesn't make sense; e1 should never reach event_sched_in() and it
> should already be INACTIVE.
>
> Notable events are created INACTIVE when !attr->disabled.

But in perf stat code, it via enable_counter(), so it's set with
INACTIVE.

> Also, scheduling should not get beyond merge_sched_in()'s
> event_filter_match(), which will find the CPU is a mismatch and stop
> right there.
>
> This also means the event (e1) does not get to go on flexible_active
> (see below).

No, when perf stat command with above, the cpu sets as == -1,
So, It doesn't filter out in event_filter_match(). so it enter into
merge_sched_in() and get into event_sched_in().

> >    at this point:
> >      task_ctx->time = 0
> >      e0->total_enable_time = 0
> >      e0->total_running_time = 0
> >      e1->total_enable_time = 0
> >      e1->total_running_time = 0
>
> Both have the same state, surely that could've been compressed somewhat.
>
> /me deletes giant wall of text
>
> > 7. the task exit and task_ctx is sched out. and suppose scheulded time as t3.
> >    - In ctx_sched_out(), the task_ctx->time updated with t1 + t2 + t3.
> >    - e0 was ACTIVE, so In event_sched_out(),
> >      e0->total_enable_time and e1->total_running_time are updated with t3.
> >    - e1 was !ACTIVE, so it doesn't update
> >      e1->total_enable_time and e1->total_running_time.
> >
> >    at this point:
> >      task_ctx->time = t1 + t2 + t3
> >      e0->total_enable_time = t1 + t2 +t3
> >      e0->total_running_time = t1 + t3
> >      e1->total_enable_time = t1 + t2
> >      e1->total_running_time = t2
> >
> > In step 7, amount of t3 enable time is missing in e1.
> > So, the perf prints error ratio for total_running_time/total_enable_time
> > for e1.
> >
> > Here is the summary of change of time as timeline:
> >
> >                CPU0                  CPU1                     CPU0
> >           |  in | out |       | in | out     |       | in      | out (exit)   |
> > ------------------------------------------------------------------------------
> > ctx->time |  0  | t1  |       | t1 | t1 + t2 |       | t1 + t2 | t1 + t2 + t3 |
> > ------------------------------------------------------------------------------
> > e0->ena   |  0  | t1  |       | t1 | t1     *|       | t1 + t2 | t1 + t2 + t3 |
> > ------------------------------------------------------------------------------
> > e0->run   |  0  | t1  |       | t1 | t1      |       | t1 + t2 | t1 + t3      |
> > ------------------------------------------------------------------------------
> > e1->ena   |  0  | 0  *|       | t1 | t1 + t2 |       | t1 + t2 | t1 + t2  XXX |
> > ------------------------------------------------------------------------------
> > e1->run   |  0  | 0   |       | 0  | t2      |       | t1 + t2 | t2           |
> > ------------------------------------------------------------------------------
>
> So just the table is fine (better even than this giant wall of text),
> but you have to explain what's what. Eg, there is no clear progression,
> and having two columns both labeled CPU0 is highly confusing.
>
> Also, a visual indication of where the error happens is useful. So the
> cell marked with XXX is the failure case, right, that's missing '+ t3'.
>
> Except then you should also mark those I stuck a * on.
>

Thanks. I'll rewrite it again.

> > To address this, update total_enable_time in event_sched_out() when event state
> > is PERF_EVENT_STATE_INACTIVE.
>
> This is a giant jump that I'm not following. Notably ctx_sched_out()
> will only iterate pmu_ctx->{pinned,flexible}_active and that list should
> only include ACTIVE events.
> So how does handling INACTIVE in event_sched_out() even begin to help?

the answer is in the perf_event_exit_event()'s
perf_remove_from_context(). in here
event_sched_out() is called via __perf_remove_from_context()
So above case, the enable time is fixed in here.

But I think in case of change of group_ctx_out() is my mistak. Sorry.

>
> Additionally, where does perf-stat get those numbers from; because
> afaict perf_event_read() will do perf_event_update_time() and you should
> be getting the correct numbers, no?
>
> I'm utterly lost in the perf-stat code :-( Also it appears I can't even
> build tools/perf anymore :-(
>

After exit task. in this case, the ctx->is_active doesn't have
EVENT_TIME, it doesn't update ctx->time and the event time.
So, the task_ctx which is child one is callected with missing
total_enable_time.

> > After this patch, this problem is gone like:
>
> Yes, but why? Something smells.
> Looking at group_sched_in(), I'm thinking that that STATE_OFF thing is
> totally wrong, should that not be something like:
>
> 	if (group_event->state <= PERF_EVENT_STATE_OFF)
> 		return -EBUSY;
>
> or something !0 at any rate.
>
>
> So what path leads to event_sched_out() with !ACTIVE that's relevant
> here?
>
> It can't have geen that group_sched_out(), because that wasn't in your
> first patch.
>
> How is this a fix and why is it the right fix?
>
>

As I said above, at the time of exit. __perf_remove_from_context()
calls in event_sched_out(). so it fixes the missing enabled time at the
exit.

the middle of event''s total_enable_time() is updated in ctx_sched_in()
(as I above said, it doesn't filter out because it runs with cpu == -1).
So, it's enough to add missing enabled time with exit() and this is
fixed via event_sched_out() in __perf_remove_from_context().

The group_sched_out() is my mistake, But with update in
event_sched_out(). It's enough to fix and I think it's right fix.