[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z2V5s2JM4W7BRckR@e129823.arm.com>
Date: Fri, 20 Dec 2024 14:05:39 +0000
From: Yeoreum Yun <yeoreum.yun@....com>
To: Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...hat.com, acme@...nel.org, namhyung@...nel.org,
mark.rutland@....com, alexander.shishkin@...ux.intel.com,
jolsa@...nel.org, irogers@...gle.com, adrian.hunter@...el.com,
kan.liang@...ux.intel.com, linux-perf-users@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] events/core: fix acoount failure for event's
total_enable_time
Hi, Peter.
On Fri, Dec 20, 2024 at 02:33:59PM +0100, Peter Zijlstra wrote:
> On Fri, Dec 20, 2024 at 10:02:02AM +0000, Yeoreum Yun wrote:
> > The perf core code fails to account for total_enable_time of event
> > when its state is inactive.
> >
> > Here is the error case for failure to account for total_enable_time for
> > core event:
> >
> > sudo ./perf stat -vvv -e armv8_pmuv3_0/event=0x08/ -e armv8_pmuv3_1/event=0x08/ -- stress-ng --pthread=2 -t 2s
> > ...
> > armv8_pmuv3_0/event=0x08/: 1138698008 2289429840 2174835740
> > armv8_pmuv3_1/event=0x08/: 1826791390 1950025700 847648440
>
> So *that* number is wrong and should have been equal to the number above
> it.
Yes it is.
> >
> > Performance counter stats for 'stress-ng --pthread=2 -t 2s':
> >
> > 1,138,698,008 armv8_pmuv3_0/event=0x08/ (94.99%)
> > 1,826,791,390 armv8_pmuv3_1/event=0x08/ (43.47%)
> >
> > Since above two events are belong to the same task context and
> > mutually-exclusive per cpu (they couldn't be active at the same time on the same cpu),
> > the total_enable_time should be the same (the 2nd number among 3 numbers)
> > and the summation of ratio should be 100%.
>
> So the enable_time for the second event is too low, meaning the ratio
> hsould have been ~15% rather than ~43%.
>
> /me goes try and congest the wall of text below for more clues :/
>
> > This account failure of total_enable_time for event could happen in below sequence.
> >
> > 1. two event opened with:
> > - first event (e0) is opened with pmu0(p0) which could be added on CPU0.
> > - second event (e1) is opened with pmu1(p1) which could be added on CPU1.
> > - these two event belongs to the same task_ctx.
> >
> > at this point:
> > task_ctx->time = 0
> > e0->total_enable_time = 0
> > e0->total_running_time = 0
> > e1->total_enable_time = 0
> > e1->total_running_time = 0
> >
> > 2. the task_ctx is sched in CPU0.
> > - In ctx_sched_in(), the task_ctx->time doesn't updated.
> > - In event_sched_in() e0 is activated so, its state becomes ACTIVE.
> > - In event_sched_in() e1 is activated, but soon becomes !ACTIVE
> > because pmu1 doesn't support CPU1 so it failed in pmu1->add().
>
> This doesn't make sense; e1 should never reach event_sched_in() and it
> should already be INACTIVE.
>
> Notable events are created INACTIVE when !attr->disabled.
But in perf stat code, it via enable_counter(), so it's set with
INACTIVE.
> Also, scheduling should not get beyond merge_sched_in()'s
> event_filter_match(), which will find the CPU is a mismatch and stop
> right there.
>
> This also means the event (e1) does not get to go on flexible_active
> (see below).
No, when perf stat command with above, the cpu sets as == -1,
So, It doesn't filter out in event_filter_match(). so it enter into
merge_sched_in() and get into event_sched_in().
> > at this point:
> > task_ctx->time = 0
> > e0->total_enable_time = 0
> > e0->total_running_time = 0
> > e1->total_enable_time = 0
> > e1->total_running_time = 0
>
> Both have the same state, surely that could've been compressed somewhat.
>
> /me deletes giant wall of text
>
> > 7. the task exit and task_ctx is sched out. and suppose scheulded time as t3.
> > - In ctx_sched_out(), the task_ctx->time updated with t1 + t2 + t3.
> > - e0 was ACTIVE, so In event_sched_out(),
> > e0->total_enable_time and e1->total_running_time are updated with t3.
> > - e1 was !ACTIVE, so it doesn't update
> > e1->total_enable_time and e1->total_running_time.
> >
> > at this point:
> > task_ctx->time = t1 + t2 + t3
> > e0->total_enable_time = t1 + t2 +t3
> > e0->total_running_time = t1 + t3
> > e1->total_enable_time = t1 + t2
> > e1->total_running_time = t2
> >
> > In step 7, amount of t3 enable time is missing in e1.
> > So, the perf prints error ratio for total_running_time/total_enable_time
> > for e1.
> >
> > Here is the summary of change of time as timeline:
> >
> > CPU0 CPU1 CPU0
> > | in | out | | in | out | | in | out (exit) |
> > ------------------------------------------------------------------------------
> > ctx->time | 0 | t1 | | t1 | t1 + t2 | | t1 + t2 | t1 + t2 + t3 |
> > ------------------------------------------------------------------------------
> > e0->ena | 0 | t1 | | t1 | t1 *| | t1 + t2 | t1 + t2 + t3 |
> > ------------------------------------------------------------------------------
> > e0->run | 0 | t1 | | t1 | t1 | | t1 + t2 | t1 + t3 |
> > ------------------------------------------------------------------------------
> > e1->ena | 0 | 0 *| | t1 | t1 + t2 | | t1 + t2 | t1 + t2 XXX |
> > ------------------------------------------------------------------------------
> > e1->run | 0 | 0 | | 0 | t2 | | t1 + t2 | t2 |
> > ------------------------------------------------------------------------------
>
> So just the table is fine (better even than this giant wall of text),
> but you have to explain what's what. Eg, there is no clear progression,
> and having two columns both labeled CPU0 is highly confusing.
>
> Also, a visual indication of where the error happens is useful. So the
> cell marked with XXX is the failure case, right, that's missing '+ t3'.
>
> Except then you should also mark those I stuck a * on.
>
Thanks. I'll rewrite it again.
> > To address this, update total_enable_time in event_sched_out() when event state
> > is PERF_EVENT_STATE_INACTIVE.
>
> This is a giant jump that I'm not following. Notably ctx_sched_out()
> will only iterate pmu_ctx->{pinned,flexible}_active and that list should
> only include ACTIVE events.
> So how does handling INACTIVE in event_sched_out() even begin to help?
the answer is in the perf_event_exit_event()'s
perf_remove_from_context(). in here
event_sched_out() is called via __perf_remove_from_context()
So above case, the enable time is fixed in here.
But I think in case of change of group_ctx_out() is my mistak. Sorry.
>
> Additionally, where does perf-stat get those numbers from; because
> afaict perf_event_read() will do perf_event_update_time() and you should
> be getting the correct numbers, no?
>
> I'm utterly lost in the perf-stat code :-( Also it appears I can't even
> build tools/perf anymore :-(
>
After exit task. in this case, the ctx->is_active doesn't have
EVENT_TIME, it doesn't update ctx->time and the event time.
So, the task_ctx which is child one is callected with missing
total_enable_time.
> > After this patch, this problem is gone like:
>
> Yes, but why? Something smells.
> Looking at group_sched_in(), I'm thinking that that STATE_OFF thing is
> totally wrong, should that not be something like:
>
> if (group_event->state <= PERF_EVENT_STATE_OFF)
> return -EBUSY;
>
> or something !0 at any rate.
>
>
> So what path leads to event_sched_out() with !ACTIVE that's relevant
> here?
>
> It can't have geen that group_sched_out(), because that wasn't in your
> first patch.
>
> How is this a fix and why is it the right fix?
>
>
As I said above, at the time of exit. __perf_remove_from_context()
calls in event_sched_out(). so it fixes the missing enabled time at the
exit.
the middle of event''s total_enable_time() is updated in ctx_sched_in()
(as I above said, it doesn't filter out because it runs with cpu == -1).
So, it's enough to add missing enabled time with exit() and this is
fixed via event_sched_out() in __perf_remove_from_context().
The group_sched_out() is my mistake, But with update in
event_sched_out(). It's enough to fix and I think it's right fix.
Powered by blists - more mailing lists