Message-Id: <20250306123350.1650114-1-yeoreum.yun@arm.com>
Date: Thu,  6 Mar 2025 12:33:50 +0000
From: Yeoreum Yun <yeoreum.yun@....com>
To: mingo@...nel.org,
	peterz@...radead.org,
	acme@...nel.org,
	namhyung@...nel.org,
	Mark.Rutland@....com,
	alexander.shishkin@...ux.intel.com,
	jolsa@...nel.org,
	irogers@...gle.com,
	adrian.hunter@...el.com,
	kan.liang@...ux.intel.com,
	leo.yan@....com,
	james.clark@...aro.org
Cc: linux-perf-users@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	Yeoreum Yun <yeoreum.yun@....com>
Subject: [PATCH v4] events/core: fix account failure for event's child_total_enable_time at task exit

The perf core code fails to account for the total_enable_time of an
event when its state is inactive.

Here is an example of the failure to account for total_enable_time of a
core event:

sudo ./perf stat -vvv -e armv8_pmuv3_0/event=0x08/ -e armv8_pmuv3_1/event=0x08/ -- stress-ng --pthread=2 -t 2s
...

/*
 * the three numbers represent, in order:
 *   scaled count / total_enable_time / total_running_time
 */
armv8_pmuv3_0/event=0x08/: 1138698008 2289429840 2174835740
                                      ^^^^^^^^^^
armv8_pmuv3_1/event=0x08/: 1826791390 1950025700 847648440
                                      ^^^^^^^^^^

 Performance counter stats for 'stress-ng --pthread=2 -t 2s':

     1,138,698,008      armv8_pmuv3_0/event=0x08/                                               (94.99%)
     1,826,791,390      armv8_pmuv3_1/event=0x08/                                               (43.47%)

Since the two events above belong to the same task context and are
mutually exclusive per CPU (they cannot be active at the same time on
the same CPU), their total_enable_time should be the same (marked with
^^^^^^^^^^) and the ratios should sum to 100%.
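
As a quick check of the percentages above, here is a minimal sketch of
the ratio shown in parentheses, assuming it is total_running_time
divided by total_enable_time. The helper name running_pct() is
hypothetical and is not part of perf.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not a perf API: percentage of the enabled time
 * during which the event was actually counting.
 */
static double running_pct(uint64_t enabled, uint64_t running)
{
	if (enabled == 0)
		return 0.0;
	return 100.0 * (double)running / (double)enabled;
}
```

With the figures above, running_pct(2289429840, 2174835740) is about
94.99 and running_pct(1950025700, 847648440) is about 43.47. The two
do not sum to 100 because the two enable times differ.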

The total_enable_time is wrong because the child event's contribution
to child_total_enable_time is not fully accounted for when the child
task exits.

The table below shows how child_total_enable_time is under-accounted
at the exit of a child task which switches CPUs as time passes
(CPU0 -> CPU1 -> CPU0).

 - in means sched_in.
 - out means sched_out.
 - exit means the exit of the child task.
   NOTE: the value is the one before list_del_event() is called, which
         means the value in the exit column is what gets added to the
         parent event's child_total_enable_time when the child task exits.
 - ctx is the child_task_ctx,
 - e0 is the child_event which is set with cpu == -1 and opened with
   pmu0, so it is only added on CPU0,
 - e1 is the child_event which is set with cpu == -1 and opened with
   pmu1, so it is only added on CPU1,
 - e0 and e1 belong to the same child_task_ctx.

          CPU0 (run t1)        CPU1 (run t2)          CPU0 (run t3)
          |  in | out |       | in  | out     |       | in      | exit        |
------------------------------------------------------------------------------
ctx->time |  0  | t1  |       | t1  | t1 + t2 |       | t1 + t2 | t1 + t2 + t3|
------------------------------------------------------------------------------
e0->ena   |  0  | t1  |       | t1  | t1     *|       | t1 + t2 | t1 + t2 + t3|
------------------------------------------------------------------------------
e0->run   |  0  | t1  |       | t1  | t1     *|       | t1      | t1 + t3     |
------------------------------------------------------------------------------
e1->ena   |  0  | 0  *|       | t1  | t1 + t2 |       | t1 + t2 | t1 + t2    X|
------------------------------------------------------------------------------
e1->run   |  0  | 0  *|       | 0   | t2      |       | t2      | t2         X|
------------------------------------------------------------------------------

The values marked with * are not updated because event->state was
INACTIVE.

Please see the exit column of the last CPU0 run (marked with X).
Since e1's state is INACTIVE, its total_enable_time is not updated
and it keeps its former value, without accounting for t3.
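
The table can be replayed with a small toy model. This is only a sketch
under the assumption that an event's times are folded in solely when
its state changes; all toy_* names are hypothetical and none of this is
kernel code.

```c
#include <stdint.h>

enum toy_state { TOY_INACTIVE, TOY_ACTIVE };

struct toy_event {
	enum toy_state state;
	uint64_t tstamp;	/* context time of the last update */
	uint64_t ena;		/* accumulated enabled time */
	uint64_t run;		/* accumulated running time */
};

struct toy_result {
	struct toy_event e0;
	struct toy_event e1;
};

/* Fold the time since the last update into the totals, then switch state. */
static void toy_set_state(struct toy_event *e, enum toy_state s, uint64_t now)
{
	uint64_t delta = now - e->tstamp;

	e->ena += delta;
	if (e->state == TOY_ACTIVE)
		e->run += delta;
	e->tstamp = now;
	e->state = s;
}

/* Replay CPU0 (t1) -> CPU1 (t2) -> CPU0 (t3) and return the exit column. */
static struct toy_result toy_replay(uint64_t t1, uint64_t t2, uint64_t t3)
{
	struct toy_result r = {
		.e0 = { TOY_INACTIVE, 0, 0, 0 },
		.e1 = { TOY_INACTIVE, 0, 0, 0 },
	};
	uint64_t now = 0;

	toy_set_state(&r.e0, TOY_ACTIVE, now);		/* CPU0 in */
	now += t1;
	toy_set_state(&r.e0, TOY_INACTIVE, now);	/* CPU0 out, e1 untouched (*) */
	toy_set_state(&r.e1, TOY_ACTIVE, now);		/* CPU1 in */
	now += t2;
	toy_set_state(&r.e1, TOY_INACTIVE, now);	/* CPU1 out, e0 untouched (*) */
	toy_set_state(&r.e0, TOY_ACTIVE, now);		/* CPU0 in */
	now += t3;
	toy_set_state(&r.e0, TOY_INACTIVE, now);	/* exit, e1 untouched (X) */

	return r;
}
```

For example, toy_replay(10, 20, 30) leaves e0 with ena = 60 and
run = 40, while e1 is left with ena = 30 and run = 20, i.e. e1 is
missing t3 in its enabled time.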

In this situation, when __perf_remove_from_context() runs at child task
exit, sync_child_event(), which adds the child_event's total_enable_time
to the parent event's child_total_enable_time, is called from
perf_child_detach() before list_del_event(), where the event time is
updated by setting the event state to OFF.
That means the amount added to child_total_enable_time is missing the
last enable time -- t3.

The parent event's own total_enable_time is updated properly in
list_del_event() when the task exits.
However, because child_total_enable_time misses time when the child
task exits, perf prints a wrong enable_time (which is the sum of
total_enable_time + child_total_enable_time).

To address this, update the event state via perf_event_set_state() in
__perf_remove_from_context(), and call it before perf_child_detach(),
where the parent's child_total_enable_time is updated.
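
The effect of the reordering can be sketched with a toy parent/child
pair. This is a simplified model, not the kernel implementation; the
toy_* names and the update_first switch are hypothetical and exist only
to compare the old and the new call order.

```c
#include <stdbool.h>
#include <stdint.h>

struct toy_child {
	uint64_t tstamp;	/* context time of the last update */
	uint64_t ena;		/* enabled time accounted so far */
};

struct toy_parent {
	uint64_t child_ena;	/* enabled time inherited from children */
};

/* Bring the child's enabled time up to the current context time. */
static void toy_update_time(struct toy_child *c, uint64_t now)
{
	c->ena += now - c->tstamp;
	c->tstamp = now;
}

/* Fold the child's enabled time into the parent. */
static void toy_sync_child(struct toy_parent *p, const struct toy_child *c)
{
	p->child_ena += c->ena;
}

/* Model of the removal path with either ordering. */
static void toy_remove(struct toy_parent *p, struct toy_child *c,
		       uint64_t now, bool update_first)
{
	if (update_first)
		toy_update_time(c, now);	/* patched order */
	toy_sync_child(p, c);
	if (!update_first)
		toy_update_time(c, now);	/* old order: too late for the parent */
}
```

With a child last updated at time 30 (ena = 30) and removed at time 60,
the old order adds 30 to the parent while the patched order adds 60.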

After this patch, the problem is gone:

sudo ./perf stat -vvv -e armv8_pmuv3_0/event=0x08/ -e armv8_pmuv3_1/event=0x08/ -- stress-ng --pthread=2 -t 10s
...
armv8_pmuv3_0/event=0x08/: 15396770398 32157963940 21898169000
armv8_pmuv3_1/event=0x08/: 22428964974 32157963940 10259794940

 Performance counter stats for 'stress-ng --pthread=2 -t 10s':

    15,396,770,398      armv8_pmuv3_0/event=0x08/                                               (68.10%)
    22,428,964,974      armv8_pmuv3_1/event=0x08/                                               (31.90%)
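
The expectation stated earlier (equal enable times and ratios summing
to 100%) can be checked directly on the raw numbers. The helper below
is a hypothetical sketch for this two-event setup only, not a perf
facility.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: both events report the same enabled time, and
 * their running times add up to it (the ratios sum to 100%).
 */
static bool toy_times_consistent(uint64_t ena0, uint64_t run0,
				 uint64_t ena1, uint64_t run1)
{
	return ena0 == ena1 && run0 + run1 == ena0;
}
```

The patched figures (32157963940/21898169000 and
32157963940/10259794940) pass this check, while the figures from the
failing run (2289429840/2174835740 and 1950025700/847648440) do not.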

Signed-off-by: Yeoreum Yun <yeoreum.yun@....com>
Suggested-by: Peter Zijlstra <peterz@...radead.org>
---
 kernel/events/core.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6364319e2f88..058533a50493 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2407,6 +2407,7 @@ ctx_time_update_event(struct perf_event_context *ctx, struct perf_event *event)
 #define DETACH_GROUP	0x01UL
 #define DETACH_CHILD	0x02UL
 #define DETACH_DEAD	0x04UL
+#define DETACH_EXIT	0x08UL

 /*
  * Cross CPU call to remove a performance event
@@ -2421,6 +2422,7 @@ __perf_remove_from_context(struct perf_event *event,
 			   void *info)
 {
 	struct perf_event_pmu_context *pmu_ctx = event->pmu_ctx;
+	enum perf_event_state state = PERF_EVENT_STATE_OFF;
 	unsigned long flags = (unsigned long)info;

 	ctx_time_update(cpuctx, ctx);
@@ -2429,16 +2431,19 @@ __perf_remove_from_context(struct perf_event *event,
 	 * Ensure event_sched_out() switches to OFF, at the very least
 	 * this avoids raising perf_pending_task() at this time.
 	 */
-	if (flags & DETACH_DEAD)
+	if (flags & DETACH_EXIT)
+		state = PERF_EVENT_STATE_EXIT;
+	if (flags & DETACH_DEAD) {
 		event->pending_disable = 1;
+		state = PERF_EVENT_STATE_DEAD;
+	}
 	event_sched_out(event, ctx);
+	perf_event_set_state(event, min(event->state, state));
 	if (flags & DETACH_GROUP)
 		perf_group_detach(event);
 	if (flags & DETACH_CHILD)
 		perf_child_detach(event);
 	list_del_event(event, ctx);
-	if (flags & DETACH_DEAD)
-		event->state = PERF_EVENT_STATE_DEAD;

 	if (!pmu_ctx->nr_events) {
 		pmu_ctx->rotate_necessary = 0;
@@ -13424,12 +13429,7 @@ perf_event_exit_event(struct perf_event *event, struct perf_event_context *ctx)
 		mutex_lock(&parent_event->child_mutex);
 	}

-	perf_remove_from_context(event, detach_flags);
-
-	raw_spin_lock_irq(&ctx->lock);
-	if (event->state > PERF_EVENT_STATE_EXIT)
-		perf_event_set_state(event, PERF_EVENT_STATE_EXIT);
-	raw_spin_unlock_irq(&ctx->lock);
+	perf_remove_from_context(event, detach_flags | DETACH_EXIT);

 	/*
 	 * Child events can be freed.
--
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}

