Message-ID: <20260206152907.GQ1395266@noisy.programming.kicks-ass.net>
Date: Fri, 6 Feb 2026 16:29:07 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: James Clark <james.clark@...aro.org>
Cc: Thaumy Cheng <thaumy.love@...il.com>, linux-perf-users@...r.kernel.org,
linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Namhyung Kim <namhyung@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
Adrian Hunter <adrian.hunter@...el.com>,
Kan Liang <kan.liang@...ux.intel.com>,
Suzuki K Poulose <Suzuki.Poulose@....com>,
Leo Yan <leo.yan@....com>, Mike Leach <mike.leach@...aro.org>
Subject: Re: [PATCH v3] perf/core: Fix missing read event generation on task
exit
On Fri, Feb 06, 2026 at 11:21:19AM +0000, James Clark wrote:
> I've been looking into a regression caused by this commit and didn't manage
> to come up with a fix. But shouldn't this be something more like:
>
> if (attach_state & PERF_ATTACH_CHILD && event_filter_match(event))
> sync_child_event(event, task);
>
> As in, you only want to call sync_child_event() and write stuff to the ring
> buffer for the CPU that is currently running this exit handler? This change
> would affect the 'total_time_enabled' tracking as well, although I'm not
> 100% sure we aren't double counting it anyway.
>
> From perf_event_exit_task_context(), perf_event_exit_event() is called on
> all events, which includes events on other CPUs:
>
> list_for_each_entry_safe(child_event, next, &ctx->event_list, ...)
> perf_event_exit_event(child_event, ctx, exit ? task : NULL, false);
>
> Then we write into those other CPUs' ring buffers, which don't support
> concurrent writers.
>
> The reason I found this is that we have a tracing test that spawns some
> threads and then looks for PERF_RECORD_AUX events. When there are concurrent
> writes into the ring buffers, the rb->nest tracking gets messed up, leaving
> the count positive even after all nested writers have finished. From then on
> writers don't copy the data_head pointer to the user page (each one thinks
> someone else is still writing), so perf stops copying out any data and
> records go missing.
>
> An easy reproducer is to add a warning checking that the ring buffer being
> written to belongs to the current CPU:
>
> @@ -41,10 +41,11 @@ static void perf_output_get_handle(struct perf_output_handle *handle)
>  {
>  	struct perf_buffer *rb = handle->rb;
>
>  	preempt_disable();
> +	WARN_ON(handle->event->cpu != smp_processor_id());
>
> And then record:
>
> perf record -s -- stress -c 8 -t 1
>
> Which results in:
>
> perf_output_begin+0x320/0x480 (P)
> perf_event_exit_event+0x178/0x2c0
> perf_event_exit_task_context+0x214/0x2f0
> perf_event_exit_task+0xb0/0x3b0
> do_exit+0x1bc/0x808
> __arm64_sys_exit+0x28/0x30
> invoke_syscall+0x4c/0xe8
> el0_svc_common+0x9c/0xf0
> do_el0_svc+0x28/0x40
> el0_svc+0x50/0x240
> el0t_64_sync_handler+0x78/0x130
> el0t_64_sync+0x198/0x1a0
>
> I suppose there is a chance that this is only an issue when also doing
> perf_aux_output_begin()/perf_aux_output_end() from start/stop because that's
> where I saw the real race? Maybe without that, accessing the rb from another
> CPU is ok because there is some locking, but I think this might be a more
> general issue.
I *think* something like so.

Before the patch in question this would never happen, because we called
things too late and always hit that TASK_TOMBSTONE.

But irrespective of emitting that event, we do want to propagate the
count and runtime numbers.
---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5b5cb620499e..f566ad55b4fb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -14086,7 +14086,7 @@ static void sync_child_event(struct perf_event *child_event,
 	u64 child_val;
 
 	if (child_event->attr.inherit_stat) {
-		if (task && task != TASK_TOMBSTONE)
+		if (task && task != TASK_TOMBSTONE && event_filter_match(child_event))
 			perf_event_read_event(child_event, task);
 	}