Message-ID: <20260206152907.GQ1395266@noisy.programming.kicks-ass.net>
Date: Fri, 6 Feb 2026 16:29:07 +0100
From: Peter Zijlstra <peterz@...radead.org>
To: James Clark <james.clark@...aro.org>
Cc: Thaumy Cheng <thaumy.love@...il.com>, linux-perf-users@...r.kernel.org,
linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Namhyung Kim <namhyung@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
Adrian Hunter <adrian.hunter@...el.com>,
Kan Liang <kan.liang@...ux.intel.com>,
Suzuki K Poulose <Suzuki.Poulose@....com>,
Leo Yan <leo.yan@....com>, Mike Leach <mike.leach@...aro.org>
Subject: Re: [PATCH v3] perf/core: Fix missing read event generation on task
exit
On Fri, Feb 06, 2026 at 11:21:19AM +0000, James Clark wrote:
> I've been looking into a regression caused by this commit and didn't manage
> to come up with a fix. But shouldn't this be something more like:
>
> if (attach_state & PERF_ATTACH_CHILD && event_filter_match(event))
> sync_child_event(event, task);
>
> As in, you only want to call sync_child_event() and write stuff to the ring
> buffer for the CPU that is currently running this exit handler? This change
> would affect the 'total_time_enabled' tracking as well, although I'm not
> 100% sure we aren't double counting it anyway.
>
> From perf_event_exit_task_context(), perf_event_exit_event() is called on
> all events, which includes events on other CPUs:
>
> list_for_each_entry_safe(child_event, next, &ctx->event_list, ...)
> perf_event_exit_event(child_event, ctx, exit ? task : NULL, false);
>
> Then we write into those other CPUs' ring buffers, which don't support
> concurrent writers.
>
> The reason I found this is that we have a tracing test that spawns some
> threads and then looks for PERF_RECORD_AUX events. When there are concurrent
> writes into the ring buffers, the rb->nest tracking gets messed up, leaving
> the count positive even after all nested writers have finished. From then on
> writers don't copy the data_head pointer to the user page (each one thinks
> someone else is still writing), so perf stops copying out any data and
> records go missing.
>
> An easy reproducer is to add a warning checking that the ring buffer being
> written to belongs to the current CPU:
>
> @@ -41,10 +41,11 @@ static void perf_output_get_handle(struct perf_output_handle *handle)
>  {
>  	struct perf_buffer *rb = handle->rb;
>
>  	preempt_disable();
> +	WARN_ON(handle->event->cpu != smp_processor_id());
>
> And then record:
>
> perf record -s -- stress -c 8 -t 1
>
> Which results in:
>
> perf_output_begin+0x320/0x480 (P)
> perf_event_exit_event+0x178/0x2c0
> perf_event_exit_task_context+0x214/0x2f0
> perf_event_exit_task+0xb0/0x3b0
> do_exit+0x1bc/0x808
> __arm64_sys_exit+0x28/0x30
> invoke_syscall+0x4c/0xe8
> el0_svc_common+0x9c/0xf0
> do_el0_svc+0x28/0x40
> el0_svc+0x50/0x240
> el0t_64_sync_handler+0x78/0x130
> el0t_64_sync+0x198/0x1a0
>
> I suppose there is a chance that this is only an issue when also doing
> perf_aux_output_begin()/perf_aux_output_end() from start/stop because that's
> where I saw the real race? Maybe without that, accessing the rb from another
> CPU is ok because there is some locking, but I think this might be a more
> general issue.
I *think* something like so.

Before the patch in question this would never happen, because we called
things too late and always hit that TASK_TOMBSTONE.

But irrespective of emitting that event, we do want to propagate the
count and runtime numbers.
---
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 5b5cb620499e..f566ad55b4fb 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -14086,7 +14086,7 @@ static void sync_child_event(struct perf_event *child_event,
 	u64 child_val;
 
 	if (child_event->attr.inherit_stat) {
-		if (task && task != TASK_TOMBSTONE)
+		if (task && task != TASK_TOMBSTONE && event_filter_match(child_event))
 			perf_event_read_event(child_event, task);
 	}