Message-ID: <61e544ab-d9b9-454e-9dd7-65625d9be126@linaro.org>
Date: Mon, 9 Feb 2026 16:12:39 +0000
From: James Clark <james.clark@...aro.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Thaumy Cheng <thaumy.love@...il.com>, linux-perf-users@...r.kernel.org,
linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Namhyung Kim <namhyung@...nel.org>, Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
Adrian Hunter <adrian.hunter@...el.com>,
Kan Liang <kan.liang@...ux.intel.com>,
Suzuki K Poulose <Suzuki.Poulose@....com>, Leo Yan <leo.yan@....com>,
Mike Leach <mike.leach@...aro.org>
Subject: Re: [PATCH v3] perf/core: Fix missing read event generation on task
exit
On 06/02/2026 3:29 pm, Peter Zijlstra wrote:
> On Fri, Feb 06, 2026 at 11:21:19AM +0000, James Clark wrote:
>
>> I've been looking into a regression caused by this commit and didn't manage
>> to come up with a fix. But shouldn't this be something more like:
>>
>> 	if (attach_state & PERF_ATTACH_CHILD && event_filter_match(event))
>> 		sync_child_event(event, task);
>>
>> As in, you only want to call sync_child_event() and write stuff to the ring
>> buffer for the CPU that is currently running this exit handler? This change
>> affects the 'total_time_enabled' tracking as well, although I'm not 100%
>> sure we're not double counting it anyway.
>>
>> From perf_event_exit_task_context(), perf_event_exit_event() is called on
>> all events, which includes events on other CPUs:
>>
>> 	list_for_each_entry_safe(child_event, next, &ctx->event_list, ...)
>> 		perf_event_exit_event(child_event, ctx, exit ? task : NULL, false);
>>
>> Then we write into those other CPUs' ring buffers, which don't support
>> concurrent writers.
>>
>> The reason I found this is that we have a tracing test that spawns some
>> threads and then looks for PERF_RECORD_AUX events. When there are concurrent
>> writes into the ring buffers, the rb->nest tracking gets messed up, leaving
>> the count positive even after all nested writers have finished. Then no
>> future write copies the data_head pointer to the user page (because it
>> thinks someone else is still writing), so Perf doesn't copy out any more
>> data and records go missing.
>>
>> An easy reproducer is to add a warning that checks that the ring buffer
>> being written to is the correct one:
>>
>> @@ -41,10 +41,11 @@ static void perf_output_get_handle(struct perf_output_handle *handle)
>>  {
>>  	struct perf_buffer *rb = handle->rb;
>>  
>>  	preempt_disable();
>>  
>> +	WARN_ON(handle->event->cpu != smp_processor_id());
>>
>>
>> And then record:
>>
>> perf record -s -- stress -c 8 -t 1
>>
>> Which results in:
>>
>> perf_output_begin+0x320/0x480 (P)
>> perf_event_exit_event+0x178/0x2c0
>> perf_event_exit_task_context+0x214/0x2f0
>> perf_event_exit_task+0xb0/0x3b0
>> do_exit+0x1bc/0x808
>> __arm64_sys_exit+0x28/0x30
>> invoke_syscall+0x4c/0xe8
>> el0_svc_common+0x9c/0xf0
>> do_el0_svc+0x28/0x40
>> el0_svc+0x50/0x240
>> el0t_64_sync_handler+0x78/0x130
>> el0t_64_sync+0x198/0x1a0
>>
>> I suppose there is a chance that this is only an issue when also doing
>> perf_aux_output_begin()/perf_aux_output_end() from start/stop, because
>> that's where I saw the real race. Maybe without that, accessing the rb from
>> another CPU is OK because there is some locking, but I think this might be a
>> more general issue.
>
> I *think* something like so.
>
> Before the patch in question this would never happen, because of calling
> things too late and always hitting that TASK_TOMBSTONE.
>
> But irrespective of emitting that event, we do want to propagate the
> count and runtime numbers.
>
>
> ---
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 5b5cb620499e..f566ad55b4fb 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -14086,7 +14086,7 @@ static void sync_child_event(struct perf_event *child_event,
> 	u64 child_val;
>
> 	if (child_event->attr.inherit_stat) {
> -		if (task && task != TASK_TOMBSTONE)
> +		if (task && task != TASK_TOMBSTONE && event_filter_match(child_event))
> 			perf_event_read_event(child_event, task);
> 	}
>
It turns out I had tested this before with "child_event->cpu ==
raw_smp_processor_id()" rather than event_filter_match(), so I missed
that the loop over all the events needs to be wrapped in
preempt_disable(). But that can't be done, because
perf_event_exit_event() takes a mutex.
I don't think the preempt_disable() can cover anything smaller than the
entire loop either: otherwise you can get rescheduled between the
event_filter_match() checks, fail every one of them and not write any
event out at all.
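
To make the constraint concrete, this is roughly the shape that would be
needed (only a sketch mirroring the loop quoted above, not a proposed
patch):

	/*
	 * event_filter_match() is only a meaningful gate if we can't
	 * migrate away before the event is written out.
	 */
	preempt_disable();
	list_for_each_entry_safe(child_event, next, &ctx->event_list, ...)
		perf_event_exit_event(child_event, ctx, exit ? task : NULL, false);
	preempt_enable();

...which isn't allowed, because perf_event_exit_event() can sleep on
that mutex.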
While debugging I also noticed another issue with these per-thread count
records. perf_event_exit_event() only does anything if the event has a
parent, but the context switch optimization means that threads sometimes
re-use the original event, which has no parent. So threads are randomly
missing from the output.
There is a comment that mentions this under the parent check:
	if (parent_event) {
		/*
		 * Do not destroy the 'original' grouping; because of the
		 * context switch optimization the original events could've
		 * ended up in a random child task.
But I'm not sure whether that was supposed to be worked around some other
way and is now broken, or whether it was a known limitation of the
implementation from the beginning. Right now the output either randomly
misses one of the threads but includes the main thread counts, or, if no
context switch optimisation was done, includes all the threads but not
the main thread counts.
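
For anyone following along, the shape I'm describing is roughly this
(heavily simplified, just to show which path the per-thread record
depends on, not the exact code):

	/* in perf_event_exit_event(), simplified */
	if (parent_event) {
		/* inherited child: this is the only path where the
		 * per-thread read event gets emitted */
		...
	} else {
		/* 'original' event that the context switch optimization
		 * left in this exiting child: nothing is emitted for
		 * this thread */
		...
	}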
The perf record docs don't say anything to suggest that you shouldn't
expect all the threads to be there:
-s, --stat per thread counts
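
In case it's useful, here is a reproducer along the same lines as the
stress command above (a hypothetical test program, not the actual tracing
test we use), where every spinning thread plus the main thread should
show up in the per-thread counts:

	/*
	 * gcc -O2 -pthread threads.c -o threads
	 * perf record -s -- ./threads
	 */
	#include <pthread.h>

	#define NR_THREADS 8

	static void *spin(void *arg)
	{
		/* burn some cycles so each thread accumulates a count */
		for (volatile unsigned long i = 0; i < 100000000UL; i++)
			;
		return NULL;
	}

	int main(void)
	{
		pthread_t threads[NR_THREADS];

		for (int i = 0; i < NR_THREADS; i++)
			pthread_create(&threads[i], NULL, spin, NULL);
		for (int i = 0; i < NR_THREADS; i++)
			pthread_join(threads[i], NULL);

		return 0;
	}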