linux-kernel - Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aD8PM7qx7vCDhs8a@e129823.arm.com>
Date: Tue, 3 Jun 2025 16:05:23 +0100
From: Yeoreum Yun <yeoreum.yun@....com>
To: Leo Yan <leo.yan@....com>
Cc: peterz@...radead.org, mingo@...hat.com, mingo@...nel.org,
	acme@...nel.org, namhyung@...nel.org, mark.rutland@....com,
	alexander.shishkin@...ux.intel.com, jolsa@...nel.org,
	irogers@...gle.com, adrian.hunter@...el.com,
	kan.liang@...ux.intel.com, linux-perf-users@...r.kernel.org,
	linux-kernel@...r.kernel.org, David Wang <00107082@....com>
Subject: Re: [PATCH 1/1] perf/core: fix dangling cgroup pointer in cpuctx

Hi Leo,
>
> On Mon, Jun 02, 2025 at 07:40:49PM +0100, Yeoreum Yun wrote:
> > commit a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
> > changes the event->state update before list_del_event().
> > This change prevents calling perf_cgroup_event_disable() as a result,
> > cpuctx->cgrp can't be cleared properly and point to dangling point of cgroup.
> >
> > Because of this problem, some machin meets the below panic[0]:
> >
> > 863.881960] sysved_call_function_sing le+0x4c/0xc0
> > 863.881301] asm_sysvec_call_function_single+0x16/0x20
> > 869.881344] RIP: 0633:0x7f9alcea3367
> > 663.681373] Code: 00 66 99 b8 ff ff ff ff c3 66 ....
> > 863.881524] RSP: 002b:00007fffa526fcf8 EFLAGS: 00000246
> > 869.881567] RAX: 0000562060c962d0 RBX: 0000000000000002 RCX: 00007f9a1cff1c60
> > 863.881625] RDX: 00007f9a0c000030 RSI: 00007f9alcff1c60 RDI: 00007f9a1ca91c20
> > 863.081682] RBP: 0000000000000001 R08: 0000000000000000 R09: 00007f9a1d6217a0
> > 869.881740] R10: 00007f9alca91c10 R11: 0000000000000246 R12: 00007f9a1d70c020
> > 869.881798] R13: 00007fffa5270030 R14: 00007fffa526fd00 R15: 0000000000000000
> > 863.881860] </TASK>
> > 863.881876) Modules linked in: snd_seq_dummy (E) snd_hrtimer (E)...
> > ...
> > 863.887142] button (E)
> > 863.912127] CR2: ffffe4afcc079650
> > 863.914593] --- [ end trace 0000000000000000 1--
> > 864.042750] RIP: 0010:ctx_sched_out+0x1ce/0x210
> > 864.045214] Code: 89 c6 4c 8b b9 de 00 00 00 48 ...
> > 864.050343] RSP: 0000:ffffaa4ec0f3fe60 EFLAGS: 00010086
> > 864.052929] RAX: 0000000000000002 RBX: ffff8e8eeed2a580 RCX: ffff8e8bded9bf00
> > 864.055518] RDX: 000000c92340b051 RSI: 000000c92340b051 RDI: ffff
> > 864.058093] RBP: 0000000000000000 R08: 0000000000000002 R09: 00
> > 864.060654] R10: 0000000000000000 R11: 0000000000000000 R12: 000
> > 864.063183] R13: ffff8e8eeed2a580 R14: 0000000000000007 R15: ffffe4afcc079650
> > 864.065729] FS: 00007f9a1ca91940 (0000) GS:ffff8e8f6b1c3000(0000) knIGS:0000000000000000
> > 864.068312] CS: 0010 DS: 0000 ES: 0000 CRO: 0000000080050033
> > 864.070898] CR2: ffffe4afcc079650 CR3: 00000001136d8000 CR4: 0000000000350ef0
> > 864.673523] Kernel panic - not syncing: Fatal exception in interrupt
> > 864.076410] Kernel Offset: 0xc00000 from 0xffffffff81000000 (relocation range: 0xff
> > 864.205401] --- [ end Kernel panic - not syncing: Fatal exception in interrupt ]---
> >
> > To address this call the perf_cgroup_event_disable() properly before
> > list_del_event() in __perf_remove_from_context().
> >
> > Link: https://lore.kernel.org/all/aD2TspKH%2F7yvfYoO@e129823.arm.com/ [0]
> > Fixes: a3c3c6667("perf/core: Fix child_total_time_enabled accounting bug at task exit")
> > Signed-off-by: Yeoreum Yun <yeoreum.yun@....com>
> > Tested-by: David Wang <00107082@....com>
> > ---
> >  kernel/events/core.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index f34c99f8ce8f..909b9d5a65c1 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -2498,6 +2498,10 @@ __perf_remove_from_context(struct perf_event *event,
> >  		state = PERF_EVENT_STATE_DEAD;
> >  	}
> >  	event_sched_out(event, ctx);
> > +
> > +	if (event->state > PERF_EVENT_STATE_OFF)
> > +		perf_cgroup_event_disable(event, ctx);
> > +
>
> As we discussed, seems to me, the issue is caused by an ambigous state
> machine transition:
>
> When a PMU event state is PERF_EVENT_STATE_EXIT, the current code does
> not transite the state to PERF_EVENT_STATE_OFF. As a result, the
> list_del_event() function skips to clean up cgroup pointer for non OFF
> states. This is different from the code prior to the commit a3c3c6667,
> which transits states EXIT -> INACTIVE -> OFF.

Just for correct this, prior to commit a3c3c6667,
Isn't the state change INACTIVE -> OFF -> EXIT?

when I saw the perf_event_exit_event(),
  perf_remove_from_context(event, detacg_flags
  ...
  perf_event_set_state(event, PERF_EVENT_STATE_EXIT),

So,

  1. event_sched_out() -> INACTIVE
  2. list_del_event() -> OFF
  3. perf_event_set_state() -> EXIT.

Therefore, the final event state is "PREF_EVENT_STATE_EXIT"

Therefore, I think calling "perf_cgroup_event_disable()" is called
as soon as event_sched_out() since commit a3c3c6667 change the state
INCATIVE -> EXIT.

so, the middle state of OFF is gone and I think this middle state "OFF"
doesn't affect any race condition since the removed event "wakeup" is
also called after it changed to "final state" -- in this case, the
PERF_EVENT_STATE_EXIT.

>
> My suggestion is not reliable. Roughly read code, except for the
> PERF_EVENT_STATE_EXIT case, I think other error cases should also clean
> up the cgroup pointer.  The reason is I don't see other places to
> clean up the cgroup pointer for these error cases:
>
>   PERF_EVENT_STATE_REVOKED
>   PERF_EVENT_STATE_DEAD
>
> Only in the PERF_EVENT_STATE_ERROR state, we don't need to cleanup
> cgroup as this has already been handled in merge_sched_in().
>
> So a correct condition would be:
>
>     if (event->state > PERF_EVENT_STATE_OFF ||
>         event->state <= PERF_EVENT_STATE_EXIT)
>         perf_cgroup_event_disable(event, ctx);
>
> And we need to remove the perf_cgroup_event_disable() from
> list_del_event() to avoid duplicate code.

If we sustain perf_cgroup_event_disable()'s call in *list_del_event()*
But, Should it be? I think its enough to call as soon as after event_sched_out().

>
> Perhaps a better approach for code consolidation would be to modify
> the conditions in list_del_event() to ensure the cgroup pointer is
> cleaned up in error cases. However, I'm not confident that this is the
> correct direction, so I would wait for suggestions from the maintainers.
>
> Thanks,
> Leo
[...]

--
Sincerely,
Yeoreum Yun