[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250502102918.GW4198@noisy.programming.kicks-ass.net>
Date: Fri, 2 May 2025 12:29:18 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: Ingo Molnar <mingo@...hat.com>, LKML <linux-kernel@...r.kernel.org>,
"Liang, Kan" <kan.liang@...ux.intel.com>,
Adrian Hunter <adrian.hunter@...el.com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Ian Rogers <irogers@...gle.com>, Jiri Olsa <jolsa@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Namhyung Kim <namhyung@...nel.org>,
Ravi Bangoria <ravi.bangoria@....com>,
linux-perf-users@...r.kernel.org
Subject: Re: [PATCH 2/4] perf: Fix irq work dereferencing garbage
On Mon, Apr 28, 2025 at 01:11:47PM +0200, Frederic Weisbecker wrote:
> Le Thu, Apr 24, 2025 at 06:30:24PM +0200, Peter Zijlstra a écrit :
> > On Thu, Apr 24, 2025 at 06:11:26PM +0200, Frederic Weisbecker wrote:
> > > @@ -13940,29 +13941,36 @@ perf_event_exit_event(struct perf_event *event,
> > > * Do destroy all inherited groups, we don't care about those
> > > * and being thorough is better.
> > > */
> > > - detach_flags |= DETACH_GROUP | DETACH_CHILD;
> > > + prd.detach_flags |= DETACH_GROUP | DETACH_CHILD;
> > > mutex_lock(&parent_event->child_mutex);
> > > }
> > >
> > > if (revoke)
> > > - detach_flags |= DETACH_GROUP | DETACH_REVOKE;
> > > + prd.detach_flags |= DETACH_GROUP | DETACH_REVOKE;
> > >
> > > - perf_remove_from_context(event, detach_flags);
> > > + perf_remove_from_context(event, &prd);
> >
> > Isn't all this waay to complicated?
> >
> > That is, to modify state we need both ctx->mutex and ctx->lock, and this
> > is what __perf_remove_from_context() has, but because of this, holding
> > either one of those locks is sufficient to read the state -- it cannot
> > change.
> >
> > And here we already hold ctx->mutex.
> >
> > So can't we simply do:
> >
> > old_state = event->attach_state;
> > perf_remove_from_context(event, detach_flags);
> >
> > // do whatever with old_state
>
> Right, the locking scenario is just a bit more complicated.
> Most flags are set on init or with both ctx mutex and lock.
> But:
>
> _ PERF_ATTACH_CHILD is set instead with parent child_mutex and ctx lock.
Looks trivial to add ctx->mutex to the mix here. Its not like that's a
fast path. But let me go read your patch before deciding if that's
actually needed :-)
> _ PERF_ATTACH_ITRACE is set from pmu::start(). Thus from the event context
> with just interrupt disabled. It's probably enough to synchronize against
> initialization and remove_from_context IPIs but perf_event_exit_event() needs
> some care.
Right, that's a little tricky indeed. As stated, we don't care about the
bit, but the write shouldn't mess things up.
> So we must hold both ctx mutex and child_mutex (although the pmus_srcu thing
> should make that PERF_ATTACH_CHILD thing visible but let's keep things obvious).
> And also have WRITE_ONCE() / READ_ONCE() to take care about PERF_ATTACH_ITRACE,
> which we don't care about anyway.
>
> Now this looks like this:
>
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 7bcb02ffb93a..7278ca731a55 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -208,7 +208,6 @@ static void perf_ctx_unlock(struct perf_cpu_context *cpuctx,
> }
>
> #define TASK_TOMBSTONE ((void *)-1L)
> -#define EVENT_TOMBSTONE ((void *)-1L)
>
> static bool is_kernel_event(struct perf_event *event)
> {
> @@ -2338,12 +2337,6 @@ static void perf_child_detach(struct perf_event *event)
>
> sync_child_event(event);
> list_del_init(&event->child_list);
> - /*
> - * Cannot set to NULL, as that would confuse the situation vs
> - * not being a child event. See for example unaccount_event().
> - */
> - event->parent = EVENT_TOMBSTONE;
> - put_event(parent_event);
> }
>
> static bool is_orphaned_event(struct perf_event *event)
> @@ -5705,7 +5698,7 @@ static void put_event(struct perf_event *event)
> _free_event(event);
>
> /* Matches the refcount bump in inherit_event() */
> - if (parent && parent != EVENT_TOMBSTONE)
> + if (parent)
> put_event(parent);
> }
>
> @@ -9998,7 +9991,7 @@ void perf_event_text_poke(const void *addr, const void *old_bytes,
>
> void perf_event_itrace_started(struct perf_event *event)
> {
> - event->attach_state |= PERF_ATTACH_ITRACE;
> + WRITE_ONCE(event->attach_state, event->attach_state | PERF_ATTACH_ITRACE);
> }
>
> static void perf_log_itrace_start(struct perf_event *event)
> @@ -13922,10 +13915,7 @@ perf_event_exit_event(struct perf_event *event,
> {
> struct perf_event *parent_event = event->parent;
> unsigned long detach_flags = DETACH_EXIT;
> - bool is_child = !!parent_event;
> -
> - if (parent_event == EVENT_TOMBSTONE)
> - parent_event = NULL;
> + unsigned int attach_state;
>
> if (parent_event) {
> /*
> @@ -13942,6 +13932,8 @@ perf_event_exit_event(struct perf_event *event,
> */
> detach_flags |= DETACH_GROUP | DETACH_CHILD;
> mutex_lock(&parent_event->child_mutex);
> + /* PERF_ATTACH_ITRACE might be set concurrently */
> + attach_state = READ_ONCE(event->attach_state);
> }
>
> if (revoke)
> @@ -13951,18 +13943,25 @@ perf_event_exit_event(struct perf_event *event,
> /*
> * Child events can be freed.
> */
> - if (is_child) {
> - if (parent_event) {
> - mutex_unlock(&parent_event->child_mutex);
> - /*
> - * Kick perf_poll() for is_event_hup();
> - */
> - perf_event_wakeup(parent_event);
> + if (parent_event) {
> + mutex_unlock(&parent_event->child_mutex);
> + /*
> + * Kick perf_poll() for is_event_hup();
> + */
> + perf_event_wakeup(parent_event);
Should not this perf_event_wakeup() be inside the next if() as well?
doing anything on parent_event when !ATTACH_CHILD seems dodgy.
> +
> + /*
> + * Match the refcount initialization. Make sure it doesn't happen
> + * twice if pmu_detach_event() calls it on an already exited task.
> + */
> + if (attach_state & PERF_ATTACH_CHILD) {
> /*
> * pmu_detach_event() will have an extra refcount.
> + * perf_pending_task() might have one too.
> */
> put_event(event);
> }
> +
> return;
> }
This is a *much* saner patch, thank you!
So the thing I worried about... which is why I chose for the TOMBSTONE
thing, is that this second invocation will now dereference parent_event,
even though we've already released our reference count on it.
This is essentially a use-after-free.
The thing that makes it work is RCU. And I think we're good, since the
fail case is two perf_event_exit_event() invocations on the same event,
separated by an RCU grace period, and I don't think this can happen.
But its a shame we can't reliably detect that.. Oh well.
Powered by blists - more mailing lists