Message-ID: <20250424163024.GC18306@noisy.programming.kicks-ass.net>
Date: Thu, 24 Apr 2025 18:30:24 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Frederic Weisbecker <frederic@...nel.org>
Cc: Ingo Molnar <mingo@...hat.com>, LKML <linux-kernel@...r.kernel.org>,
"Liang, Kan" <kan.liang@...ux.intel.com>,
Adrian Hunter <adrian.hunter@...el.com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Ian Rogers <irogers@...gle.com>, Jiri Olsa <jolsa@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Namhyung Kim <namhyung@...nel.org>,
Ravi Bangoria <ravi.bangoria@....com>,
linux-perf-users@...r.kernel.org
Subject: Re: [PATCH 2/4] perf: Fix irq work dereferencing garbage
On Thu, Apr 24, 2025 at 06:11:26PM +0200, Frederic Weisbecker wrote:
> The following commit:
>
> da916e96e2de ("perf: Make perf_pmu_unregister() useable")
>
> has introduced two significant changes to an event's parent lifecycle:
>
> 1) An event that has exited now has EVENT_TOMBSTONE as a parent.
> This can result in a situation where the delayed wakeup irq_work can
> accidentally dereference EVENT_TOMBSTONE:
>
>      CPU 0                                        CPU 1
>      -----                                        -----
>
>  __schedule()
>    local_irq_disable()
>    rq_lock()
>    <NMI>
>    perf_event_overflow()
>        irq_work_queue(&child->pending_irq)
>    </NMI>
>    perf_event_task_sched_out()
>        raw_spin_lock(&ctx->lock)
>        ctx_sched_out()
>        ctx->is_active = 0
>        event_sched_out(child)
>        raw_spin_unlock(&ctx->lock)
>                                                 perf_event_release_kernel(parent)
>                                                     perf_remove_from_context(child)
>                                                         raw_spin_lock_irq(&ctx->lock)
>                                                         // Sees !ctx->is_active
>                                                         // Removes from context inline
>                                                         __perf_remove_from_context(child)
>                                                             perf_child_detach(child)
>                                                                 event->parent = EVENT_TOMBSTONE
>    raw_spin_rq_unlock_irq(rq);
>    <IRQ>
>    perf_pending_irq()
>        perf_event_wakeup(child)
>            ring_buffer_wakeup(child)
>                rcu_dereference(child->parent->rb)    <--- CRASH
>
> This also concerns the call to kill_fasync() on parent->fasync.
Argh, I actually looked for this case and didn't find it in one of the
earlier fixes :/
> 2) The final parent reference count decrement can now happen before
> the final child reference count decrement, i.e. the parent can now
> be freed before its child. On PREEMPT_RT, this can result in a
> situation where the delayed wakeup irq_work can accidentally
> dereference a freed parent:
>
>      CPU 0                               CPU 1                              CPU 2
>      -----                               -----                              ------
>
>                                                                         perf_pmu_unregister()
>                                                                             pmu_detach_events()
>                                                                                 pmu_get_event()
>                                                                                     atomic_long_inc_not_zero(&child->refcount)
>
>  <NMI>
>  perf_event_overflow()
>      irq_work_queue(&child->pending_irq);
>  </NMI>
>  <IRQ>
>  irq_work_run()
>      wake_irq_workd()
>  </IRQ>
>  preempt_schedule_irq()
>  =========> SWITCH to workd
>  irq_work_run_list()
>      perf_pending_irq()
>          perf_event_wakeup(child)
>              ring_buffer_wakeup(child)
>                  event = child->parent
>
>                                      perf_event_release_kernel(parent)
>                                          // Not last ref, PMU holds it
>                                          put_event(child)
>                                          // Last ref
>                                          put_event(parent)
>                                              free_event()
>                                                  call_rcu(...)
>                                      rcu_core()
>                                          free_event_rcu()
>
>                  rcu_dereference(event->rb) <--- CRASH
>
> This also concerns the call to kill_fasync() on parent->fasync.
>
> The "easy" solution to 1) is to check that event->parent is not
> EVENT_TOMBSTONE in perf_event_wakeup() (including both ring buffer
> and fasync uses).
>
> The "easy" solution to 2) is to turn perf_event_wakeup() to wholefully
> run under rcu_read_lock().
>
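
(Just to have it spelled out, the two "easy" options would look something
like the below -- completely untested sketch, with the ring-buffer and
fasync internals elided, so take the exact shape with a grain of salt.)

	/*
	 * Untested sketch only, not the actual patch:
	 *  1) bail out if the parent has been tombstoned before we get to
	 *     dereference it, and
	 *  2) keep the whole wakeup under rcu_read_lock() so a parent freed
	 *     via call_rcu() (free_event_rcu()) stays valid for the duration.
	 */
	void perf_event_wakeup(struct perf_event *event)
	{
		if (READ_ONCE(event->parent) == EVENT_TOMBSTONE)
			return;

		rcu_read_lock();

		ring_buffer_wakeup(event);

		if (event->pending_kill) {
			kill_fasync(perf_event_fasync(event), SIGIO,
				    event->pending_kill);
			event->pending_kill = 0;
		}

		rcu_read_unlock();
	}
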
> However, because of 2), sanity would prescribe making event::parent
> an __rcu pointer and annotating each and every user to prove they are
> reliable.
>
> Propose an alternate solution: restore the stable pointer to the
> parent until all its children have called _free_event() themselves, to
> avoid any further accident. Also revert the EVENT_TOMBSTONE design
> that is mostly there to determine which caller of perf_event_exit_event()
> must perform the refcount decrement on a child event, matching the
> increment in inherit_event().
>
> Arrange instead for checking the attach state of an event prior to its
> removal and decrement the refcount of the child accordingly.
Urgh, brain hurts, will have to look again tomorrow.
> @@ -13940,29 +13941,36 @@ perf_event_exit_event(struct perf_event *event,
>  		 * Do destroy all inherited groups, we don't care about those
>  		 * and being thorough is better.
>  		 */
> -		detach_flags |= DETACH_GROUP | DETACH_CHILD;
> +		prd.detach_flags |= DETACH_GROUP | DETACH_CHILD;
>  		mutex_lock(&parent_event->child_mutex);
>  	}
>  
>  	if (revoke)
> -		detach_flags |= DETACH_GROUP | DETACH_REVOKE;
> +		prd.detach_flags |= DETACH_GROUP | DETACH_REVOKE;
>  
> -	perf_remove_from_context(event, detach_flags);
> +	perf_remove_from_context(event, &prd);
Isn't all this waay too complicated?
That is, to modify the state we need both ctx->mutex and ctx->lock --
which is what __perf_remove_from_context() holds. But because of this,
holding either one of those locks is sufficient to read the state -- it
cannot change underneath us.
And here we already hold ctx->mutex.
So can't we simply do:
	old_state = event->attach_state;
	perf_remove_from_context(event, detach_flags);
	// do whatever with old_state
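
IOW, something along these lines (completely untested, just to sketch the
shape -- the PERF_ATTACH_CHILD test is lifted from your patch below):

	int old_state = event->attach_state;

	perf_remove_from_context(event, detach_flags);

	if (parent_event) {
		mutex_unlock(&parent_event->child_mutex);
		/* Kick perf_poll() for is_event_hup(); */
		perf_event_wakeup(parent_event);

		/*
		 * Only drop the inherit_event() reference if the event
		 * still was a child at removal time; pmu_detach_event()
		 * may already have dealt with an exited task.
		 */
		if (old_state & PERF_ATTACH_CHILD)
			put_event(event);

		return;
	}
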
>  	/*
>  	 * Child events can be freed.
>  	 */
> -	if (is_child) {
> -		if (parent_event) {
> -			mutex_unlock(&parent_event->child_mutex);
> -			/*
> -			 * Kick perf_poll() for is_event_hup();
> -			 */
> -			perf_event_wakeup(parent_event);
> +	if (parent_event) {
> +		mutex_unlock(&parent_event->child_mutex);
> +		/*
> +		 * Kick perf_poll() for is_event_hup();
> +		 */
> +		perf_event_wakeup(parent_event);
> +
> +		/*
> +		 * Match the refcount initialization. Make sure it doesn't happen
> +		 * twice if pmu_detach_event() calls it on an already exited task.
> +		 */
> +		if (prd.old_state & PERF_ATTACH_CHILD) {
>  			/*
>  			 * pmu_detach_event() will have an extra refcount.
> +			 * perf_pending_task() might have one too.
>  			 */
>  			put_event(event);
>  		}
> +
>  		return;
>  	}
>  