Message-ID: <aWAMC3HPskHNQeOs@google.com>
Date: Thu, 8 Jan 2026 11:56:59 -0800
From: Namhyung Kim <namhyung@...nel.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: Ingo Molnar <mingo@...hat.com>,
Arnaldo Carvalho de Melo <acme@...nel.org>,
Mark Rutland <mark.rutland@....com>,
Alexander Shishkin <alexander.shishkin@...ux.intel.com>,
Jiri Olsa <jolsa@...nel.org>, Ian Rogers <irogers@...gle.com>,
Adrian Hunter <adrian.hunter@...el.com>,
James Clark <james.clark@...aro.org>,
linux-perf-users@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [BUG] perf/core: Task stuck on global_ctx_data_rwsem
On Wed, Jan 07, 2026 at 11:32:56PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 07, 2026 at 11:28:24PM +0100, Peter Zijlstra wrote:
> > On Wed, Jan 07, 2026 at 11:01:53AM -0800, Namhyung Kim wrote:
> >
> > > > But yes, I suppose this can do. The question is however, how do you get
> > > > into this predicament to begin with? Are you creating and destroying a
> > > > lot of global LBR events or something?
> > >
> > > I think it's just because there are too many tasks in the system,
> > > on the order of 100K. And any thread going to exit needs to wait for
> > > attach_global_ctx_data() to finish iterating over every task.
> >
> > OMG, so many tasks ...
> >
> > > > Would it make sense to delay detach_global_ctx_data() for a second or
> > > > so? That is, what is your event creation pattern?
> > >
> > > I don't think it has a special pattern, but I'm curious how we can
> > > handle a race like the one below.
> > >
> > >   attach_global_ctx_data
> > >     check p->flags & PF_EXITING
> > >                                         do_exit
> > >     (preemption)                          set PF_EXITING
> > >                                           detach_task_ctx_data()
> > >     check p->perf_ctx_data
> > >     attach_task_ctx_data()  ---> memory leak
> >
> > Oh right. Something like so perhaps?
> >
> > ---
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index 3c2a491200c6..e5e716420eb3 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -5421,9 +5421,19 @@ attach_task_ctx_data(struct task_struct *task, struct kmem_cache *ctx_cache,
> >  		return -ENOMEM;
> >  
> >  	for (;;) {
> > -		if (try_cmpxchg((struct perf_ctx_data **)&task->perf_ctx_data, &old, cd)) {
> > +		if (try_cmpxchg(&task->perf_ctx_data, &old, cd)) {
> >  			if (old)
> >  				perf_free_ctx_data_rcu(old);
> > +			/*
> > +			 * try_cmpxchg() pairs with try_cmpxchg() from
> > +			 * detach_task_ctx_data() such that
> > +			 * if we race with perf_event_exit_task(), we must
> > +			 * observe PF_EXITING.
> > +			 */
> > +			if (task->flags & PF_EXITING) {
> > +				task->perf_ctx_data = NULL;
> > +				perf_free_ctx_data_rcu(cd);
>
> Ugh, and now it can race and do a double free; another try_cmpxchg() is
> needed here.
Thanks! Something like this?
Namhyung
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 376fb07d869b8b50..cf252d8f49b2b259 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -5421,9 +5421,20 @@ attach_task_ctx_data(struct task_struct *task, struct kmem_cache *ctx_cache,
 		return -ENOMEM;
 
 	for (;;) {
-		if (try_cmpxchg((struct perf_ctx_data **)&task->perf_ctx_data, &old, cd)) {
+		if (try_cmpxchg(&task->perf_ctx_data, &old, cd)) {
 			if (old)
 				perf_free_ctx_data_rcu(old);
+			/*
+			 * try_cmpxchg() pairs with try_cmpxchg() from
+			 * detach_task_ctx_data() such that
+			 * if we race with perf_event_exit_task(), we must
+			 * observe PF_EXITING.
+			 */
+			if (task->flags & PF_EXITING) {
+				/* detach_task_ctx_data() may free it already */
+				if (try_cmpxchg(&task->perf_ctx_data, &cd, NULL))
+					perf_free_ctx_data_rcu(cd);
+			}
 			return 0;
 		}
 
@@ -5469,6 +5480,8 @@ attach_global_ctx_data(struct kmem_cache *ctx_cache)
 	/* Allocate everything */
 	scoped_guard (rcu) {
 		for_each_process_thread(g, p) {
+			if (p->flags & PF_EXITING)
+				continue;
 			cd = rcu_dereference(p->perf_ctx_data);
 			if (cd && !cd->global) {
 				cd->global = 1;
@@ -14562,8 +14575,11 @@ void perf_event_exit_task(struct task_struct *task)
 
 	/*
 	 * Detach the perf_ctx_data for the system-wide event.
+	 *
+	 * Done without holding global_ctx_data_rwsem; typically
+	 * attach_global_ctx_data() will skip over this task, but otherwise
+	 * attach_task_ctx_data() will observe PF_EXITING.
 	 */
-	guard(percpu_read)(&global_ctx_data_rwsem);
 	detach_task_ctx_data(task);
 }
 