Message-ID: <CAKfTPtC42qVKbng8bb8G4ebVz4PQ1HF3N5cyK3U0S37zxbTy-g@mail.gmail.com>
Date: Mon, 12 Jan 2026 18:32:44 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Krister Johansen <kjlx@...pleofstupid.com>
Cc: Cruz Zhao <CruzZhao@...ux.alibaba.com>, tip-bot2@...utronix.de,
linux-kernel@...r.kernel.org, linux-tip-commits@...r.kernel.org,
mingo@...nel.org, x86@...nel.org, Peng Wang <peng_wang@...ux.alibaba.com>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [tip:sched/urgent] sched/fair: Clear ->h_load_next when
unregistering a cgroup
On Sat, 3 Jan 2026 at 02:47, Krister Johansen <kjlx@...pleofstupid.com> wrote:
>
> Hi Vincent,
>
> On Mon, Dec 29, 2025 at 02:58:16PM +0100, Vincent Guittot wrote:
> > On Mon, 29 Dec 2025 at 13:51, Cruz Zhao <CruzZhao@...ux.alibaba.com> wrote:
> > > I noticed that the following patch has been queued in the
> > > tip:sched/urgent branch for some time but hasn't yet made
> > > it into mainline:
> > > https://lore.kernel.org/all/176478073513.498.15089394378873483436.tip-bot2@tip-bot2/
> > >
> > > Could you please check if there's anything blocking its
> > > merge? I wanted to ensure it doesn’t get overlooked.
> >
> > From an off-list discussion with Peter: we need to check that this patch
> > is not hiding the root cause, namely that task_h_load() is not called in
> > the right context, i.e. with rcu_read_lock() held. Peter pointed out one
> > such place in the numa balancing code [1]
> >
> > [1] https://lore.kernel.org/all/20251015124422.GD3419281@noisy.programming.kicks-ass.net/
>
> If it helps, I've double-checked this code a few times. When I looked,
> there were 7 different callers of task_h_load(), and they decompose into
> 3 cases:
>
> 1. rcu_read_lock is held as we expect
> 2. the numa balancing cases Peter already identified
> 3. tick related invocations, where the caller is in interrupt context
>
> For case 3, there's an edge case where deferred work is scheduled if the
> target cpu is in full nohz mode and its tick has stopped.
Thanks for this analysis. I'm aligned with your conclusion that we
have two calls which are not protected:
- task_numa_find_cpu
- sched_tick_remote
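
Just to spell out the expectation for those two sites (an illustrative
sketch only, not a proposed patch; p and load stand for whatever the
caller already has in hand):

	rcu_read_lock();
	/*
	 * The cfs_rq/sched_entity hierarchy walked by task_h_load() must
	 * not be torn down under us, so the call has to sit inside an RCU
	 * read-side critical section.
	 */
	load = task_h_load(p);
	rcu_read_unlock();
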
>
> In the cases where I'm hitting this bug, the systems aren't using numa
> balancing and aren't using nohz. 90% of the ones I've analyzed are in a
> futex wakeup and are holding the rcu_read_lock.
Do you have a simple way to reproduce it ?
>
> This seems like just a case of the pointer continuing to reference
> memory that was already freed. If the task group's sched entity is
> freed, but the parent cfs_rq still has a pointer to that sched_entity in
> h_load_next, then a later h_load update may end up accessing that freed
> memory if we do not clear the pointer.
I agree that rcu protection does not prevent cfs_rq->h_load_next from
holding a reference to the freed sched_entity after the grace period and
until update_cfs_rq_h_load() overwrites it with another child. I just
wonder how we can really end up traversing A->B->C, because when setting
up the list A->B->D we first set B->h_load_next = D and then
A->h_load_next = B.
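
To make that ordering concrete, a stripped-down sketch of the two passes
(illustrative only, not the actual update_cfs_rq_h_load() code; the
letters name the task groups, h_load_next lives on a group's cfs_rq and
points to a child group's sched_entity):

	/* upward pass, starting from the leaf: */
	WRITE_ONCE(B->h_load_next, D);	/* any stale pointer to C is overwritten here */
	WRITE_ONCE(A->h_load_next, B);

	/* downward pass: */
	se = READ_ONCE(A->h_load_next);	/* B */
	se = READ_ONCE(B->h_load_next);	/* expected to be D, not the freed C */

so I don't yet see where the downward walk can still pick up C.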
Vincent
>
> Put another way, even if all of these callers used rcu_read_lock, there
> would still be a need to ensure that the parent's h_load_next doesn't
> point to a sched entity that is freed once the RCU read-side critical
> section is exited, because it is the child that is getting freed, not
> the parent. The (freed) child is still discoverable from the parent's
> h_load_next after the critical section because the delete code does not
> clear h_load_next and order that write before the free.
>
> -K