Message-ID: <CAKfTPtC42qVKbng8bb8G4ebVz4PQ1HF3N5cyK3U0S37zxbTy-g@mail.gmail.com>
Date: Mon, 12 Jan 2026 18:32:44 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Krister Johansen <kjlx@...pleofstupid.com>
Cc: Cruz Zhao <CruzZhao@...ux.alibaba.com>, tip-bot2@...utronix.de,
linux-kernel@...r.kernel.org, linux-tip-commits@...r.kernel.org,
mingo@...nel.org, x86@...nel.org, Peng Wang <peng_wang@...ux.alibaba.com>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [tip:sched/urgent] sched/fair: Clear ->h_load_next when
unregistering a cgroup
On Sat, 3 Jan 2026 at 02:47, Krister Johansen <kjlx@...pleofstupid.com> wrote:
>
> Hi Vincent,
>
> On Mon, Dec 29, 2025 at 02:58:16PM +0100, Vincent Guittot wrote:
> > On Mon, 29 Dec 2025 at 13:51, Cruz Zhao <CruzZhao@...ux.alibaba.com> wrote:
> > > I noticed that the following patch has been queued in the
> > > tip:sched/urgent branch for some time but hasn't yet made
> > > it into mainline:
> > > https://lore.kernel.org/all/176478073513.498.15089394378873483436.tip-bot2@tip-bot2/
> > >
> > > Could you please check if there's anything blocking its
> > > merge? I wanted to ensure it doesn’t get overlooked.
> >
> > From an off-list discussion with Peter: we need to check that this patch
> > is not hiding the root cause, namely that task_h_load() is not called in
> > the right context, i.e. with rcu_read_lock() held. Peter pointed out one
> > such place in the numa balancing code [1]
> >
> > [1] https://lore.kernel.org/all/20251015124422.GD3419281@noisy.programming.kicks-ass.net/
>
> If it helps, I've double-checked this code a few times. When I looked,
> there were 7 different callers of task_h_load(), and they decompose into
> 3 cases:
>
> 1. rcu_read_lock is held as we expect
> 2. the numa balancing cases Peter already identified
> 3. tick related invocations, where the caller is in interrupt context
>
> For case 3, there's an edge case where deferred work is scheduled if the
> target cpu is in full nohz mode and its tick has stopped.
Thanks for this analysis. I'm aligned with your conclusion that we
have two calls which are not protected:
- task_numa_find_cpu
- sched_tick_remote
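
Just to spell out the expectation for those two sites (an illustrative
sketch only, not a proposed patch; p and load stand for whatever the
caller already has in hand):

	rcu_read_lock();
	/*
	 * The cfs_rq/sched_entity hierarchy walked by task_h_load() must
	 * not be torn down under us, so the call has to sit inside an RCU
	 * read-side critical section.
	 */
	load = task_h_load(p);
	rcu_read_unlock();
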
>
> In the cases where I'm hitting this bug, the systems aren't using numa
> balancing and aren't using nohz. 90% of the ones I've analyzed are in a
> futex wakeup and are holding the rcu_read_lock.
Do you have a simple way to reproduce it ?
>
> This seems like just a case of the pointer continuing to reference
> memory that was already freed. If the task group's sched entity is
> freed, but the parent cfs_rq still has a pointer to that sched_entity in
> h_load_next, then a later h_load update may end up accessing that freed
> memory if we do not clear the pointer.
I agree that rcu protection does not prevent cfs_rq->h_load_next from
holding a reference to the freed sched_entity after the grace period and
until update_cfs_rq_h_load() overwrites it with another child. I just
wonder how we can really end up traversing A->B->C, because when setting
up the list A->B->D we first set B->h_load_next = D and then
A->h_load_next = B.
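
To make that ordering concrete, a stripped-down sketch of the two passes
(illustrative only, not the actual update_cfs_rq_h_load() code; the
letters name the task groups, h_load_next lives on a group's cfs_rq and
points to a child group's sched_entity):

	/* upward pass, starting from the leaf: */
	WRITE_ONCE(B->h_load_next, D);	/* any stale pointer to C is overwritten here */
	WRITE_ONCE(A->h_load_next, B);

	/* downward pass: */
	se = READ_ONCE(A->h_load_next);	/* B */
	se = READ_ONCE(B->h_load_next);	/* expected to be D, not the freed C */

so I don't yet see where the downward walk can still pick up C.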
Vincent
>
> Put another way, even if all of these callers used rcu_read_lock, there
> would still be a need to ensure that the parent's h_load_next doesn't
> point to a sched entity that is freed once the RCU read-side critical
> section is exited, because it is the child that is getting freed, not
> the parent. The (freed) child is still discoverable from the parent's
> h_load_next after the critical section because the delete code does not
> clear h_load_next and order that write before the free.
>
> -K