Message-ID: <aVh1Fiar6aC4W_1D@templeofstupid.com>
Date: Fri, 2 Jan 2026 17:47:02 -0800
From: Krister Johansen <kjlx@...pleofstupid.com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Cruz Zhao <CruzZhao@...ux.alibaba.com>, tip-bot2@...utronix.de,
linux-kernel@...r.kernel.org, linux-tip-commits@...r.kernel.org,
mingo@...nel.org, x86@...nel.org,
Peng Wang <peng_wang@...ux.alibaba.com>,
Peter Zijlstra <peterz@...radead.org>
Subject: Re: [tip:sched/urgent] sched/fair: Clear ->h_load_next when
unregistering a cgroup
Hi Vincent,
On Mon, Dec 29, 2025 at 02:58:16PM +0100, Vincent Guittot wrote:
> On Mon, 29 Dec 2025 at 13:51, Cruz Zhao <CruzZhao@...ux.alibaba.com> wrote:
> > I noticed that the following patch has been queued in the
> > tip:sched/urgent branch for some time but hasn't yet made
> > it into mainline:
> > https://lore.kernel.org/all/176478073513.498.15089394378873483436.tip-bot2@tip-bot2/
> >
> > Could you please check if there's anything blocking its
> > merge? I wanted to ensure it doesn’t get overlooked.
>
> From an off list discussion w/ Peter, we need to check that this patch
> is not hiding the root cause that task_h_load is not called in the
> right context i.e. with rcu_read_lock(). Peter pointed out one place
> in numa [1]
>
> [1] https://lore.kernel.org/all/20251015124422.GD3419281@noisy.programming.kicks-ass.net/
If it helps, I've double-checked this code a few times. When I looked,
there were 7 different callers of task_h_load(), and they decompose into
3 cases:
1. rcu_read_lock is held as we expect
2. the numa balancing cases Peter already identified
3. tick related invocations, where the caller is in interrupt context
For 3, there's an edge case where deferred work is scheduled if the
target cpu is in full nohz mode and has stopped its tick.
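For reference, the case 1 callers follow the usual pattern. This is a
rough sketch of the shape, not a quote of any one call site:

	rcu_read_lock();
	/* ... pick a task p while walking the rq / cgroup structures ... */
	unsigned long load = task_h_load(p);	/* walks the hierarchy via h_load_next */
	/* ... use load for the balancing decision ... */
	rcu_read_unlock();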
In the cases where I'm hitting this bug, the systems aren't using numa
balancing and aren't using nohz. 90% of the ones I've analyzed are in a
futex wakeup and are holding the rcu_read_lock.
This seems like just a case of the pointer continuing to reference
memory that was already freed. If the task group's sched entity is
freed, but the parent cfs_rq still has a pointer to that sched_entity in
h_load_next, then task_h_load() can end up dereferencing the freed
memory if we don't clear the pointer.
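To make that concrete, the walk looks roughly like this (paraphrased
from my reading of update_cfs_rq_h_load(); the READ_ONCE/WRITE_ONCE
annotations and the last_h_load_update short-circuit are elided, so
treat it as a sketch rather than the actual source):

	/* upward pass: each ancestor cfs_rq caches the se leading back down */
	cfs_rq->h_load_next = NULL;
	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		cfs_rq->h_load_next = se;
	}

	/* downward pass: follow the cached pointers, dereferencing each se */
	while ((se = cfs_rq->h_load_next) != NULL) {
		load = div64_ul(cfs_rq->h_load * se->avg.load_avg,
				cfs_rq_load_avg(cfs_rq) + 1);
		cfs_rq = group_cfs_rq(se);	/* use-after-free if se was freed */
		cfs_rq->h_load = load;
	}

The dangerous part is the downward pass: whatever se it finds in
h_load_next gets dereferenced, so a pointer left behind to a freed se
is fatal.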
Put another way, even if all of these callers used rcu_read_lock, we
would still need to ensure that the parent's h_load_next doesn't point
to a sched entity that gets freed once the RCU read-side critical
section is exited, because it's the child that is being freed, not the
parent. The (freed) child is still discoverable from the parent's
h_load_next after the critical section, because the delete code neither
clears h_load_next nor orders that write before the free.
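If it's useful, the shape of the fix I have in mind looks something
like the below. This is only a sketch based on the patch subject, not
the actual diff; the placement and the per-cpu loop on the unregister
path are my assumption:

	/*
	 * Sketch, not the queued patch: on the cgroup unregister path,
	 * drop any cached reference to this tg's se before it can be
	 * freed, so that no later walk can find the se once its memory
	 * is returned.
	 */
	for_each_possible_cpu(cpu) {
		struct sched_entity *se = tg->se[cpu];

		if (se && cfs_rq_of(se)->h_load_next == se)
			cfs_rq_of(se)->h_load_next = NULL;
	}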
-K