Message-Id: <f3d77b74d72da0c627ff4b4fe9d430969da6b900.1761200831.git.peng_wang@linux.alibaba.com>
Date: Thu, 23 Oct 2025 14:28:27 +0800
From: Peng Wang <peng_wang@...ux.alibaba.com>
To: vincent.guittot@...aro.org
Cc: bsegall@...gle.com,
	dietmar.eggemann@....com,
	juri.lelli@...hat.com,
	linux-kernel@...r.kernel.org,
	mgorman@...e.de,
	mingo@...hat.com,
	peng_wang@...ux.alibaba.com,
	peterz@...radead.org,
	rostedt@...dmis.org,
	vdavydov.dev@...il.com,
	vschneid@...hat.com,
	stable@...r.kernel.org
Subject: [PATCH v2] sched/fair: Clear ->h_load_next when unregistering cgroup

An invalid pointer dereference was reported on an arm64 CPU; it has not
yet been seen on x86. A partial oops looks like:

 Call trace:
  update_cfs_rq_h_load+0x80/0xb0
  wake_affine+0x158/0x168
  select_task_rq_fair+0x364/0x3a8
  try_to_wake_up+0x154/0x648
  wake_up_q+0x68/0xd0
  futex_wake_op+0x280/0x4c8
  do_futex+0x198/0x1c0
  __arm64_sys_futex+0x11c/0x198

Link: https://lore.kernel.org/all/20251013071820.1531295-1-CruzZhao@linux.alibaba.com/

We found that the task_group corresponding to the problematic se is
not in the parent task_group's children list, which indicates that
h_load_next points to an invalid address. Consider the following
cgroup and task hierarchy:

         A
        / \
       /   \
      B     E
     / \    |
    /   \   t2
   C     D
   |     |
   t0    t1

The following sequence of events may trigger the problem:

CPU X                   CPU Y                   CPU Z
wakeup t0
set list A->B->C
traverse A->B->C
t0 exits
destroy C
                        wakeup t2
                        set list A->E           wakeup t1
                                                set list A->B->D
                        traverse A->B->C
                        panic

CPU Z sets the ->h_load_next list to A->B->D, but because of arm64's
weaker memory ordering, CPU Y may observe A->B before it sees B->D. In
that window, CPU Y can traverse A->B->C and reach an invalid se.

Avoid stale pointer accesses by clearing ->h_load_next when
unregistering a cgroup.

Suggested-by: Vincent Guittot <vincent.guittot@...aro.org>
Fixes: 685207963be9 ("sched: Move h_load calculation to task_h_load()")
Cc: <stable@...r.kernel.org>
Co-developed-by: Cruz Zhao <CruzZhao@...ux.alibaba.com>
Signed-off-by: Cruz Zhao <CruzZhao@...ux.alibaba.com>
Signed-off-by: Peng Wang <peng_wang@...ux.alibaba.com>
---
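Note: the effect of the change can be illustrated with a small
single-threaded userspace model. This is only a sketch under stated
assumptions, not the kernel code: the structures are reduced to the
fields that matter here, the owner field and all model_* helpers are
hypothetical names introduced for illustration, and no memory ordering
or concurrency is modelled. It only shows that once the parent's
->h_load_next is cleared on unregistration, a top-down walk stops
before reaching the removed entity.

```c
#include <assert.h>
#include <stddef.h>

struct cfs_rq;

/* Reduced stand-in for the kernel's sched_entity (illustrative only). */
struct sched_entity {
	struct cfs_rq *cfs_rq;	/* parent queue this entity sits on */
	struct cfs_rq *my_q;	/* group queue owned by this entity */
};

/* Reduced stand-in for the kernel's cfs_rq (illustrative only). */
struct cfs_rq {
	struct sched_entity *h_load_next; /* next hop of the top-down walk */
	struct sched_entity *owner;	  /* hypothetical: se representing
					   * this queue, NULL for the root */
};

/* Attach se to its parent queue and to the group queue it owns. */
static void model_link(struct sched_entity *se, struct cfs_rq *parent,
		       struct cfs_rq *own)
{
	se->cfs_rq = parent;
	se->my_q = own;
	own->owner = se;
}

/* Bottom-up pass: record in every ancestor which child to follow. */
static void model_set_chain(struct cfs_rq *cfs_rq)
{
	struct sched_entity *se = cfs_rq->owner;

	cfs_rq->h_load_next = NULL;
	for (; se; se = se->cfs_rq->owner)
		se->cfs_rq->h_load_next = se;
}

/* Top-down pass: count the entities reachable from the root queue. */
static int model_chain_len(const struct cfs_rq *cfs_rq)
{
	const struct sched_entity *se;
	int n = 0;

	while ((se = cfs_rq->h_load_next) != NULL) {
		n++;
		cfs_rq = se->my_q;
	}
	return n;
}

/* Mirror of the fix: drop the parent's link if it still points at se. */
static void model_unregister(struct sched_entity *se)
{
	assert(se != NULL);
	if (se->cfs_rq->h_load_next == se)
		se->cfs_rq->h_load_next = NULL;
}
```

With queues A, B, C and D linked as in the hierarchy above, calling
model_set_chain() on C and then model_unregister() on C's entity leaves
B's h_load_next NULL, so a walk from A ends at B and never touches C's
entity.
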
 kernel/sched/fair.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cee1793e8277..a5fce15093d3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13427,6 +13427,14 @@ void unregister_fair_sched_group(struct task_group *tg)
 				list_del_leaf_cfs_rq(cfs_rq);
 			}
 			remove_entity_load_avg(se);
+			/*
+			 * Clear parent's h_load_next if it points to the
+			 * sched_entity being freed to avoid stale pointer.
+			 */
+			struct cfs_rq *parent_cfs_rq = cfs_rq_of(se);
+
+			if (READ_ONCE(parent_cfs_rq->h_load_next) == se)
+				WRITE_ONCE(parent_cfs_rq->h_load_next, NULL);
 		}
 
 		/*
-- 
2.27.0

