Message-ID: <20250804130326.57523-1-xupengbo@oppo.com>
Date: Mon, 4 Aug 2025 21:03:26 +0800
From: xupengbo <xupengbo@...o.com>
To: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
	<vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, "Mel
 Gorman" <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
	<linux-kernel@...r.kernel.org>
CC: <xupengbo@...o.com>, <cgroups@...r.kernel.org>
Subject: [PATCH] sched/fair: Fix unfairness caused by stalled tg_load_avg_contrib when the last task migrates out.

We added a function, update_tg_load_avg_immediately(), that mimics
update_tg_load_avg() but drops the update-interval restriction, so that
tg->load_avg is updated immediately when it is called. It is only called
from update_load_avg(): there, if the DO_DETACH flag is set and the task
being detached is the last task on the cfs_rq, we call
update_tg_load_avg_immediately(); otherwise we call update_tg_load_avg()
as before. The reasoning is as follows.

1. Due to the 1ms update-period limitation in update_tg_load_avg(), there
is a possibility that the reduced load_avg has not been propagated to
tg->load_avg when a task migrates out.
2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
function cfs_rq_is_decayed() does not check whether
cfs_rq->tg_load_avg_contrib is zero. Consequently, in some cases,
__update_blocked_fair() removes cfs_rqs whose avg.load_avg has not yet
been folded into tg->load_avg.

When these two events occur within the 1ms window (defined by
NSEC_PER_MSEC in update_tg_load_avg()) and no other tasks can migrate to
the CPU due to the cpumask constraints, the corresponding portion of
load_avg will never be subtracted from tg->load_avg. This results in an
inflated tg->load_avg and reduced scheduling entity (se) weight for the
task group. If the migrating task had a large weight, the task group's
share may deviate significantly from its expected value. This issue is
easily reproducible in task migration scenarios.
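
For context: a group entity's weight on a CPU is roughly
tg->shares * cfs_rq_load / (tg->load_avg - tg_load_avg_contrib + cfs_rq_load)
(see calc_group_shares() in kernel/sched/fair.c), so stale load stuck in
tg->load_avg inflates the denominator for every other cfs_rq of the group
and shrinks their weights accordingly.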

Initially, I discovered this bug on Android 16 (running kernel v6.12), and
was subsequently able to reproduce it on an 8-core Ubuntu 24.04 VM with
kernel versions v6.14 and v6.16-rc7. I believe it exists in any kernel
built with both CONFIG_FAIR_GROUP_SCHED and CONFIG_SMP.
I wrote a short C program that does just three things:
  1. call sched_setaffinity() to bind itself to CPU 1.
  2. call sched_setaffinity() to bind itself to CPU 2.
  3. spin in an endless loop.

Here is the source code.
```
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
int main() {
  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(1, &cpuset);
  pid_t pid = gettid();

  if (sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset) == -1) {
    return 1;
  }

  CPU_ZERO(&cpuset);
  CPU_SET(2, &cpuset);

  if (sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset) == -1) {
    return 1;
  }
  while (1)
    ;
  return 0;
}
```
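
To build it, "gcc -o bug_test bug_test.c" is enough; note that gettid()
needs glibc 2.30 or newer (on older glibc, syscall(SYS_gettid) can be
used instead).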

Then I made a test script to put the tasks into the groups.
(Forgive me for copying and pasting those lines instead of using a
for-loop; an equivalent loop-based variant is sketched after the script.)

```
#!/usr/bin/bash

shares=100
pkill 'bug_test'
sleep 2
rmdir /sys/fs/cgroup/cpu/bug_test_{1..4}
mkdir /sys/fs/cgroup/cpu/bug_test_{1..4}

echo $shares >/sys/fs/cgroup/cpu/bug_test_1/cpu.shares
echo $shares >/sys/fs/cgroup/cpu/bug_test_2/cpu.shares
echo $shares >/sys/fs/cgroup/cpu/bug_test_3/cpu.shares
echo $shares >/sys/fs/cgroup/cpu/bug_test_4/cpu.shares

nohup ./bug_test &
proc1=$!
echo "$proc1" >/sys/fs/cgroup/cpu/bug_test_1/cgroup.procs
nohup ./bug_test &
proc2=$!
echo "$proc2" >/sys/fs/cgroup/cpu/bug_test_2/cgroup.procs
nohup ./bug_test &
proc3=$!
echo "$proc3" >/sys/fs/cgroup/cpu/bug_test_3/cgroup.procs
nohup ./bug_test &
proc4=$!
echo "$proc4" >/sys/fs/cgroup/cpu/bug_test_4/cgroup.procs

```
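
For reference, an equivalent loop-based variant (an untested sketch with
the same intended behavior) would be:
```
#!/usr/bin/bash
# Loop-based variant of the script above.
shares=100
pkill 'bug_test'
sleep 2

for i in {1..4}; do
  cg="/sys/fs/cgroup/cpu/bug_test_$i"
  rmdir "$cg" 2>/dev/null
  mkdir "$cg"
  echo $shares >"$cg/cpu.shares"
  nohup ./bug_test &
  echo $! >"$cg/cgroup.procs"
done
```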

After a few runs of the script, we can see that some processes get a
smaller share of the CPU while others get twice as much. This state
persists until the processes exit.

$ ps u -C bug_test
USER    PID   %CPU %MEM    VSZ   RSS TTY    STAT START   TIME COMMAND
root    13924 33.3  0.0   2556  1196 ?     R    18:55   0:56 ./bug_test
root    13925 16.6  0.0   2556  1196 ?     R    18:55   0:28 ./bug_test
root    13926 33.2  0.0   2556  1196 ?     R    18:55   0:56 ./bug_test
root    13927 16.6  0.0   2556  1200 ?     R    18:55   0:28 ./bug_test

Link: https://lore.kernel.org/all/20210501141950.23622-2-odin@uged.al/
This phenomenon is very similar to the one addressed by the fix linked
above.

Link: https://lore.kernel.org/cgroups/CAFpoUr1zGNf9vTbWjwsfY9E8YBjyE5xJ0SwzLebPiS7b=xz_Zw@mail.gmail.com/
Link: https://lore.kernel.org/cgroups/CAKfTPtA6AyL2f-KqHXecZrYKmZ9r9mT=Ks6BeNLjV9dfbSZJxQ@mail.gmail.com/
There are also hints of this in the replies linked above. Perhaps there is
nothing wrong with that patch, except that it does not take into account
the case where cfs_rq->tg_load_avg_contrib != 0 but cfs_rq_is_decayed()
returns true when the last CFS task is migrated off this CPU.

It is also easy to reproduce when autogroup is turned on. We just need to
repeat the following command over and over again.

$ sudo nohup ./bug_test 2>&1 &

PS: You must use sudo or the new process will be added to the same
autogroup.

$ ps u -C bug_test

USER      PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root    14444 18.8  0.0   2556  1196 pts/5    RN   19:30   0:16 ./bug_test
root    14458 17.9  0.0   2556  1196 pts/6    RN   19:30   0:15 ./bug_test
root    14472 17.0  0.0   2556  1196 pts/7    RN   19:30   0:14 ./bug_test
root    14486 16.7  0.0   2556  1196 pts/8    RN   19:30   0:13 ./bug_test
root    14500 33.2  0.0   2556  1196 pts/9    RN   19:30   0:27 ./bug_test

As we can see, each process should get about a 20% CPU share, but 4 out
of 5 hit the bug.

Another experiment in the same environment, in /sys/kernel/debug/sched/:
$ cat debug | grep 'bug_test'|  awk '{print $2, $3, $4, $16}'
bug_test 20245 21805.633289 /autogroup-891
bug_test 20256 10453.702263 /autogroup-892
bug_test 20270 20333.813653 /autogroup-893
bug_test 20284 19965.294506 /autogroup-894
bug_test 20298 9781.445557 /autogroup-895
rker:bug_test.c 13189 10.359087 /autogroup-515
rker:bug_test.c 13190 11.425550 /autogroup-515

$ cat debug | grep -A 28 'cfs_rq\[2\]:/autogroup' | \
egrep "(tg_load_avg|cfs_rq)"
cfs_rq[2]:/autogroup-891
  .tg_load_avg_contrib           : 335
  .tg_load_avg                   : 335
cfs_rq[2]:/autogroup-892
  .tg_load_avg_contrib           : 335
  .tg_load_avg                   : 670
cfs_rq[2]:/autogroup-893
  .tg_load_avg_contrib           : 335
  .tg_load_avg                   : 335
cfs_rq[2]:/autogroup-894
  .tg_load_avg_contrib           : 335
  .tg_load_avg                   : 335
cfs_rq[2]:/autogroup-895
  .tg_load_avg_contrib           : 335
  .tg_load_avg                   : 670

$ cat debug | grep -A 28 '\]:/autogroup-892' | \
egrep '(tg_load_avg|cfs_rq|\.nr_queued|\.load)'
cfs_rq[2]:/autogroup-892
  .nr_queued                     : 1
  .load                          : 343040
  .load_avg                      : 335
  .removed.load_avg              : 0
  .tg_load_avg_contrib           : 335
  .tg_load_avg                   : 670
  .se->avg.load_avg              : 511

There is only one task in autogroup-892, yet its tg_load_avg is doubled.
As far as I know, this is not supposed to be a feature of autogroup.
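
For illustration, plugging these numbers into a rough user-space model of
calc_group_shares() (a sketch only: fixed-point scaling and clamping are
ignored, and tg->shares is assumed to be the default 1024) reproduces the
halved weight:
```
#include <stdio.h>

/* Rough model of calc_group_shares() in kernel/sched/fair.c:
 * weight ~= tg_shares * local_load / (tg_load_avg - contrib + local_load)
 */
static long group_weight(long tg_shares, long tg_load_avg,
                         long contrib, long local_load)
{
  long tg_weight = tg_load_avg - contrib + local_load;

  return tg_weight ? tg_shares * local_load / tg_weight : tg_shares;
}

int main(void)
{
  /* Healthy case: no stale contribution left in tg->load_avg. */
  printf("expected weight: %ld\n", group_weight(1024, 335, 335, 335));
  /* Buggy case from the debug output above: tg_load_avg == 670. */
  printf("stalled weight : %ld\n", group_weight(1024, 670, 335, 335));
  return 0;
}
```
The second value (512) matches the .se->avg.load_avg of 511 seen above
and the halved CPU share.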

The above experiments were all done on Ubuntu 24.04 with kernel v6.16-rc7.
On Android, I instrumented trace points at the following two locations in
__update_blocked_fair() and update_tg_load_avg().
```
static bool __update_blocked_fair(struct rq *rq, bool *done)
{
  ......
    if (cfs_rq_is_decayed(cfs_rq)) {
      //trace here
      list_del_leaf_cfs_rq(cfs_rq);
    }
  ......
}

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
  ......
  now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
  if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC) {
    //trace here
    return;
  }
  ......
}
```
I'm sorry I can't provide those traces, but with Perfetto it is easy to
see the abnormal cfs_rq hitting both if-branches within 1ms. First, the
only task in the cgroup was migrated to CPU 1, which called
update_tg_load_avg() and updated its timestamp. The task was then migrated
out again within 1ms, but that update was skipped because of the
update-interval limit. At this point the cfs_rq is still on the
leaf_cfs_rq_list, yet in some cases cfs_rq_is_decayed() returns true
within that same 1ms, while cfs_rq->tg_load_avg_contrib still roughly
equals the task's load_avg, which is nonzero.

When a task is migrated off a cfs_rq, dequeue_load_avg() subtracts its
avg.load_sum and avg.load_avg. Sometimes load_sum drops to zero, sometimes
it does not. If it does drop to zero, the cfs_rq will soon be removed from
the leaf_cfs_rq_list, and __update_blocked_fair() can no longer update it.

Link: https://lore.kernel.org/cgroups/20210518125202.78658-2-odin@uged.al/
In this patch, Odin proposed adding a check to cfs_rq_is_decayed() to
determine whether cfs_rq->tg_load_avg_contrib is zero. However, it appears
that this patch was not merged. In fact, a check in cfs_rq_is_decayed()
similar to the delta filter in update_tg_load_avg() (the /64 comparison
there) would also effectively resolve this issue. That approach would
block (2.) above: if the delta is too large, cfs_rq_is_decayed() returns
false, the cfs_rq stays on the leaf_cfs_rq_list, and
__update_blocked_fair() eventually updates it once the 1ms limit has
passed. The only consideration is whether adding a check on
cfs_rq->tg_load_avg_contrib to cfs_rq_is_decayed() increases coupling.
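
For illustration only, that alternative check might look roughly like the
following (a hypothetical sketch, not part of this patch and not tested;
the helper name is made up, and it reuses the same /64 delta filter as
update_tg_load_avg()):
```
/*
 * Hypothetical: does this cfs_rq's contribution to tg->load_avg still
 * need a significant correction? If so, cfs_rq_is_decayed() could return
 * false, so the cfs_rq stays on the leaf_cfs_rq_list and
 * __update_blocked_fair() gets another chance to update it.
 */
static inline bool tg_contrib_needs_update(struct cfs_rq *cfs_rq)
{
	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;

	return abs(delta) > cfs_rq->tg_load_avg_contrib / 64;
}
```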

Signed-off-by: xupengbo <xupengbo@...o.com>
---
 kernel/sched/fair.c | 50 ++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c..97feba367be9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4065,6 +4065,45 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 	return true;
 }
 
+/* only called in update_load_avg() */
+static inline void update_tg_load_avg_immediately(struct cfs_rq *cfs_rq)
+{
+	long delta;
+	u64 now;
+
+	/*
+	 * No need to update load_avg for root_task_group as it is not used.
+	 */
+	if (cfs_rq->tg == &root_task_group)
+		return;
+
+	/* rq has been offline and doesn't contribute to the share anymore: */
+	if (!cpu_active(cpu_of(rq_of(cfs_rq))))
+		return;
+
+	/*
+	 * Under normal circumstances, for migration heavy workloads, access
+	 * to tg->load_avg can be unbound. Limit the update rate to at most
+	 * once per ms.
+	 * However when the last task is migrating from this cpu, we must
+	 * update tg->load_avg immediately. Otherwise, if this cfs_rq becomes
+	 * idle forever due to cpumask and is removed from leaf_cfs_rq_list,
+	 * the huge mismatch between cfs_rq->avg.load_avg(which may be zero)
+	 * and cfs_rq->tg_load_avg_contrib(stalled load_avg of last task)
+	 * can never be corrected, which will lead to a significant value
+	 * error in se.weight for this group.
+	 * We retain this value filter below because it is not the main cause
+	 * of this bug, so we are being conservative.
+	 */
+	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
+	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
+		atomic_long_add(delta, &cfs_rq->tg->load_avg);
+		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
+		cfs_rq->last_update_tg_load_avg = now;
+	}
+}
+
 /**
  * update_tg_load_avg - update the tg's load avg
  * @cfs_rq: the cfs_rq whose avg changed
@@ -4449,6 +4488,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
 
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
 
+static inline void update_tg_load_avg_immediately(struct cfs_rq *cfs_rq) {}
+
 static inline void clear_tg_offline_cfs_rqs(struct rq *rq) {}
 
 static inline int propagate_entity_load_avg(struct sched_entity *se)
@@ -4747,9 +4788,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 		/*
 		 * DO_DETACH means we're here from dequeue_entity()
 		 * and we are migrating task out of the CPU.
+		 *
+		 * At this point, we have not subtracted nr_queued.
+		 * If cfs_rq->nr_queued ==1, the last cfs task is being
+		 * migrated from this cfs_rq.
 		 */
 		detach_entity_load_avg(cfs_rq, se);
-		update_tg_load_avg(cfs_rq);
+		if (cfs_rq->nr_queued == 1)
+			update_tg_load_avg_immediately(cfs_rq);
+		else
+			update_tg_load_avg(cfs_rq);
 	} else if (decayed) {
 		cfs_rq_util_change(cfs_rq, 0);
 
-- 
2.43.0

