Message-ID: <20250805090517.GA687566@bytedance>
Date: Tue, 5 Aug 2025 17:08:40 +0800
From: Aaron Lu <ziqianlu@...edance.com>
To: xupengbo <xupengbo@...o.com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>,
linux-kernel@...r.kernel.org, cgroups@...r.kernel.org
Subject: Re: [PATCH] sched/fair: Fix unfairness caused by stalled
tg_load_avg_contrib when the last task migrates out.
On Mon, Aug 04, 2025 at 09:03:26PM +0800, xupengbo wrote:
> We added a function named update_tg_load_avg_immediately() that mimics
> update_tg_load_avg() but drops its update-interval restriction, so that
> tg->load_avg is updated immediately when it is called. The new function
> is only called from update_load_avg(): there we call
> update_tg_load_avg_immediately() when the DO_DETACH flag is set and the
> task being detached is the last task on the cfs_rq; otherwise we call
> update_tg_load_avg() as before. The reason is as follows.
>
> 1. Due to the 1ms update period limitation in update_tg_load_avg(), there
> is a possibility that the reduced load_avg is not propagated to
> tg->load_avg when a task migrates out.
> 2. Even though __update_blocked_fair() traverses the leaf_cfs_rq_list and
> calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the key
> function cfs_rq_is_decayed() does not check whether
> cfs_rq->tg_load_avg_contrib is zero. Consequently, in some cases,
> __update_blocked_fair() removes cfs_rqs whose avg.load_avg has not been
> propagated to tg->load_avg.
>
> When these two events occur within the 1ms window (defined by
> NSEC_PER_MSEC in update_tg_load_avg()) and no other tasks can migrate to
> the CPU due to the cpumask constraints, the corresponding portion of
> load_avg will never be subtracted from tg->load_avg. This results in an
> inflated tg->load_avg and reduced scheduling entity (se) weight for the
> task group. If the migrating task had a large weight, the task group's
> share may deviate significantly from its expected value. This issue is
> easily reproducible in task migration scenarios.
>
Agree this is a problem.
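For anyone following along, the ratelimit in question is the early return
near the top of update_tg_load_avg(), added by the commit mentioned at the
end of this mail; roughly (paraphrased from a recent tree, exact code may
differ by version):

	/* Limit updates of tg->load_avg to at most once per ms. */
	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
		return;
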
> Initially, I discovered this bug on Android 16 (running kernel v6.12), and
> was subsequently able to reproduce it on an 8-core Ubuntu 24.04 VM with
> kernel versions v6.14 and v6.16-rc7. I believe it exists in any kernel
> version that defines both CONFIG_FAIR_GROUP_SCHED and CONFIG_SMP.
> I wrote a short C program which does just three things:
> 1. calls sched_setaffinity() to bind itself to CPU 1.
> 2. calls sched_setaffinity() to bind itself to CPU 2.
> 3. spins in an endless loop.
>
> Here is the source code.
> ```
> #define _GNU_SOURCE
> #include <sched.h>
> #include <unistd.h>
>
> int main() {
>     cpu_set_t cpuset;
>     CPU_ZERO(&cpuset);
>     CPU_SET(1, &cpuset);
>     pid_t pid = gettid();
>
>     /* 1. bind ourselves to CPU 1 */
>     if (sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset) == -1) {
>         return 1;
>     }
>
>     /* 2. then migrate ourselves to CPU 2 */
>     CPU_ZERO(&cpuset);
>     CPU_SET(2, &cpuset);
>     if (sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset) == -1) {
>         return 1;
>     }
>
>     /* 3. busy loop forever */
>     while (1)
>         ;
>
>     return 0;
> }
> ```
>
> Then I made a test script to put the tasks into separate groups.
> (Forgive me for copying and pasting those lines instead of using a
> for-loop.)
>
> ```
> #!/usr/bin/bash
>
> shares=100
> pkill 'bug_test'
> sleep 2
> rmdir /sys/fs/cgroup/cpu/bug_test_{1..4}
> mkdir /sys/fs/cgroup/cpu/bug_test_{1..4}
>
> echo $shares >/sys/fs/cgroup/cpu/bug_test_1/cpu.shares
> echo $shares >/sys/fs/cgroup/cpu/bug_test_2/cpu.shares
> echo $shares >/sys/fs/cgroup/cpu/bug_test_3/cpu.shares
> echo $shares >/sys/fs/cgroup/cpu/bug_test_4/cpu.shares
>
> nohup ./bug_test &
> proc1=$!
> echo "$proc1" >/sys/fs/cgroup/cpu/bug_test_1/cgroup.procs
> nohup ./bug_test &
> proc2=$!
> echo "$proc2" >/sys/fs/cgroup/cpu/bug_test_2/cgroup.procs
> nohup ./bug_test &
> proc3=$!
> echo "$proc3" >/sys/fs/cgroup/cpu/bug_test_3/cgroup.procs
> nohup ./bug_test &
> proc4=$!
> echo "$proc4" >/sys/fs/cgroup/cpu/bug_test_4/cgroup.procs
>
> ```
>
> After running the script a few times, we can see that some processes
> get a smaller share of the CPU while others get twice as much. This
> state persists until the processes exit.
I can reproduce it using your test code.
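In case it helps others, once all four loops have settled on CPU 2 the
imbalance is easy to see from the per-task %CPU, e.g. with:

    ps -C bug_test -o pid,psr,pcpu,comm

With four groups of equal cpu.shares, each task should get roughly 25% of
that CPU, but with the bug some tasks end up with about twice the %CPU of
others.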
> When a task is migrated away from a cfs_rq, dequeue_load_avg() subtracts
> its contribution from avg.load_sum and avg.load_avg. Sometimes load_sum
> is reduced to zero, sometimes not. If load_sum drops to zero, the cfs_rq
> will soon be removed from the leaf_cfs_rq_list, so __update_blocked_fair()
> can no longer update it.
>
> Link: https://lore.kernel.org/cgroups/20210518125202.78658-2-odin@uged.al/
> In that patch, Odin proposed adding a check in cfs_rq_is_decayed() for
> whether cfs_rq->tg_load_avg_contrib is zero. However, it appears that
> the patch was not merged. In fact, a check in cfs_rq_is_decayed()
> similar to the one in update_tg_load_avg() on the size of the delta
> value would also effectively resolve this issue. Such a check would
> block (2.) above: if the delta is too large, cfs_rq_is_decayed()
> returns false, the cfs_rq remains on the leaf_cfs_rq_list, and
> __update_blocked_fair() eventually updates it once the 1ms limit has
> passed. The only consideration is whether adding a check on
> cfs_rq->tg_load_avg_contrib to cfs_rq_is_decayed() increases coupling.
>
Performance wise, doing an immediate update of tg->load_avg on the detach
path should be OK: the last time I ran those tests, the migration path
that leads to updates of tg->load_avg was mostly the task wakeup path.
I also did some quick tests using hackbench and netperf on an Intel EMR
system and didn't notice anything problematic performance wise with your
change here.
That said, I think adding a cfs_rq->tg_load_avg_contrib check in
cfs_rq_is_decayed() makes more sense to me: if a cfs_rq still has a
contribution to its tg but its load_avg is 0, it should stay on that list
and have its contrib synced with its load_avg (down to zero) by the
__update_blocked_fair() path once that 1ms window has passed.
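To make that concrete, a minimal, untested sketch of that direction would
be an extra early exit in cfs_rq_is_decayed(), next to its existing
load/load_avg checks:

	/*
	 * Sketch only: keep the cfs_rq on the leaf_cfs_rq_list while it
	 * still carries a contribution in tg->load_avg, so that
	 * __update_blocked_fair() gets another chance to sync
	 * tg_load_avg_contrib down to zero once the 1ms ratelimit window
	 * has passed.
	 */
	if (cfs_rq->tg_load_avg_contrib)
		return false;

That keeps the detach path cheap and leaves the ratelimit in
update_tg_load_avg() untouched.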
> Signed-off-by: xupengbo <xupengbo@...o.com>
Maybe add a Fixes tag for commit 1528c661c24b ("sched/fair: Ratelimit
update to tg->load_avg")?
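I.e., something like this tag above your Signed-off-by:

    Fixes: 1528c661c24b ("sched/fair: Ratelimit update to tg->load_avg")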
Thanks,
Aaron
> ---
> kernel/sched/fair.c | 50 ++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 49 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b173a059315c..97feba367be9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4065,6 +4065,45 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> return true;
> }
>
> +/* only called in update_load_avg() */
> +static inline void update_tg_load_avg_immediately(struct cfs_rq *cfs_rq)
> +{
> + long delta;
> + u64 now;
> +
> + /*
> + * No need to update load_avg for root_task_group as it is not used.
> + */
> + if (cfs_rq->tg == &root_task_group)
> + return;
> +
> + /* rq has been offline and doesn't contribute to the share anymore: */
> + if (!cpu_active(cpu_of(rq_of(cfs_rq))))
> + return;
> +
> + /*
> + * Under normal circumstances, for migration heavy workloads, access
> + * to tg->load_avg can be unbound. Limit the update rate to at most
> + * once per ms.
> + * However, when the last task is migrating off this cpu, we must
> + * update tg->load_avg immediately. Otherwise, if this cfs_rq stays
> + * idle forever due to cpumask constraints and is removed from the
> + * leaf_cfs_rq_list, the mismatch between cfs_rq->avg.load_avg (which
> + * may be zero) and cfs_rq->tg_load_avg_contrib (the stalled load_avg
> + * of the last task) can never be corrected, leading to a significant
> + * error in se.weight for this group.
> + * We retain the delta filter below because it is not the main cause
> + * of this bug, so we are being conservative.
> + */
> + now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> + delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> + if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> + atomic_long_add(delta, &cfs_rq->tg->load_avg);
> + cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> + cfs_rq->last_update_tg_load_avg = now;
> + }
> +}
> +
> /**
> * update_tg_load_avg - update the tg's load avg
> * @cfs_rq: the cfs_rq whose avg changed
> @@ -4449,6 +4488,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
>
> static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
>
> +static inline void update_tg_load_avg_immediately(struct cfs_rq *cfs_rq) {}
> +
> static inline void clear_tg_offline_cfs_rqs(struct rq *rq) {}
>
> static inline int propagate_entity_load_avg(struct sched_entity *se)
> @@ -4747,9 +4788,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> /*
> * DO_DETACH means we're here from dequeue_entity()
> * and we are migrating task out of the CPU.
> + *
> + * At this point, nr_queued has not been decremented yet.
> + * If cfs_rq->nr_queued == 1, the last cfs task is being
> + * migrated off this cfs_rq.
> */
> detach_entity_load_avg(cfs_rq, se);
> - update_tg_load_avg(cfs_rq);
> + if (cfs_rq->nr_queued == 1)
> + update_tg_load_avg_immediately(cfs_rq);
> + else
> + update_tg_load_avg(cfs_rq);
> } else if (decayed) {
> cfs_rq_util_change(cfs_rq, 0);
>
> --
> 2.43.0
>