Message-ID: <CAKfTPtDexWX5N0jaMRqVQEnURSaPhVa6XQr_TexpU2SGU-e=9A@mail.gmail.com>
Date: Tue, 5 Aug 2025 11:17:29 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Aaron Lu <ziqianlu@...edance.com>
Cc: xupengbo <xupengbo@...o.com>, Ingo Molnar <mingo@...hat.com>, 
	Peter Zijlstra <peterz@...radead.org>, Juri Lelli <juri.lelli@...hat.com>, 
	Dietmar Eggemann <dietmar.eggemann@....com>, Steven Rostedt <rostedt@...dmis.org>, 
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, 
	Valentin Schneider <vschneid@...hat.com>, linux-kernel@...r.kernel.org, cgroups@...r.kernel.org
Subject: Re: [PATCH] sched/fair: Fix unfairness caused by stalled
 tg_load_avg_contrib when the last task migrates out.

On Tue, 5 Aug 2025 at 11:08, Aaron Lu <ziqianlu@...edance.com> wrote:
>
> On Mon, Aug 04, 2025 at 09:03:26PM +0800, xupengbo wrote:
> > We add a function named update_tg_load_avg_immediately() that mimics
> > update_tg_load_avg() but drops the update interval restriction, so that
> > tg->load_avg is updated immediately when it is called. It is only called
> > from update_load_avg(): when the DO_DETACH flag is set and the task being
> > detached is the last task on the cfs_rq, we call
> > update_tg_load_avg_immediately() instead of update_tg_load_avg(). The
> > reason is as follows.
> >
> > 1. Due to the 1ms update period limitation in update_tg_load_avg(), the
> > reduced load_avg may not be propagated to tg->load_avg when a task
> > migrates out.
> > 2. Even though __update_blocked_fair() traverses leaf_cfs_rq_list and
> > calls update_tg_load_avg() for cfs_rqs that are not fully decayed, the
> > key function cfs_rq_is_decayed() does not check whether
> > cfs_rq->tg_load_avg_contrib is zero. Consequently, in some cases,
> > __update_blocked_fair() removes from the list cfs_rqs whose avg.load_avg
> > has not yet been folded into tg->load_avg.
> >
> > When these two events occur within the 1ms window (defined by
> > NSEC_PER_MSEC in update_tg_load_avg()) and no other tasks can migrate to
> > the CPU due to the cpumask constraints, the corresponding portion of
> > load_avg will never be subtracted from tg->load_avg. This results in an
> > inflated tg->load_avg and reduced scheduling entity (se) weight for the
> > task group. If the migrating task had a large weight, the task group's
> > share may deviate significantly from its expected value. This issue is
> > easily reproducible in task migration scenarios.
> >
>
> Agree this is a problem.
>
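
For reference, the ratelimit being bypassed in such a window is the early
return in update_tg_load_avg() added by the ratelimit commit mentioned
below. Paraphrasing current mainline (details may differ slightly across
the versions mentioned in this thread):

```
	/*
	 * For migration heavy workloads, access to tg->load_avg can be
	 * unbound. Limit the update rate to at most once per ms.
	 */
	now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
		return;
```

If the last task leaves the cfs_rq inside that window, the pending delta is
only picked up later if the cfs_rq is still on leaf_cfs_rq_list.
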
> > Initially, I discovered this bug on Android 16 (running kernel v6.12), and
> > was subsequently able to reproduce it on an 8-core Ubuntu 24.04 VM with
> > kernel versions v6.14 and v6.16-rc7. I believe it exists in any kernel
> > version that defines both CONFIG_FAIR_GROUP_SCHED and CONFIG_SMP.
> > I wrote a short C program that does just three things:
> >   1. call sched_setaffinity() to bind itself to cpu 1.
> >   2. call sched_setaffinity() to bind itself to cpu 2.
> >   3. loop forever.
> >
> > Here is the source code.
> > ```
> > #define _GNU_SOURCE
> > #include <sched.h>
> > #include <unistd.h>
> > int main() {
> >   cpu_set_t cpuset;
> >   CPU_ZERO(&cpuset);
> >   CPU_SET(1, &cpuset);
> >   pid_t pid = gettid();
> >
> >   if (sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset) == -1) {
> >     return 1;
> >   }
> >
> >   CPU_ZERO(&cpuset);
> >   CPU_SET(2, &cpuset);
> >
> >   if (sched_setaffinity(pid, sizeof(cpu_set_t), &cpuset) == -1) {
> >     return 1;
> >   }
> >   while (1)
> >     ;
> >   return 0;
> > }
> > ```
> >
> > Then I made a test script to add tasks into groups.
> > (Forgive me for just copying and pasting those lines but not using
> > a for-loop)
> >
> > ```
> > #!/usr/bin/bash
> >
> > shares=100
> > pkill 'bug_test'
> > sleep 2
> > rmdir /sys/fs/cgroup/cpu/bug_test_{1..4}
> > mkdir /sys/fs/cgroup/cpu/bug_test_{1..4}
> >
> > echo $shares >/sys/fs/cgroup/cpu/bug_test_1/cpu.shares
> > echo $shares >/sys/fs/cgroup/cpu/bug_test_2/cpu.shares
> > echo $shares >/sys/fs/cgroup/cpu/bug_test_3/cpu.shares
> > echo $shares >/sys/fs/cgroup/cpu/bug_test_4/cpu.shares
> >
> > nohup ./bug_test &
> > proc1=$!
> > echo "$proc1" >/sys/fs/cgroup/cpu/bug_test_1/cgroup.procs
> > nohup ./bug_test &
> > proc2=$!
> > echo "$proc2" >/sys/fs/cgroup/cpu/bug_test_2/cgroup.procs
> > nohup ./bug_test &
> > proc3=$!
> > echo "$proc3" >/sys/fs/cgroup/cpu/bug_test_3/cgroup.procs
> > nohup ./bug_test &
> > proc4=$!
> > echo "$proc4" >/sys/fs/cgroup/cpu/bug_test_4/cgroup.procs
> >
> > ```
> >
> > After running the script several times, we can find that some processes
> > get a smaller share of the CPU while others get twice as much, and this
> > state persists until the processes exit.
>
> I can reproduce it using your test code.
>
> > When a task is migrated off a cfs_rq, dequeue_load_avg() subtracts its
> > contribution from avg.load_sum and avg.load_avg. Sometimes the cfs_rq's
> > load_sum drops to zero, sometimes it does not. If it does drop to zero,
> > the cfs_rq will soon be removed from leaf_cfs_rq_list, so
> > __update_blocked_fair() cannot update it anymore.
> >
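
To make the removal described above concrete, the relevant part of
__update_blocked_fair() looks roughly like this on current kernels
(abbreviated; exact context varies between versions):

```
	for_each_leaf_cfs_rq_safe(rq, cfs_rq, pos) {
		if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq))
			update_tg_load_avg(cfs_rq);	/* still rate limited */

		/* ... */

		/*
		 * A fully "decayed" cfs_rq is unlinked here even when its
		 * tg_load_avg_contrib was never folded back into
		 * tg->load_avg, so nothing syncs it afterwards.
		 */
		if (cfs_rq_is_decayed(cfs_rq))
			list_del_leaf_cfs_rq(cfs_rq);
	}
```
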
> > Link: https://lore.kernel.org/cgroups/20210518125202.78658-2-odin@uged.al/
> > In that patch, Odin proposed adding a check in cfs_rq_is_decayed() to
> > determine whether cfs_rq->tg_load_avg_contrib is zero, but it appears the
> > patch was never merged. In fact, if cfs_rq_is_decayed() had a check on the
> > size of the delta, similar to the one in update_tg_load_avg(), this issue
> > could also be effectively resolved: it would address point 2. above,
> > because if the delta is too large, cfs_rq_is_decayed() returns false, the
> > cfs_rq remains on leaf_cfs_rq_list, and __update_blocked_fair() eventually
> > updates it once the 1ms limit has passed. The only consideration is
> > whether adding a check on cfs_rq->tg_load_avg_contrib to
> > cfs_rq_is_decayed() increases coupling.
> >
>
> Performance wise, doing an immediate update of tg->load_avg on the detach
> path should be OK: last time I ran those tests, the migrations that lead
> to tg->load_avg updates came mostly from the task wakeup path. I also did
> some quick tests with hackbench and netperf on an Intel EMR and didn't
> notice anything problematic performance wise with your change here.
>
> That said, I think adding a cfs_rq->tg_load_avg_contrib check in
> cfs_rq_is_decayed() makes more sense to me, because if a cfs_rq still has

Adding a check of cfs_rq->tg_load_avg_contrib in cfs_rq_is_decayed()
makes more sense to me too, because it doesn't bypass the ratelimit.

> contrib to its tg but its load_avg is 0, it should stay on that list and
> have its contrib synced down to its (zero) load_avg by the
> __update_blocked_fair() path once that 1ms window has passed.
>
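
An illustrative sketch of that alternative (untested, the existing checks
in cfs_rq_is_decayed() left as they are):

```
static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
{
	/* ... existing weight / PELT / child checks unchanged ... */

	/*
	 * A cfs_rq that still carries a stale contribution to
	 * tg->load_avg is not really decayed: keep it on
	 * leaf_cfs_rq_list so __update_blocked_fair() can sync
	 * tg_load_avg_contrib down to the (zero) load_avg once the
	 * 1ms ratelimit window has passed.
	 */
	if (cfs_rq->tg_load_avg_contrib)
		return false;

	return true;
}
```
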
> > Signed-off-by: xupengbo <xupengbo@...o.com>
>
> Maybe add a Fixes tag for commit 1528c661c24b ("sched/fair: Ratelimit
> update to tg->load_avg")?
>
> Thanks,
> Aaron
>
> > ---
> >  kernel/sched/fair.c | 50 ++++++++++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 49 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b173a059315c..97feba367be9 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4065,6 +4065,45 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> >       return true;
> >  }
> >
> > +/* only called in update_load_avg() */
> > +static inline void update_tg_load_avg_immediately(struct cfs_rq *cfs_rq)
> > +{
> > +     long delta;
> > +     u64 now;
> > +
> > +     /*
> > +      * No need to update load_avg for root_task_group as it is not used.
> > +      */
> > +     if (cfs_rq->tg == &root_task_group)
> > +             return;
> > +
> > +     /* rq has been offline and doesn't contribute to the share anymore: */
> > +     if (!cpu_active(cpu_of(rq_of(cfs_rq))))
> > +             return;
> > +
> > +     /*
> > +      * Under normal circumstances, for migration heavy workloads, access
> > +      * to tg->load_avg can be unbound. Limit the update rate to at most
> > +      * once per ms.
> > +      * However when the last task is migrating from this cpu, we must
> > +      * update tg->load_avg immediately. Otherwise, if this cfs_rq becomes
> > +      * idle forever due to cpumask and is removed from leaf_cfs_rq_list,
> > +      * the huge mismatch between cfs_rq->avg.load_avg(which may be zero)
> > +      * and cfs_rq->tg_load_avg_contrib(stalled load_avg of last task)
> > +      * can never be corrected, which will lead to a significant value
> > +      * error in se.weight for this group.
> > +      * We retain this value filter below because it is not the main cause
> > +      * of this bug, so we are being conservative.
> > +      */
> > +     now = sched_clock_cpu(cpu_of(rq_of(cfs_rq)));
> > +     delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > +     if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> > +             atomic_long_add(delta, &cfs_rq->tg->load_avg);
> > +             cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> > +             cfs_rq->last_update_tg_load_avg = now;
> > +     }
> > +}
> > +
> >  /**
> >   * update_tg_load_avg - update the tg's load avg
> >   * @cfs_rq: the cfs_rq whose avg changed
> > @@ -4449,6 +4488,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
> >
> >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
> >
> > +static inline void update_tg_load_avg_immediately(struct cfs_rq *cfs_rq) {}
> > +
> >  static inline void clear_tg_offline_cfs_rqs(struct rq *rq) {}
> >
> >  static inline int propagate_entity_load_avg(struct sched_entity *se)
> > @@ -4747,9 +4788,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >               /*
> >                * DO_DETACH means we're here from dequeue_entity()
> >                * and we are migrating task out of the CPU.
> > +              *
> > +              * At this point, we have not subtracted nr_queued.
> > +              * If cfs_rq->nr_queued ==1, the last cfs task is being
> > +              * migrated from this cfs_rq.
> >                */
> >               detach_entity_load_avg(cfs_rq, se);
> > -             update_tg_load_avg(cfs_rq);
> > +             if (cfs_rq->nr_queued == 1)
> > +                     update_tg_load_avg_immediately(cfs_rq);
> > +             else
> > +                     update_tg_load_avg(cfs_rq);
> >       } else if (decayed) {
> >               cfs_rq_util_change(cfs_rq, 0);
> >
> > --
> > 2.43.0
> >
