Message-ID: <CAKfTPtC_xo7HzagsQ2vMavuimWeCnYWanCydN3=Jv3GJsvWQPg@mail.gmail.com>
Date: Fri, 15 Aug 2025 15:52:49 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Carl-Elliott Bilodeau-Savaria <carl-elliott.bilodeau-savaria@...l.mcgill.ca>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"peterz@...radead.org" <peterz@...radead.org>, "mingo@...hat.com" <mingo@...hat.com>,
"juri.lelli@...hat.com" <juri.lelli@...hat.com>,
"dietmar.eggemann@....com" <dietmar.eggemann@....com>, "rostedt@...dmis.org" <rostedt@...dmis.org>,
"bsegall@...gle.com" <bsegall@...gle.com>, "mgorman@...e.de" <mgorman@...e.de>,
"vschneid@...hat.com" <vschneid@...hat.com>
Subject: Re: sched: cgroup cpu.weight unfairness for intermittent tasks on wake-up
On Wed, 13 Aug 2025 at 02:57, Carl-Elliott Bilodeau-Savaria
<carl-elliott.bilodeau-savaria@...l.mcgill.ca> wrote:
>
> Hi Vincent,
> Thanks for the follow-up. Answers below.
>
> > How long does it take for periodic load balance to migrate the task back on the right CPU?
>
> I instrumented the kernel to mark each "conflict", defined as both high-weight tasks being co-located on one CPU. For conflicts resolved specifically by periodic load balance, the delay from the second intermittent task's wake-up to fairness being restored was:
>
> - median: 432 ms
> - mean: 733 ms
> - p95: 2,507 ms
>
> These figures are consistent with the earlier observation that the bad placement often persists for several hundred milliseconds (occasionally multiple seconds), aligning with the ~32% combined-throughput drop for the two intermittent tasks when the low-weight tasks are present.
>
> > Do you have more details on your topology?
>
> - Single socket, single NUMA node; 8 physical cores with SMT (16 logical CPUs).
> - Caches: per-core private L1d/L1i/L2; shared L3 across the package.
> - CPU 0 and 1 are different cores (not SMT siblings). Sibling pairs:
> (0,8), (1,9), (2,10), (3,11), (4,12), (5,13), (6,14), (7,15)
Okay, I've been able to reproduce your problem. It comes from the tasks'
CPU affinity, which screws up load balancing. Your 3 tasks are pinned to
CPU0 and CPU1, so they can't migrate anywhere else. Because those CPUs
are fully busy or overloaded, we try to balance load instead of number
of tasks, but the imbalance is computed over the whole scheduling
domain. So even though CPU1 runs the lw task (weight 1) and CPU0 runs
the 2 hw tasks (group weight 10000, i.e. 5000 each), the average "load"
per CPU is only 625 (=10001/16). The load balance tries to migrate
625-1=624 worth of load from CPU0 to CPU1, but the load of one hw task
is 5000, which is higher than 624, so the load balance fails and leaves
other CPUs of the domain a chance to pull some task/load instead. That
is not possible here because of the CPU affinity. Side note: all of this
is also affected by the load of other tasks running on the socket.
When a load balance fails, the next one becomes less strict about the
amount of load to migrate, because the check is "(task's load >> number
of failed load balances) <= imbalance". So 5000 is too big and fails,
then 2500, then 1250, then 625, and finally 312, which is lower than
624, so one hw task is migrated to CPU1.
The periodic load balance interval of a busy CPU is 16 (the busy factor)
* the domain's weight = 16*16 = 256ms, so you need between 3 and 4
periods before the task is migrated, i.e. 768-1024ms.
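Again as a user-space model rather than kernel code, the retry behaviour
works out like this (the 624 imbalance and 256ms period are taken from
the example above):

  #include <stdio.h>

  int main(void)
  {
          unsigned long task_load = 5000;  /* one hw task */
          unsigned long imbalance = 624;   /* from the example above */
          unsigned int period_ms = 256;    /* busy balance interval */
          unsigned int failed = 0;

          /* 5000, 2500, 1250 and 625 are all above 624; 312 finally fits */
          while ((task_load >> failed) > imbalance)
                  failed++;

          printf("hw task can migrate after %u failed balances (~%u-%u ms)\n",
                 failed, (failed - 1) * period_ms, failed * period_ms);
          return 0;
  }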
Vincent
>
> Cheers,
>
> Carl-Elliott
>
> ________________________________________
> From: Vincent Guittot <vincent.guittot@...aro.org>
> Sent: Monday, August 4, 2025 11:56 AM
> To: Carl-Elliott Bilodeau-Savaria
> Cc: linux-kernel@...r.kernel.org; peterz@...radead.org; mingo@...hat.com; juri.lelli@...hat.com; dietmar.eggemann@....com; rostedt@...dmis.org; bsegall@...gle.com; mgorman@...e.de; vschneid@...hat.com; sched@...r.kernel.org
> Subject: Re: sched: cgroup cpu.weight unfairness for intermittent tasks on wake-up
>
> Hi Carl-Elliott,
>
> On Sat, 26 Jul 2025 at 22:59, Carl-Elliott Bilodeau-Savaria
> <carl-elliott.bilodeau-savaria@...l.mcgill.ca> wrote:
> >
> > Hi everyone,
> >
> > Apologies for the noise. I'm gently pinging on this scheduling question from about 10 days ago as it may have been missed. I have now added the scheduler mailing list and the relevant maintainers to the CC list.
> >
> > I've also created a small GitHub repo to reproduce the issue: https://github.com/normal-account/sched-wakeup-locality-test
> >
> > Any insights would be greatly appreciated.
> >
> > Thanks,
> > Carl-Elliott
> >
> > --- [Original Email Below] ---
> >
> > Hi sched maintainers,
> >
> > I'm observing a CPU fairness issue in kernel 6.14 related to intermittent ("bursty") workloads under cgroup v2 with cpu.weight, where tasks do not receive CPU time proportional to their configured weights.
> >
> >
> > SYSTEM & TEST SETUP
> > -------------------------
> >
> > System Details:
> > - CPU: Intel Core i9-9980HK (8 cores, 16 threads, single L3 cache).
> > - CONFIG_PREEMPT=y
> > - CPU governor: performance
> > - SMT: Enabled
> >
> > Workloads:
> > - continuous-burn: A simple, non-stop while(1) loop.
> > - intermittent-burn: A loop that burns CPU for 3 seconds, then sleeps for 3 seconds.
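For reference, a minimal C sketch of the two workloads described above
(hypothetical code written to match the descriptions, not taken from the
linked repo):

  #include <time.h>
  #include <unistd.h>

  static volatile unsigned long iterations;  /* throughput measured in loop iterations */

  /* continuous-burn: a non-stop busy loop */
  static void continuous_burn(void)
  {
          for (;;)
                  iterations++;
  }

  /* intermittent-burn: burn CPU for 3 seconds, then sleep for 3 seconds */
  static void intermittent_burn(void)
  {
          for (;;) {
                  time_t start = time(NULL);
                  while (time(NULL) - start < 3)
                          iterations++;
                  sleep(3);
          }
  }

  int main(int argc, char **argv)
  {
          if (argc > 1)  /* any argument selects the intermittent variant */
                  intermittent_burn();
          else
                  continuous_burn();
          return 0;
  }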
> >
> > Cgroup Configuration:
> >
> > parent/ (cpuset.cpus="0-1")
> > ├── lw/ (cpu.weight=1)
> > │ └── 1x continuous-burn process
> > └── hw/ (cpu.weight=10000)
> > └── 2x intermittent-burn processes
> >
> > The goal is to have the two intermittent processes in the hw group strongly prioritized over the single continuous process in the lw group on CPUs 0 and 1.
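A rough C sketch of that cgroup v2 hierarchy being set up (again
hypothetical; paths and values follow the description above, with error
handling kept minimal):

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/stat.h>

  static void write_str(const char *path, const char *val)
  {
          FILE *f = fopen(path, "w");
          if (!f || fputs(val, f) == EOF) {
                  perror(path);
                  exit(1);
          }
          fclose(f);
  }

  int main(void)
  {
          /* make the cpu and cpuset controllers available below the root */
          write_str("/sys/fs/cgroup/cgroup.subtree_control", "+cpu +cpuset");

          /* parent/ restricted to CPUs 0-1 */
          mkdir("/sys/fs/cgroup/parent", 0755);
          write_str("/sys/fs/cgroup/parent/cpuset.cpus", "0-1");
          write_str("/sys/fs/cgroup/parent/cgroup.subtree_control", "+cpu");

          /* lw/ with cpu.weight=1, hw/ with cpu.weight=10000 */
          mkdir("/sys/fs/cgroup/parent/lw", 0755);
          write_str("/sys/fs/cgroup/parent/lw/cpu.weight", "1");
          mkdir("/sys/fs/cgroup/parent/hw", 0755);
          write_str("/sys/fs/cgroup/parent/hw/cpu.weight", "10000");

          /* processes are then attached by writing their PIDs to
           * parent/lw/cgroup.procs and parent/hw/cgroup.procs */
          return 0;
  }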
> >
> >
> > PROBLEM SCENARIO & ANALYSIS
> > -------------------------------------
> >
> > The issue stems from the scheduler's wake-up path logic. Here is a typical sequence of events that leads to the unfairness.
> >
> > 1. The intermittent-0 process, previously running on CPU 0, finishes its burst and goes to sleep.
> > CPU 0 rq: [ (idle) ]
> > CPU 1 rq: [ continuous-1 (running) ]
> > (Sleeping tasks: intermittent-0, intermittent-1)
> >
> > 2. intermittent-1 wakes up. Its previous CPU (CPU 1) is busy, so it is placed on CPU 0 (idle) by `select_idle_sibling()`:
> > CPU 0 rq: [ intermittent-1 (running) ]
> > CPU 1 rq: [ continuous-1 (running) ]
> > (Sleeping tasks: intermittent-0)
> >
> > 3. Finally, intermittent-0 wakes up. No CPUs are idle, so it's placed back on its previous CPU's runqueue (CPU 0), where it has to wait for intermittent-1.
> > CPU 0 rq: [ intermittent-1 (running), intermittent-0 (waiting) ]
> > CPU 1 rq: [ continuous-1 (running) ]
> >
> > Now, both high-weight tasks are competing for CPU 0, while the low-weight task runs unopposed on CPU 1.
> >
> > This unfair state can persist until periodic load balancing eventually migrates one of the tasks, but due to the frequent sleep/wake pattern, the initial placement decision has a disproportionately large effect.
>
> How long does it take for periodic load balance to migrate the task
> back on the right CPU?
>
> >
> >
> > OBSERVED IMPACT
> > ---------------------
> >
> > With the continuous-burn task present, the combined throughput (measured via loop iterations) of the two intermittent-burn tasks drops by ~32% compared to running them alone.
>
> 32% is quite large for a 3 sec running / 3 sec sleeping pattern. Looks
> like the periodic load balance takes too much time before fixing the
> unfairness.
>
> Do you have more details on your topology?
>
> >
> > This results in the low-weight task receiving a disproportionate share of CPU time, contrary to the cpu.weight configuration.
> >
> >
> > QUESTIONS
> > -------------
> >
> > I understand that EEVDF's wake-up placement logic favors idle CPUs to minimize latency, which makes sense in general.
> >
> > However, in this mixed-workload scenario, that logic seems to override cgroup fairness expectations.
> > Wake-up placement leads to high-weight tasks dog-piling on one CPU, leaving a low-weight task uncontended on another.
> >
> > - Is this considered a known issue or an expected trade-off under EEVDF's design?
> > - Are there any existing tunables (e.g. sched_features or sysctls) to adjust wake-up placement behavior or increase weight enforcement in such scenarios?
> >
> >
> > Thank you for your help!
> >
> > (Note: Using RT scheduling isn’t viable in the real-world version of this workload, so I’m specifically interested in fairness within CFS/EEVDF.)
> >
> > ________________________________________
> > From: Carl-Elliott Bilodeau-Savaria
> > Sent: Tuesday, July 15, 2025 6:44 PM
> > To: linux-kernel@...r.kernel.org
> > Cc: peterz@...radead.org
> > Subject: sched: cgroup cpu.weight unfairness for intermittent tasks on wake-up
> >