Message-ID:
<YT3PR01MB51060ABC887AAC68F3713BCFC22AA@YT3PR01MB5106.CANPRD01.PROD.OUTLOOK.COM>
Date: Wed, 13 Aug 2025 00:57:08 +0000
From: Carl-Elliott Bilodeau-Savaria
<carl-elliott.bilodeau-savaria@...l.mcgill.ca>
To: Vincent Guittot <vincent.guittot@...aro.org>
CC: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"peterz@...radead.org" <peterz@...radead.org>, "mingo@...hat.com"
<mingo@...hat.com>, "juri.lelli@...hat.com" <juri.lelli@...hat.com>,
"dietmar.eggemann@....com" <dietmar.eggemann@....com>, "rostedt@...dmis.org"
<rostedt@...dmis.org>, "bsegall@...gle.com" <bsegall@...gle.com>,
"mgorman@...e.de" <mgorman@...e.de>, "vschneid@...hat.com"
<vschneid@...hat.com>
Subject: Re: sched: cgroup cpu.weight unfairness for intermittent tasks on
wake-up
Hi Vincent,
Thanks for the follow-up. Answers below.
> How long does it take for periodic load balance to migrate the task back on the right CPU?
I instrumented the kernel to mark each "conflict", defined as both high-weight tasks being co-located on one CPU. For conflicts resolved specifically by periodic load balance, the delay from the second intermittent task's wake-up to fairness being restored was:
- median: 432 ms
- mean: 733 ms
- p95: 2,507 ms
These figures match the earlier observation that the bad placement often persists for several hundred milliseconds (occasionally multiple seconds), and they line up with the ~32% combined-throughput drop for the two intermittent tasks when the low-weight task is present.
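For reference, the summary statistics were computed from the per-conflict delays essentially as follows (the sample values below are illustrative, not the measured dataset):

```python
import statistics

def summarize(delays_ms):
    """Median, mean and 95th percentile of conflict-resolution delays (ms)."""
    return {
        "median": statistics.median(delays_ms),
        "mean": statistics.fmean(delays_ms),
        # quantiles(n=100) yields 99 cut points; index 94 is the p95.
        "p95": statistics.quantiles(delays_ms, n=100, method="inclusive")[94],
    }

# Illustrative sample only -- not the measured data:
print(summarize([120, 250, 432, 510, 733, 900, 1400, 2507]))
```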
> Do you have more details on your topology?
- Single socket, single NUMA node; 8 physical cores with SMT (16 logical CPUs).
- Caches: per-core private L1d/L1i/L2; shared L3 across the package.
- CPU 0 and 1 are different cores (not SMT siblings). Sibling pairs:
(0,8), (1,9), (2,10), (3,11), (4,12), (5,13), (6,14), (7,15)
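The pairs come from /sys/devices/system/cpu/cpu*/topology/thread_siblings_list; a small helper to parse that format (illustrative sketch, the helper name is mine):

```python
def parse_siblings_list(text):
    """Parse a sysfs thread_siblings_list value such as "0,8" or "0-1"
    into a sorted tuple of logical CPU ids."""
    cpus = []
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return tuple(sorted(cpus))

print(parse_siblings_list("0,8"))  # -> (0, 8)
```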
Cheers,
Carl-Elliott
________________________________________
From: Vincent Guittot <vincent.guittot@...aro.org>
Sent: Monday, August 4, 2025 11:56 AM
To: Carl-Elliott Bilodeau-Savaria
Cc: linux-kernel@...r.kernel.org; peterz@...radead.org; mingo@...hat.com; juri.lelli@...hat.com; dietmar.eggemann@....com; rostedt@...dmis.org; bsegall@...gle.com; mgorman@...e.de; vschneid@...hat.com; sched@...r.kernel.org
Subject: Re: sched: cgroup cpu.weight unfairness for intermittent tasks on wake-up
Hi Carl-Elliott,
On Sat, 26 Jul 2025 at 22:59, Carl-Elliott Bilodeau-Savaria
<carl-elliott.bilodeau-savaria@...l.mcgill.ca> wrote:
>
> Hi everyone,
>
> Apologies for the noise. I'm gently pinging on this scheduling question from about 10 days ago as it may have been missed. I have now added the scheduler mailing list and the relevant maintainers to the CC list.
>
> I've also created a small GitHub repo to reproduce the issue: https://github.com/normal-account/sched-wakeup-locality-test
>
> Any insights would be greatly appreciated.
>
> Thanks,
> Carl-Elliott
>
> --- [Original Email Below] ---
>
> Hi sched maintainers,
>
> I'm observing a CPU fairness issue in kernel 6.14 related to intermittent ("bursty") workloads under cgroup v2 with cpu.weight, where tasks do not receive CPU time proportional to their configured weights.
>
>
> SYSTEM & TEST SETUP
> -------------------------
>
> System Details:
> - CPU: Intel Core i9-9980HK (8 cores, 16 threads, single L3 cache).
> - CONFIG_PREEMPT=y
> - CPU governor: performance
> - SMT: Enabled
>
> Workloads:
> - continuous-burn: A simple, non-stop while(1) loop.
> - intermittent-burn: A loop that burns CPU for 3 seconds, then sleeps for 3 seconds.
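In Python terms, the two workloads are equivalent to the sketch below (the repo's implementation may differ in detail; iteration counts serve as the throughput metric):

```python
import time

def burn_for(seconds):
    """Busy-loop for `seconds`; returns the iteration count (throughput metric)."""
    n = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        n += 1
    return n

def continuous_burn():
    while True:          # non-stop while(1) loop
        burn_for(1.0)

def intermittent_burn():
    while True:
        burn_for(3.0)    # burn CPU for 3 seconds...
        time.sleep(3.0)  # ...then sleep for 3 seconds
```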
>
> Cgroup Configuration:
>
> parent/ (cpuset.cpus="0-1")
> ├── lw/ (cpu.weight=1)
> │ └── 1x continuous-burn process
> └── hw/ (cpu.weight=10000)
> └── 2x intermittent-burn processes
>
> The goal is to have the two intermittent processes in the hw group strongly prioritized over the single continuous process in the lw group on CPUs 0 and 1.
>
>
> PROBLEM SCENARIO & ANALYSIS
> -------------------------------------
>
> The issue stems from the scheduler's wake-up path logic. Here is a typical sequence of events that leads to the unfairness.
>
> 1. The intermittent-0 process, previously running on CPU 0, finishes its burst and goes to sleep.
> CPU 0 rq: [ (idle) ]
> CPU 1 rq: [ continuous-1 (running) ]
> (Sleeping tasks: intermittent-0, intermittent-1)
>
> 2. intermittent-1 wakes up. Its previous CPU (CPU 1) is busy, so it is placed on CPU 0 (idle) by `select_idle_sibling()`:
> CPU 0 rq: [ intermittent-1 (running) ]
> CPU 1 rq: [ continuous-1 (running) ]
> (Sleeping tasks: intermittent-0)
>
> 3. Finally, intermittent-0 wakes up. No CPUs are idle, so it's placed back on its previous CPU's runqueue (CPU 0), where it has to wait for intermittent-1.
> CPU 0 rq: [ intermittent-1 (running), intermittent-0 (waiting) ]
> CPU 1 rq: [ continuous-1 (running) ]
>
> Now, both high-weight tasks are competing for CPU 0, while the low-weight task runs unopposed on CPU 1.
>
> This unfair state can persist until periodic load balancing eventually migrates one of the tasks, but due to the frequent sleep/wake pattern, the initial placement decision has a disproportionately large effect.
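The sequence above can be reproduced with a toy model of the wake-up path (grossly simplified -- the real select_idle_sibling() considers topology, caches, and more):

```python
def pick_cpu(prev_cpu, rqs):
    """Toy wake-up placement: previous CPU if idle, else any idle CPU,
    else fall back to the previous CPU's runqueue."""
    if not rqs[prev_cpu]:
        return prev_cpu
    for cpu, rq in rqs.items():
        if not rq:
            return cpu
    return prev_cpu

rqs = {0: [], 1: ["continuous-1"]}   # step 1: intermittent-0 just went to sleep
cpu = pick_cpu(1, rqs)               # step 2: prev CPU 1 busy -> idle CPU 0
rqs[cpu].append("intermittent-1")
cpu = pick_cpu(0, rqs)               # step 3: nothing idle -> back to CPU 0
rqs[cpu].append("intermittent-0")
print(rqs)  # both high-weight tasks end up queued on CPU 0
```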
How long does it take for periodic load balance to migrate the task
back on the right CPU?
>
>
> OBSERVED IMPACT
> ---------------------
>
> With the continuous-burn task present, the combined throughput (measured via loop iterations) of the two intermittent-burn tasks drops by ~32% compared to running them alone.
32% is quite large for a 3 s running / 3 s sleeping pattern. It looks
like the periodic load balance takes too much time before fixing the
unfairness.
Do you have more details on your topology?
>
> This results in the low-weight task receiving a disproportionate share of CPU time, contrary to the cpu.weight configuration.
>
>
> QUESTIONS
> -------------
>
> I understand that EEVDF's wake-up placement logic favors idle CPUs to minimize latency, which makes sense in general.
>
> However, in this mixed-workload scenario, that logic seems to override cgroup fairness expectations.
> Wake-up placement leads to high-weight tasks dog-piling on one CPU, leaving a low-weight task uncontended on another.
>
> - Is this a known issue / an expected trade-off under EEVDF's design?
> - Are there any existing tunables (e.g. sched_features or sysctls) to adjust wake-up placement behavior or increase weight enforcement in such scenarios?
>
>
> Thank you for your help!
>
> (Note: Using RT scheduling isn’t viable in the real-world version of this workload, so I’m specifically interested in fairness within CFS/EEVDF.)
>
> ________________________________________
> From: Carl-Elliott Bilodeau-Savaria
> Sent: Tuesday, July 15, 2025 6:44 PM
> To: linux-kernel@...r.kernel.org
> Cc: peterz@...radead.org
> Subject: sched: cgroup cpu.weight unfairness for intermittent tasks on wake-up