Message-ID:
<YQXPR01MB511370FEF51299CEFBF7A511C257A@YQXPR01MB5113.CANPRD01.PROD.OUTLOOK.COM>
Date: Tue, 15 Jul 2025 22:44:09 +0000
From: Carl-Elliott Bilodeau-Savaria
<carl-elliott.bilodeau-savaria@...l.mcgill.ca>
To: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC: "peterz@...radead.org" <peterz@...radead.org>
Subject: sched: cgroup cpu.weight unfairness for intermittent tasks on wake-up
Hi sched maintainers,
I'm observing a CPU fairness issue in kernel 6.14 related to intermittent ("bursty") workloads under cgroup v2 with cpu.weight, where tasks do not receive CPU time proportional to their configured weights.
SYSTEM & TEST SETUP
-------------------------
System Details:
- CPU: Intel Core i9-9980HK (8 cores, 16 threads, single L3 cache).
- CONFIG_PREEMPT=y
- CPU governor: performance
- SMT: Enabled
Workloads (rough sketches of both are included below):
- continuous-burn: A simple, non-stop while(1) loop.
- intermittent-burn: A loop that burns CPU for 3 seconds, then sleeps for 3 seconds.
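Roughly, the two workloads look like this (simplified sketches, not the
exact test programs; the real ones also record the loop-iteration counts
used for the throughput numbers below):

/* continuous-burn: non-stop busy loop */
#include <stdint.h>

volatile uint64_t iterations;   /* volatile so the loop is not optimized away */

int main(void)
{
        for (;;)
                iterations++;
}

/* intermittent-burn: burn CPU for ~3 s, sleep ~3 s, repeat */
#include <stdint.h>
#include <time.h>
#include <unistd.h>

volatile uint64_t iterations;

int main(void)
{
        for (;;) {
                time_t start = time(NULL);

                while (time(NULL) - start < 3)   /* ~3 s busy burst */
                        iterations++;

                sleep(3);                        /* ~3 s idle */
        }
}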
Cgroup Configuration:
parent/ (cpuset.cpus="0-1")
├── lw/ (cpu.weight=1)
│   └── 1x continuous-burn process
└── hw/ (cpu.weight=10000)
    └── 2x intermittent-burn processes
The goal is to have the two intermittent processes in the hw group strongly prioritized over the single continuous process in the lw group on CPUs 0 and 1.
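For reference, the hierarchy is created along these lines (a simplified
sketch, not the exact setup code; it assumes cgroup2 is mounted at
/sys/fs/cgroup, that the cpu and cpuset controllers are available, and
that the burn processes are attached afterwards via the cgroup.procs
files of the leaf groups):

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

/* Write a single string to a cgroup control file. */
static void put(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f || fputs(val, f) == EOF) {
                perror(path);
                exit(1);
        }
        fclose(f);
}

int main(void)
{
        mkdir("/sys/fs/cgroup/parent", 0755);

        /* Enable controllers for parent and for its children. */
        put("/sys/fs/cgroup/cgroup.subtree_control", "+cpu +cpuset");
        put("/sys/fs/cgroup/parent/cgroup.subtree_control", "+cpu");

        put("/sys/fs/cgroup/parent/cpuset.cpus", "0-1");

        mkdir("/sys/fs/cgroup/parent/lw", 0755);
        mkdir("/sys/fs/cgroup/parent/hw", 0755);
        put("/sys/fs/cgroup/parent/lw/cpu.weight", "1");
        put("/sys/fs/cgroup/parent/hw/cpu.weight", "10000");

        return 0;
}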
PROBLEM SCENARIO & ANALYSIS
-------------------------------------
The issue stems from the scheduler's wake-up path logic. Here is a typical sequence of events that leads to the unfairness.
1. The intermittent-0 process, previously running on CPU 0, finishes its burst and goes to sleep.
CPU 0 rq: [ (idle) ]
CPU 1 rq: [ continuous-1 (running) ]
(Sleeping tasks: intermittent-0, intermittent-1)
2. intermittent-1 wakes up. Its previous CPU (CPU 1) is busy, so it is placed on CPU 0 (idle) by `select_idle_sibling()`:
CPU 0 rq: [ intermittent-1 (running) ]
CPU 1 rq: [ continuous-1 (running) ]
(Sleeping tasks: intermittent-0)
3. Finally, intermittent-0 wakes up. No CPUs are idle, so it's placed back on its previous CPU's runqueue (CPU 0), where it has to wait for intermittent-1.
CPU 0 rq: [ intermittent-1 (running), intermittent-0 (waiting) ]
CPU 1 rq: [ continuous-1 (running) ]
Now, both high-weight tasks are competing for CPU 0, while the low-weight task runs unopposed on CPU 1.
This unfair state can persist until periodic load balancing eventually migrates one of the tasks, but due to the frequent sleep/wake pattern, the initial placement decision has a disproportionately large effect.
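For illustration, the placement can be observed from user space with a
variant of intermittent-burn along these lines (a rough sketch), which
prints which CPU each burst starts on:

#define _GNU_SOURCE
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        volatile uint64_t iterations = 0;

        for (;;) {
                /* Where did the wake-up place us this time? */
                printf("pid %d woke on CPU %d\n",
                       (int)getpid(), sched_getcpu());
                fflush(stdout);

                time_t start = time(NULL);
                while (time(NULL) - start < 3)
                        iterations++;

                sleep(3);
        }
}

In the dog-piled state described above, both intermittent tasks report
the same CPU at the start of their bursts while continuous-1 has the
other CPU to itself.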
OBSERVED IMPACT
---------------------
With the continuous-burn task present, the combined throughput (measured via loop iterations) of the two intermittent-burn tasks drops by ~32% compared to running them alone.
This results in the low-weight task receiving a disproportionate share of CPU time, contrary to the cpu.weight configuration.
QUESTIONS
-------------
I understand that EEVDF's wake-up placement logic favors idle CPUs to minimize latency, which makes sense in general.
However, in this mixed-workload scenario, that logic seems to override cgroup fairness expectations.
Wake-up placement leads to high-weight tasks dog-piling on one CPU, leaving a low-weight task uncontended on another.
- Is this considered a known issue or an expected trade-off under EEVDF's design?
- Are there any existing tunables (e.g. sched_features or sysctls) to adjust wake-up placement behavior or increase weight enforcement in such scenarios?
Thank you for your help!
(Note: Using RT scheduling isn’t viable in the real-world version of this workload, so I’m specifically interested in fairness within CFS/EEVDF.)