Message-ID: <CAKfTPtBEF92wUPsBF25ye3Dg5gUJr_giXcX5FSDF5RAo6dtS2w@mail.gmail.com>
Date: Mon, 4 Aug 2025 17:56:00 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Carl-Elliott Bilodeau-Savaria <carl-elliott.bilodeau-savaria@...l.mcgill.ca>
Cc: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"peterz@...radead.org" <peterz@...radead.org>, "mingo@...hat.com" <mingo@...hat.com>,
"juri.lelli@...hat.com" <juri.lelli@...hat.com>,
"dietmar.eggemann@....com" <dietmar.eggemann@....com>, "rostedt@...dmis.org" <rostedt@...dmis.org>,
"bsegall@...gle.com" <bsegall@...gle.com>, "mgorman@...e.de" <mgorman@...e.de>,
"vschneid@...hat.com" <vschneid@...hat.com>, "sched@...r.kernel.org" <sched@...r.kernel.org>
Subject: Re: sched: cgroup cpu.weight unfairness for intermittent tasks on wake-up
Hi Carl-Elliott,
On Sat, 26 Jul 2025 at 22:59, Carl-Elliott Bilodeau-Savaria
<carl-elliott.bilodeau-savaria@...l.mcgill.ca> wrote:
>
> Hi everyone,
>
> Apologies for the noise. I'm gently pinging on this scheduling question from about 10 days ago as it may have been missed. I have now added the scheduler mailing list and the relevant maintainers to the CC list.
>
> I've also created a small GitHub repo to reproduce the issue: https://github.com/normal-account/sched-wakeup-locality-test
>
> Any insights would be greatly appreciated.
>
> Thanks,
> Carl-Elliott
>
> --- [Original Email Below] ---
>
> Hi sched maintainers,
>
> I'm observing a CPU fairness issue in kernel 6.14 related to intermittent ("bursty") workloads under cgroup v2 with cpu.weight, where tasks do not receive CPU time proportional to their configured weights.
>
>
> SYSTEM & TEST SETUP
> -------------------------
>
> System Details:
> - CPU: Intel Core i9-9980HK (8 cores, 16 threads, single L3 cache).
> - CONFIG_PREEMPT=y
> - CPU governor: performance
> - SMT: Enabled
>
> Workloads:
> - continuous-burn: A simple, non-stop while(1) loop.
> - intermittent-burn: A loop that burns CPU for 3 seconds, then sleeps for 3 seconds.
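>
> A minimal single-file sketch of what the two workloads boil down to (not the exact code from the repo above; the mode flag is only for illustration):
>
>     /* burn.c - "continuous" mode spins forever; pass "intermittent" for 3 s burn / 3 s sleep */
>     #include <stdio.h>
>     #include <string.h>
>     #include <time.h>
>     #include <unistd.h>
>
>     int main(int argc, char **argv)
>     {
>             int intermittent = (argc > 1 && !strcmp(argv[1], "intermittent"));
>             unsigned long iters = 0;
>
>             for (;;) {
>                     time_t end = time(NULL) + 3;
>                     while (!intermittent || time(NULL) < end)
>                             iters++;                /* loop iterations = throughput metric */
>                     fprintf(stderr, "iters=%lu\n", iters);
>                     sleep(3);
>             }
>     }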
>
> Cgroup Configuration:
>
> parent/ (cpuset.cpus="0-1")
> ├── lw/ (cpu.weight=1)
> │ └── 1x continuous-burn process
> └── hw/ (cpu.weight=10000)
> └── 2x intermittent-burn processes
>
> The goal is to have the two intermittent processes in the hw group strongly prioritized over the single continuous process in the lw group on CPUs 0 and 1.
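>
> For completeness, a sketch of how this hierarchy can be created (cgroup v2 mounted at /sys/fs/cgroup; this helper is only illustrative, not taken from the repo):
>
>     /* setup.c - create parent/{lw,hw} with the cpuset and weights shown above */
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <string.h>
>     #include <sys/stat.h>
>     #include <unistd.h>
>
>     static void put(const char *path, const char *val)
>     {
>             int fd = open(path, O_WRONLY);
>             if (fd < 0 || write(fd, val, strlen(val)) < 0)
>                     perror(path);
>             if (fd >= 0)
>                     close(fd);
>     }
>
>     int main(void)
>     {
>             /* delegate the cpu and cpuset controllers down to the children */
>             put("/sys/fs/cgroup/cgroup.subtree_control", "+cpu +cpuset");
>             mkdir("/sys/fs/cgroup/parent", 0755);
>             put("/sys/fs/cgroup/parent/cgroup.subtree_control", "+cpu");
>             put("/sys/fs/cgroup/parent/cpuset.cpus", "0-1");
>             mkdir("/sys/fs/cgroup/parent/lw", 0755);
>             mkdir("/sys/fs/cgroup/parent/hw", 0755);
>             put("/sys/fs/cgroup/parent/lw/cpu.weight", "1");
>             put("/sys/fs/cgroup/parent/hw/cpu.weight", "10000");
>             /* the burn processes are then attached by writing their PIDs
>                into parent/lw/cgroup.procs and parent/hw/cgroup.procs */
>             return 0;
>     }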
>
>
> PROBLEM SCENARIO & ANALYSIS
> -------------------------------------
>
> The issue stems from the scheduler's wake-up path logic. Here is a typical sequence of events that leads to the unfairness.
>
> 1. The intermittent-0 process, previously running on CPU 0, finishes its burst and goes to sleep.
> CPU 0 rq: [ (idle) ]
> CPU 1 rq: [ continuous-1 (running) ]
> (Sleeping tasks: intermittent-0, intermittent-1)
>
> 2. intermittent-1 wakes up. Its previous CPU (CPU 1) is busy, so it is placed on CPU 0 (idle) by `select_idle_sibling()`:
> CPU 0 rq: [ intermittent-1 (running) ]
> CPU 1 rq: [ continuous-1 (running) ]
> (Sleeping tasks: intermittent-0)
>
> 3. Finally, intermittent-0 wakes up. No CPUs are idle, so it's placed back on its previous CPU's runqueue (CPU 0), where it has to wait for intermittent-1.
> CPU 0 rq: [ intermittent-1 (running), intermittent-0 (waiting) ]
> CPU 1 rq: [ continuous-1 (running) ]
>
> Now, both high-weight tasks are competing for CPU 0, while the low-weight task runs unopposed on CPU 1.
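>
> (For reference, this matches my understanding of the wake-up fast path; very roughly, and ignoring SMT/idle-core details, the idle-CPU selection behaves like the sketch below. The helpers are pseudo-code, not the actual fair.c interfaces; the point is only that cgroup weights are not consulted at this stage.)
>
>     /* grossly simplified picture of select_idle_sibling()-style placement */
>     static int pick_wakeup_cpu(int prev, int target)
>     {
>             int cpu;
>
>             if (cpu_is_idle(target))                /* step 2: CPU 0 was idle */
>                     return target;
>             if (cpu_is_idle(prev))
>                     return prev;
>             for_each_allowed_cpu(cpu)               /* scan the LLC, here restricted to CPUs 0-1 */
>                     if (cpu_is_idle(cpu))
>                             return cpu;
>             return target;                          /* step 3: nothing idle, fall back to target (== prev here) */
>     }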
>
> This unfair state can persist until periodic load balancing eventually migrates one of the tasks, but due to the frequent sleep/wake pattern, the initial placement decision has a disproportionately large effect.
How long does it take for the periodic load balance to migrate the task
back to the right CPU?
>
>
> OBSERVED IMPACT
> ---------------------
>
> With the continuous-burn task present, the combined throughput (measured via loop iterations) of the two intermittent-burn tasks drops by ~32% compared to running them alone.
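>
> For scale (back-of-envelope, assuming a constant per-iteration cost and that the 1:10000 weight makes the low-weight task's interference negligible): if the two intermittent tasks run on separate CPUs for a fraction (1 - f) of each burst and share a single CPU for the remaining fraction f, their combined rate scales as (2 - f)/2, i.e. a drop of f/2. A ~32% drop thus corresponds to f ~= 0.64, i.e. the tasks stacked on one CPU for roughly 2 of the 3 seconds of each burst.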
32% is quite large for a 3 sec running / 3 sec sleeping pattern. It looks
like the periodic load balance takes too long to fix the unfairness.
Do you have more details on your topology?
>
> This results in the low-weight task receiving a disproportionate share of CPU time, contrary to the cpu.weight configuration.
>
>
> QUESTIONS
> -------------
>
> I understand that EEVDF's wake-up placement logic favors idle CPUs to minimize latency, which makes sense in general.
>
> However, in this mixed-workload scenario, that logic seems to override cgroup fairness expectations.
> Wake-up placement leads to high-weight tasks dog-piling on one CPU, leaving a low-weight task uncontended on another.
>
> - Is this a known issue / an expected trade-off under EEVDF's design?
> - Are there any existing tunables (e.g. sched_features or sysctls) to adjust wake-up placement behavior or increase weight enforcement in such scenarios?
>
>
> Thank you for your help!
>
> (Note: Using RT scheduling isn’t viable in the real-world version of this workload, so I’m specifically interested in fairness within CFS/EEVDF.)
>
> ________________________________________
> From: Carl-Elliott Bilodeau-Savaria
> Sent: Tuesday, July 15, 2025 6:44 PM
> To: linux-kernel@...r.kernel.org
> Cc: peterz@...radead.org
> Subject: sched: cgroup cpu.weight unfairness for intermittent tasks on wake-up