Message-ID:
 <YQXPR01MB511370FEF51299CEFBF7A511C257A@YQXPR01MB5113.CANPRD01.PROD.OUTLOOK.COM>
Date: Tue, 15 Jul 2025 22:44:09 +0000
From: Carl-Elliott Bilodeau-Savaria
	<carl-elliott.bilodeau-savaria@...l.mcgill.ca>
To: "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC: "peterz@...radead.org" <peterz@...radead.org>
Subject: sched: cgroup cpu.weight unfairness for intermittent tasks on wake-up

Hi sched maintainers,

I'm observing a CPU fairness issue in kernel 6.14 related to intermittent ("bursty") workloads under cgroup v2 with cpu.weight, where tasks do not receive CPU time proportional to their configured weights.


SYSTEM & TEST SETUP
-------------------------

System Details:
    - CPU: Intel Core i9-9980HK (8 cores, 16 threads, single L3 cache).
    - CONFIG_PREEMPT=y
    - CPU governor: performance
    - SMT: Enabled

Workloads:
    - continuous-burn: A simple, non-stop while(1) loop.
    - intermittent-burn: A loop that burns CPU for 3 seconds, then sleeps for 3 seconds.
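
For reference, both workloads boil down to something like the sketch below (a minimal sketch, not the exact reproducer; the iteration counter is the throughput metric referred to under OBSERVED IMPACT):

/* burn.c - minimal sketch of the two test workloads.
 * Build: gcc -O2 -o burn burn.c
 *   ./burn      continuous-burn:   never sleeps
 *   ./burn 3    intermittent-burn: burn 3 s, sleep 3 s, repeat
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

static double now(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
        double burst = argc > 1 ? atof(argv[1]) : 0.0;  /* 0 = continuous */
        unsigned long long iters = 0;                   /* throughput metric */

        for (;;) {
                double t0 = now();

                do {
                        iters++;                        /* the "work" */
                } while (burst == 0.0 || now() - t0 < burst);

                fprintf(stderr, "[%d] %llu iterations so far\n",
                        getpid(), iters);
                sleep((unsigned int)burst);             /* off phase */
        }
}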

Cgroup Configuration:

   parent/ (cpuset.cpus="0-1")
       ├── lw/ (cpu.weight=1)
       │    └── 1x continuous-burn process
       └── hw/ (cpu.weight=10000)
            └── 2x intermittent-burn processes

The goal is to have the two intermittent processes in the hw group strongly prioritized over the single continuous process in the lw group on CPUs 0 and 1.
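
For completeness, the hierarchy above corresponds to writes along these lines (a minimal C sketch; it assumes cgroup v2 is mounted at /sys/fs/cgroup with the cpu and cpuset controllers already enabled in the root's cgroup.subtree_control, and most error handling is omitted):

/* cgsetup.c - sketch of the cgroup hierarchy above (run as root). */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

static void put(const char *path, const char *val)
{
        int fd = open(path, O_WRONLY);

        if (fd < 0 || write(fd, val, strlen(val)) < 0)
                perror(path);
        if (fd >= 0)
                close(fd);
}

int main(void)
{
        mkdir("/sys/fs/cgroup/parent", 0755);
        mkdir("/sys/fs/cgroup/parent/lw", 0755);
        mkdir("/sys/fs/cgroup/parent/hw", 0755);

        /* expose cpu/cpuset to the children, confine parent to CPUs 0-1 */
        put("/sys/fs/cgroup/parent/cgroup.subtree_control", "+cpu +cpuset");
        put("/sys/fs/cgroup/parent/cpuset.cpus", "0-1");

        /* weights from the diagram above */
        put("/sys/fs/cgroup/parent/lw/cpu.weight", "1");
        put("/sys/fs/cgroup/parent/hw/cpu.weight", "10000");

        /* the burn processes are then attached by writing their PIDs
         * to parent/lw/cgroup.procs and parent/hw/cgroup.procs */
        return 0;
}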


PROBLEM SCENARIO & ANALYSIS
-------------------------------------

The issue stems from the scheduler's wake-up path logic. Here is a typical sequence of events that leads to the unfairness. 

1. The intermittent-0 process, previously running on CPU 0, finishes its burst and goes to sleep. 
        CPU 0 rq: [ (idle) ]
        CPU 1 rq: [ continuous-1 (running) ]
        (Sleeping tasks: intermittent-0, intermittent-1)

2. intermittent-1 wakes up. Its previous CPU (CPU 1) is busy, so it is placed on CPU 0 (idle) by `select_idle_sibling()`:
        CPU 0 rq: [ intermittent-1 (running) ]
        CPU 1 rq: [ continuous-1 (running) ]
        (Sleeping tasks: intermittent-0)

3. Finally, intermittent-0 wakes up. No CPUs are idle, so it's placed back on its previous CPU's runqueue (CPU 0), where it has to wait for intermittent-1.
        CPU 0 rq: [ intermittent-1 (running), intermittent-0 (waiting) ]
        CPU 1 rq: [ continuous-1 (running) ]

Now, both high-weight tasks are competing for CPU 0, while the low-weight task runs unopposed on CPU 1.

This unfair state can persist until periodic load balancing eventually migrates one of the tasks, but due to the frequent sleep/wake pattern, the initial placement decision has a disproportionately large effect.
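
The dog-pile on CPU 0 is easy to confirm from user space by sampling which CPU each task last ran on (field 39, "processor", of /proc/<pid>/stat). A rough sketch of one way to watch placement (not part of the reproducer itself):

/* lastcpu.c - periodically print the CPU each given PID last ran on.
 * Usage: ./lastcpu <pid>...
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        for (;;) {
                for (int i = 1; i < argc; i++) {
                        char path[64], buf[1024], *p, *tok;
                        int field = 2, cpu = -1;
                        FILE *f;

                        snprintf(path, sizeof(path), "/proc/%s/stat", argv[i]);
                        f = fopen(path, "r");
                        if (!f)
                                continue;
                        if (!fgets(buf, sizeof(buf), f)) {
                                fclose(f);
                                continue;
                        }
                        fclose(f);

                        /* skip past the comm field, which may contain spaces */
                        p = strrchr(buf, ')');
                        if (!p)
                                continue;
                        for (tok = strtok(p + 2, " "); tok;
                             tok = strtok(NULL, " ")) {
                                if (++field == 39) {
                                        cpu = atoi(tok);
                                        break;
                                }
                        }
                        printf("pid %s: last ran on CPU %d\n", argv[i], cpu);
                }
                printf("\n");
                sleep(1);
        }
}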


OBSERVED IMPACT
---------------------

With the continuous-burn task present, the combined throughput (measured via loop iterations) of the two intermittent-burn tasks drops by ~32% compared to running them alone. 

This results in the low-weight task receiving a disproportionate share of CPU time, contrary to the cpu.weight configuration.


QUESTIONS
-------------

I understand that EEVDF's wake-up placement logic favors idle CPUs to minimize latency, which makes sense in general.

However, in this mixed-workload scenario, that logic seems to override cgroup fairness expectations.
Wake-up placement leads to high-weight tasks dog-piling on one CPU, leaving a low-weight task uncontended on another.

    - Is this a known issue or an expected trade-off under EEVDF's design?
    - Are there existing tunables (e.g. sched_features or sysctls) to adjust wake-up placement behavior or strengthen weight enforcement in such scenarios?


Thank you for your help!

(Note: Using RT scheduling isn’t viable in the real-world version of this workload, so I’m specifically interested in fairness within CFS/EEVDF.)
