linux-kernel - Re: [PATCHv2] sched/fair: Skip SCHED_IDLE rq for SCHED

Message-ID: <81418e43-22d6-9046-0179-b77e85234f4d@os.amperecomputing.com>
Date: Thu, 5 Feb 2026 10:52:42 -0800 (PST)
From: Shubhang Kaushik <shubhang@...amperecomputing.com>
To: Vincent Guittot <vincent.guittot@...aro.org>
cc: Christian Loehle <christian.loehle@....com>, linux-kernel@...r.kernel.org, 
    peterz@...radead.org, mingo@...hat.com, juri.lelli@...hat.com, 
    dietmar.eggemann@....com, kprateek.nayak@....com, pierre.gondois@....com
Subject: Re: [PATCHv2] sched/fair: Skip SCHED_IDLE rq for SCHED_IDLE task

On Thu, 5 Feb 2026, Vincent Guittot wrote:

> On Thu, 5 Feb 2026 at 01:00, Shubhang Kaushik
> <shubhang@...amperecomputing.com> wrote:
>>
>> On Tue, 3 Feb 2026, Christian Loehle wrote:
>>
>>> CPUs whose rq only have SCHED_IDLE tasks running are considered to be
>>> equivalent to truly idle CPUs during wakeup path. For fork and exec
>>> SCHED_IDLE is even preferred.
>>> This is based on the assumption that the SCHED_IDLE CPU is not in an
>>> idle state and might be in a higher P-state, allowing the task/wakee
>>> to run immediately without sharing the rq.
>>>
>>> However this assumption doesn't hold if the wakee has SCHED_IDLE policy
>>> itself, as it will share the rq with existing SCHED_IDLE tasks. In this
>>> case, we are better off continuing to look for a truly idle CPU.
>>>
>>> On an Intel Xeon 2-socket with 64 logical cores in total this yields
>>> for kernel compilation using SCHED_IDLE:
>>>
>>> +---------+----------------------+----------------------+--------+
>>> | workers | mainline (seconds)   | patch (seconds)      | delta% |
>>> +=========+======================+======================+========+
>>> |       1 | 4384.728 ± 21.085    | 3843.250 ± 16.235    | -12.35 |
>>> |       2 | 2242.513 ± 2.099     | 1971.696 ± 2.842     | -12.08 |
>>> |       4 | 1199.324 ± 1.823     | 1033.744 ± 1.803     | -13.81 |
>>> |       8 |  649.083 ± 1.959     |  559.123 ± 4.301     | -13.86 |
>>> |      16 |  370.425 ± 0.915     |  325.906 ± 4.623     | -12.02 |
>>> |      32 |  234.651 ± 2.255     |  217.266 ± 0.253     |  -7.41 |
>>> |      64 |  202.286 ± 1.452     |  197.977 ± 2.275     |  -2.13 |
>>> |     128 |  217.092 ± 1.687     |  212.164 ± 1.138     |  -2.27 |
>>> +---------+----------------------+----------------------+--------+
>>>
>>> Signed-off-by: Christian Loehle <christian.loehle@....com>
>>
>> I’ve been testing this patch on an 80-core Ampere Altra (Neoverse-N1) and
>> the results look very solid. On these high-core-count ARM systems, we
>> definitely see the benefit of being pickier about where we place
>> SCHED_IDLE tasks.
>>
>> Treating an occupied SCHED_IDLE rq as idle seems to cause
>> unnecessary packing that shows up in the tail latency. By spreading these
>> background tasks to truly idle cores, I'm seeing a nice boost in both
>> background compilation and AI inference throughput.
>>
>> The reduction in sys time confirms that the domain balancing remains
>> stable despite the refactor to sched_idle_rq(rq) as you and Prateek
>> mentioned.
>>
>> 1. Background Kernel Compilation:
>>
>> I ran `time nice -n 19 make -j$nproc` to see how it handles a heavy
>
> nice -n 19 uses sched_other with prio 19 and not sched_idle so I'm
> curious how you can see a difference ?
> Or something is missing in your test description
> Or we have a bug somewhere
>

Okay, I realized I had used nice -n 19 (SCHED_OTHER) for the initial
build, which would not have directly exercised the SCHED_IDLE logic.
However, I did use chrt for the schbench runs, which is why those p99
wins were so consistent.
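
For reference, here is a minimal userspace sketch of the difference between the two invocations. nice only changes the priority within SCHED_OTHER, while chrt --idle 0 changes the policy through sched_setscheduler(2). The helper names policy_name() and set_sched_idle() are made up for this illustration, and the fallback policy numbers are the Linux UAPI values, used only if the libc headers do not expose them.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>
#include <sys/types.h>

/* Fallbacks in case the libc headers hide these (Linux UAPI values). */
#ifndef SCHED_BATCH
#define SCHED_BATCH 3
#endif
#ifndef SCHED_IDLE
#define SCHED_IDLE 5
#endif

/* Hypothetical helper: map a policy number to a printable name. */
static const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_OTHER: return "SCHED_OTHER";
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    case SCHED_BATCH: return "SCHED_BATCH";
    case SCHED_IDLE:  return "SCHED_IDLE";
    default:          return "unknown";
    }
}

/*
 * Hypothetical helper: switch a task to SCHED_IDLE, roughly what
 * `chrt --idle 0 <cmd>` arranges before exec. pid 0 means the caller.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int set_sched_idle(pid_t pid)
{
    struct sched_param param;

    memset(&param, 0, sizeof(param));
    param.sched_priority = 0; /* must be 0 for SCHED_IDLE */
    return sched_setscheduler(pid, SCHED_IDLE, &param);
}
```

Calling policy_name(sched_getscheduler(0)) after nice(19) should still report SCHED_OTHER on a default task, whereas after a successful set_sched_idle(0) it should report SCHED_IDLE. Only the latter makes the task a SCHED_IDLE wakee in the sense of the patch.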

I've re-run the kernel build under SCHED_IDLE using chrt --idle 0. On
Ampere Altra, the build time is essentially on par with mainline:

Metric   Mainline       Patched        Delta
Real     9m 20.120s     9m 18.472s     -1.6s
User     382m 24.966s   380m 41.716s   -1m 43s
Sys      218m 26.192s   218m 44.908s   +18.7s
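
To put these absolute deltas on the same footing as the delta% column in the commit message, a small sketch of the relative-change arithmetic follows. The helper names to_seconds() and delta_percent() are hypothetical and exist only for this illustration. With the numbers above it gives roughly -0.29% for real, -0.45% for user and +0.14% for sys.

```c
#include <stdio.h>

/* Hypothetical helper: convert a "<minutes>m <seconds>s" reading to seconds. */
static double to_seconds(double minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

/*
 * Hypothetical helper: relative change of patched vs. mainline in percent.
 * Negative means the patched kernel took less time.
 */
static double delta_percent(double mainline, double patched)
{
    return (patched - mainline) / mainline * 100.0;
}

/* Print one row: label, both times in seconds, and the relative delta. */
static void print_delta(const char *label, double mainline, double patched)
{
    printf("%-5s %12.3f %12.3f %+7.2f%%\n",
           label, mainline, patched, delta_percent(mainline, patched));
}
```

For example, print_delta("Real", to_seconds(9, 20.120), to_seconds(9, 18.472)) prints the real-time row. The same delta_percent() applied to the 1-worker row of the Xeon table (4384.728 s vs 3843.250 s) gives about -12.35%, matching the commit message.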

>> background load. We saved nearly 3 minutes of 'sys' time showing
>> lower scheduler overhead.
>>
>> Mainline (6.19.0-rc8):
>> real 9m28.403s
>> sys 219m21.591s
>>
>> Patched:
>> real 9m16.167s (-12.2s)
>> sys 216m28.323s (-2m53s)
>>
>> I was initially concerned about the impact on domain balancing, but the
>> significant reduction in 'sys' time during the kernel build confirms that
>> we aren't seeing any regressive balancing overhead.
>>
>> 2. AI Inference (llama-batched-bench):
>>
>> For background LLM inference, the patch consistently delivered about 8.7%
>> more throughput when we're running near core saturation.
>>
>> 51 Threads: 30.03 t/s (vs 27.62 on Mainline) -> +8.7%
>> 80 Threads: 27.20 t/s (vs 25.01 on Mainline) -> +8.7%
>>
>> 3. Scheduler Latency using schbench:
>>
>> The biggest win was in the p99.9 tail latency. Under a locked workload,
>> the latency spikes dropped significantly.
>> 4 Threads (Locking): 10085 us (vs 12421 us) -> -18.8%
>> 8 Threads (Locking): 9563 us (vs 11589 us) -> -17.5%
>>
>> The patch really helps clean up the noise for background tasks on these
>> large ARM platforms. Nice work.
>>
>> Tested-by: Shubhang Kaushik <shubhang@...amperecomputing.com>
>>
>> Regards,
>> Shubhang Kaushik
>>
>>>       int cpu = rq->cpu;
>>> -     int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
>>> +     int busy = idle != CPU_IDLE && !sched_idle_rq(rq);
>>>       unsigned long interval;
>>>       struct sched_domain *sd;
>>>       /* Earliest time when we have to do rebalance again */
>>> @@ -12299,7 +12305,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>>>                                * state even if we migrated tasks. Update it.
>>>                                */
>>>                               idle = idle_cpu(cpu);
>>> -                             busy = !idle && !sched_idle_cpu(cpu);
>>> +                             busy = !idle && !sched_idle_rq(rq);
>>>                       }
>>>                       sd->last_balance = jiffies;
>>>                       interval = get_sd_balance_interval(sd, busy);
>>> --
>>> 2.34.1
>>>
>>>
>