linux-kernel - Re: [PATCHv2] sched/fair: Skip SCHED_IDLE rq for SCHED

Message-ID: <d962c3ca-8a3f-4fe7-507e-1158f755e3cd@os.amperecomputing.com>
Date: Fri, 6 Feb 2026 10:50:13 -0800 (PST)
From: Shubhang Kaushik <shubhang@...amperecomputing.com>
To: Christian Loehle <christian.loehle@....com>, 
    Vincent Guittot <vincent.guittot@...aro.org>
cc: linux-kernel@...r.kernel.org, peterz@...radead.org, mingo@...hat.com, 
    juri.lelli@...hat.com, dietmar.eggemann@....com, kprateek.nayak@....com, 
    pierre.gondois@....com
Subject: Re: [PATCHv2] sched/fair: Skip SCHED_IDLE rq for SCHED_IDLE task

On Fri, 6 Feb 2026, Christian Loehle wrote:

> On 2/5/26 18:52, Shubhang Kaushik wrote:
>> On Thu, 5 Feb 2026, Vincent Guittot wrote:
>>
>>> On Thu, 5 Feb 2026 at 01:00, Shubhang Kaushik
>>> <shubhang@...amperecomputing.com> wrote:
>>>>
>>>> On Tue, 3 Feb 2026, Christian Loehle wrote:
>>>>
>>>>> CPUs whose rq only have SCHED_IDLE tasks running are considered to be
>>>>> equivalent to truly idle CPUs during wakeup path. For fork and exec
>>>>> SCHED_IDLE is even preferred.
>>>>> This is based on the assumption that the SCHED_IDLE CPU is not in an
>>>>> idle state and might be in a higher P-state, allowing the task/wakee
>>>>> to run immediately without sharing the rq.
>>>>>
>>>>> However this assumption doesn't hold if the wakee has SCHED_IDLE policy
>>>>> itself, as it will share the rq with existing SCHED_IDLE tasks. In this
>>>>> case, we are better off continuing to look for a truly idle CPU.
>>>>>
>>>>> On an Intel Xeon 2-socket with 64 logical cores in total this yields
>>>>> for kernel compilation using SCHED_IDLE:
>>>>>
>>>>> +---------+----------------------+----------------------+--------+
>>>>> | workers | mainline (seconds)   | patch (seconds)      | delta% |
>>>>> +=========+======================+======================+========+
>>>>> |       1 | 4384.728 ± 21.085    | 3843.250 ± 16.235    | -12.35 |
>>>>> |       2 | 2242.513 ± 2.099     | 1971.696 ± 2.842     | -12.08 |
>>>>> |       4 | 1199.324 ± 1.823     | 1033.744 ± 1.803     | -13.81 |
>>>>> |       8 |  649.083 ± 1.959     |  559.123 ± 4.301     | -13.86 |
>>>>> |      16 |  370.425 ± 0.915     |  325.906 ± 4.623     | -12.02 |
>>>>> |      32 |  234.651 ± 2.255     |  217.266 ± 0.253     |  -7.41 |
>>>>> |      64 |  202.286 ± 1.452     |  197.977 ± 2.275     |  -2.13 |
>>>>> |     128 |  217.092 ± 1.687     |  212.164 ± 1.138     |  -2.27 |
>>>>> +---------+----------------------+----------------------+--------+
>>>>>
>>>>> Signed-off-by: Christian Loehle <christian.loehle@....com>
>>>>
>>>> I’ve been testing this patch on an 80-core Ampere Altra (Neoverse-N1) and
>>>> the results look very solid. On these high-core-count ARM systems, we
>>>> definitely see the benefit of being pickier about where we place
>>>> SCHED_IDLE tasks.
>>>>
>>>> Treating an occupied SCHED_IDLE rq as idle seems to cause
>>>> unnecessary packing that shows up in the tail latency. By spreading these
>>>> background tasks to truly idle cores, I'm seeing a nice boost in both
>>>> background compilation and AI inference throughput.
>>>>
>>>> The reduction in sys time confirms that the domain balancing remains
>>>> stable despite the refactor to sched_idle_rq(rq) as you and Prateek
>>>> mentioned.
>>>>
>>>> 1. Background Kernel Compilation:
>>>>
>>>> I ran `time nice -n 19 make -j$nproc` to see how it handles a heavy
>>>
>>> nice -n 19 uses SCHED_OTHER with prio 19, not SCHED_IDLE, so I'm
>>> curious how you can see a difference.
>>> Either something is missing in your test description,
>>> or we have a bug somewhere.
>>>
>>
>> Okay, I realized I had used nice -n 19 (SCHED_OTHER) for the initial
>> build, which wouldn't have directly triggered the SCHED_IDLE logic.
>> However, I did use chrt for the schbench runs, which is why those p99
>> wins were so consistent.
>>
>> I've re-run the kernel build under the correct policy using
>> chrt --idle 0. On Ampere Altra, the throughput is along the same lines
>> as mainline.
>>
>> Metric  Mainline       Patched        Delta
>> Real      9m 20.120s     9m 18.472s   -1.6s
>> User    382m 24.966s   380m 41.716s   -1m 43s
>> Sys     218m 26.192s   218m 44.908s   +18.7s
>>
>
> Thanks for testing Shubhang, although I find it a bit surprising that your
> kernel compilation under SCHED_IDLE doesn't improve.
> Are you running with CONFIG_SCHED_CLUSTER=y? I'll try to reproduce.
> Anyway, at least you see a schbench improvement, so I'm assuming I can
> keep your Tested-by?
>
>

Yes, that's right: CONFIG_SCHED_CLUSTER=y is enabled. That likely
explains why the build throughput isn't shifting as much as in your Xeon
results, though the drop in user time still suggests better
efficiency.
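
For anyone reproducing the build numbers: the earlier mix-up came from
nice -n 19 leaving the task in SCHED_OTHER, whereas chrt --idle 0 switches
it to SCHED_IDLE. Below is a minimal userspace sketch of how a process
could switch itself and check its own policy. The helper names
(policy_name, enter_sched_idle, current_policy_name) are hypothetical and
only for illustration; the fallback #defines use the Linux policy numbers
in case the libc headers do not expose them.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>

/* Fallbacks in case <sched.h> was included without GNU extensions.
 * These are the Linux policy numbers. */
#ifndef SCHED_BATCH
#define SCHED_BATCH 3
#endif
#ifndef SCHED_IDLE
#define SCHED_IDLE 5
#endif

/* Hypothetical helper: map a policy number to a readable name. */
static const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_OTHER: return "SCHED_OTHER";
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    case SCHED_BATCH: return "SCHED_BATCH";
    case SCHED_IDLE:  return "SCHED_IDLE";
    default:          return "unknown";
    }
}

/* Hypothetical helper: move the calling thread to SCHED_IDLE.
 * Returns 0 on success, -1 on failure (errno is set). */
static int enter_sched_idle(void)
{
    struct sched_param param;

    memset(&param, 0, sizeof(param));
    param.sched_priority = 0; /* must be 0 for SCHED_IDLE */
    return sched_setscheduler(0, SCHED_IDLE, &param);
}

/* Hypothetical helper: name of the calling thread's current policy. */
static const char *current_policy_name(void)
{
    return policy_name(sched_getscheduler(0));
}
```

Calling enter_sched_idle() before exec'ing the workload should be roughly
what chrt --idle 0 does for a command; a process that was only reniced
would still report SCHED_OTHER from current_policy_name().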

Feel free to keep the Tested-by tag.
Tested-by: Shubhang Kaushik <shubhang@...amperecomputing.com>
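
For readers following the thread, the placement rule described in the
commit message can be sketched as a small predicate. This is a toy model
under stated assumptions, not the patch itself: struct toy_rq and the
toy_* helpers are hypothetical names, and the real code works on the
kernel's struct rq in kernel/sched/fair.c.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a runqueue;
 * not the kernel's struct rq. */
struct toy_rq {
    unsigned int nr_running;      /* all runnable tasks on this CPU */
    unsigned int idle_nr_running; /* runnable SCHED_IDLE tasks */
};

/* True when nothing at all is runnable on the CPU. */
static bool toy_cpu_truly_idle(const struct toy_rq *rq)
{
    return rq->nr_running == 0;
}

/* True when the CPU is busy, but only with SCHED_IDLE tasks. */
static bool toy_rq_only_sched_idle(const struct toy_rq *rq)
{
    return rq->nr_running > 0 &&
           rq->nr_running == rq->idle_nr_running;
}

/*
 * Should the wakeup path accept this CPU as "as good as idle"?
 *
 * A truly idle CPU is always acceptable. A CPU running only SCHED_IDLE
 * tasks is acceptable for a normal wakee, which can run right away
 * ahead of them. A SCHED_IDLE wakee would have to share the rq with
 * those tasks, so the search should keep looking for a truly idle CPU.
 */
static bool toy_acceptable_target(const struct toy_rq *rq,
                                  bool wakee_is_sched_idle)
{
    if (toy_cpu_truly_idle(rq))
        return true;
    if (wakee_is_sched_idle)
        return false;
    return toy_rq_only_sched_idle(rq);
}
```

In this model, a CPU with two SCHED_IDLE tasks is an acceptable target
for a SCHED_OTHER wakee but not for a SCHED_IDLE one, which is the
behaviour change the patch is after.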