linux-kernel - Re: [PATCHv2] sched/fair: Skip SCHED_IDLE rq for SCHED

Message-ID: <3f17f9d6-d529-4ddc-97f2-8f5933d49f5e@arm.com>
Date: Fri, 6 Feb 2026 13:43:38 +0000
From: Christian Loehle <christian.loehle@....com>
To: Shubhang Kaushik <shubhang@...amperecomputing.com>,
 Vincent Guittot <vincent.guittot@...aro.org>
Cc: linux-kernel@...r.kernel.org, peterz@...radead.org, mingo@...hat.com,
 juri.lelli@...hat.com, dietmar.eggemann@....com, kprateek.nayak@....com,
 pierre.gondois@....com
Subject: Re: [PATCHv2] sched/fair: Skip SCHED_IDLE rq for SCHED_IDLE task

On 2/5/26 18:52, Shubhang Kaushik wrote:
> On Thu, 5 Feb 2026, Vincent Guittot wrote:
> 
>> On Thu, 5 Feb 2026 at 01:00, Shubhang Kaushik
>> <shubhang@...amperecomputing.com> wrote:
>>>
>>> On Tue, 3 Feb 2026, Christian Loehle wrote:
>>>
>>>> CPUs whose rq only has SCHED_IDLE tasks running are considered to be
>>>> equivalent to truly idle CPUs in the wakeup path. For fork and exec,
>>>> a SCHED_IDLE CPU is even preferred.
>>>> This is based on the assumption that the SCHED_IDLE CPU is not in an
>>>> idle state and might be in a higher P-state, allowing the task/wakee
>>>> to run immediately without sharing the rq.
>>>>
>>>> However this assumption doesn't hold if the wakee has SCHED_IDLE policy
>>>> itself, as it will share the rq with existing SCHED_IDLE tasks. In this
>>>> case, we are better off continuing to look for a truly idle CPU.
>>>>
>>>> On a 2-socket Intel Xeon with 64 logical cores in total, this yields
>>>> for kernel compilation using SCHED_IDLE:
>>>>
>>>> +---------+----------------------+----------------------+--------+
>>>> | workers | mainline (seconds)   | patch (seconds)      | delta% |
>>>> +=========+======================+======================+========+
>>>> |       1 | 4384.728 ± 21.085    | 3843.250 ± 16.235    | -12.35 |
>>>> |       2 | 2242.513 ± 2.099     | 1971.696 ± 2.842     | -12.08 |
>>>> |       4 | 1199.324 ± 1.823     | 1033.744 ± 1.803     | -13.81 |
>>>> |       8 |  649.083 ± 1.959     |  559.123 ± 4.301     | -13.86 |
>>>> |      16 |  370.425 ± 0.915     |  325.906 ± 4.623     | -12.02 |
>>>> |      32 |  234.651 ± 2.255     |  217.266 ± 0.253     |  -7.41 |
>>>> |      64 |  202.286 ± 1.452     |  197.977 ± 2.275     |  -2.13 |
>>>> |     128 |  217.092 ± 1.687     |  212.164 ± 1.138     |  -2.27 |
>>>> +---------+----------------------+----------------------+--------+
>>>>
>>>> Signed-off-by: Christian Loehle <christian.loehle@....com>
>>>
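For readers following along, the placement rule described in the commit
message can be illustrated with a toy model. This is only a sketch under
simplifying assumptions, not the kernel code or the patch itself: ToyRq,
pick_cpu() and the skip_sched_idle_for_idle_wakee flag are hypothetical
names, and the real wakeup path considers far more than a linear scan.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SCHED_OTHER = "SCHED_OTHER"
SCHED_IDLE = "SCHED_IDLE"


@dataclass
class ToyRq:
    """Hypothetical stand-in for a runqueue: just the policies of its tasks."""
    policies: List[str] = field(default_factory=list)

    def truly_idle(self) -> bool:
        # No runnable tasks at all.
        return not self.policies

    def only_sched_idle(self) -> bool:
        # At least one task, and every task is SCHED_IDLE.
        return bool(self.policies) and all(p == SCHED_IDLE for p in self.policies)


def pick_cpu(rqs: List[ToyRq], wakee_policy: str,
             skip_sched_idle_for_idle_wakee: bool) -> Optional[int]:
    """Return the first acceptable CPU index, or None if there is none.

    With the flag off, a SCHED_IDLE-only rq counts as idle for every wakee
    (the behaviour the commit message attributes to mainline). With the flag
    on, a SCHED_IDLE wakee keeps looking for a truly idle CPU instead.
    """
    for cpu, rq in enumerate(rqs):
        if rq.truly_idle():
            return cpu
        if rq.only_sched_idle():
            if skip_sched_idle_for_idle_wakee and wakee_policy == SCHED_IDLE:
                continue  # the wakee would have to share this rq
            return cpu
    return None


if __name__ == "__main__":
    rqs = [ToyRq([SCHED_OTHER]), ToyRq([SCHED_IDLE]), ToyRq([])]
    print(pick_cpu(rqs, SCHED_IDLE, False))
    print(pick_cpu(rqs, SCHED_IDLE, True))
```

In this model a SCHED_OTHER wakee is unaffected by the flag; only a
SCHED_IDLE wakee stops treating a SCHED_IDLE-only rq as idle.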
>>> I’ve been testing this patch on an 80-core Ampere Altra (Neoverse-N1) and
>>> the results look very solid. On these high-core-count ARM systems, we
>>> definitely see the benefit of being pickier about where we place
>>> SCHED_IDLE tasks.
>>>
>>> Treating an occupied SCHED_IDLE rq as idle seems to cause
>>> unnecessary packing that shows up in the tail latency. By spreading these
>>> background tasks to truly idle cores, I'm seeing a nice boost in both
>>> background compilation and AI inference throughput.
>>>
>>> The reduction in sys time confirms that the domain balancing remains
>>> stable despite the refactor to sched_idle_rq(rq) as you and Prateek
>>> mentioned.
>>>
>>> 1. Background Kernel Compilation:
>>>
>>> I ran `time nice -n 19 make -j$nproc` to see how it handles a heavy
>>
>> nice -n 19 uses SCHED_OTHER with prio 19 and not SCHED_IDLE, so I'm
>> curious how you can see a difference.
>> Either something is missing in your test description,
>> or we have a bug somewhere.
>>
> 
> Okay, I realized I had used nice -n 19 (SCHED_OTHER) for the initial
> build, which wouldn't have directly triggered the SCHED_IDLE logic.
> However, I did use chrt for the schbench runs, which is why those p99
> wins were so consistent.
> 
> I've re-run the kernel build using the correct chrt --idle 0 policy. On
> Ampere Altra, the throughput is along the same lines as mainline.
> 
> Metric  Mainline       Patched        Delta
> Real      9m 20.120s     9m 18.472s   -1.6s
> User    382m 24.966s   380m 41.716s   -1m 43s
> Sys     218m 26.192s   218m 44.908s   +18.7s
> 
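The nice -n 19 versus chrt --idle 0 mix-up above is easy to check from
inside the workload. The following is a minimal sketch, assuming Linux and
a Python interpreter that exposes os.sched_getscheduler(); the helper names
are hypothetical, and the numeric fallbacks are the Linux UAPI values for
SCHED_OTHER (0) and SCHED_IDLE (5).

```python
import os

# Linux policy numbers; fall back to the UAPI values if this Python build
# does not export the constants (an assumption for non-Linux builds).
SCHED_OTHER = getattr(os, "SCHED_OTHER", 0)
SCHED_IDLE = getattr(os, "SCHED_IDLE", 5)


def policy_name(policy: int) -> str:
    """Map a policy number to a readable name (only the two relevant here)."""
    names = {SCHED_OTHER: "SCHED_OTHER", SCHED_IDLE: "SCHED_IDLE"}
    return names.get(policy, f"policy {policy}")


def exercises_sched_idle_path(policy: int) -> bool:
    """True only for SCHED_IDLE; a niced SCHED_OTHER task does not qualify."""
    return policy == SCHED_IDLE


def current_policy() -> int:
    """Scheduling policy of the calling process (Linux only)."""
    return os.sched_getscheduler(0)


if __name__ == "__main__":
    pol = current_policy()
    print(policy_name(pol), "nice", os.nice(0),
          "SCHED_IDLE path:", exercises_sched_idle_path(pol))
```

Started under nice -n 19 this should report SCHED_OTHER with a nice value
of 19, while under chrt --idle 0 it should report SCHED_IDLE, which is the
only case the patch is meant to affect.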

Thanks for testing, Shubhang, although I find it a bit surprising that your
kernel compilation under SCHED_IDLE doesn't improve.
Are you running with CONFIG_SCHED_CLUSTER=y? I'll try to reproduce.
Anyway, at least you see a schbench improvement, so I'm assuming I can
keep your Tested-by?
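
To put the quoted Altra timings on the same relative scale as the delta%
column of the Xeon table, the time(1)-style values can be converted to
seconds. This is a small sketch only; parse_time() and delta() are
hypothetical helpers, and the input strings are copied from the table
quoted above.

```python
import re


def parse_time(text: str) -> float:
    """Convert a time(1)-style string such as '9m 20.120s' to seconds."""
    m = re.fullmatch(r"\s*(\d+)m\s*([\d.]+)s\s*", text)
    if m is None:
        raise ValueError(f"unrecognised time string: {text!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


def delta(mainline: str, patched: str):
    """Return (absolute delta in seconds, relative delta in percent)."""
    base = parse_time(mainline)
    diff = parse_time(patched) - base
    return diff, 100.0 * diff / base


if __name__ == "__main__":
    rows = {
        "Real": ("9m 20.120s", "9m 18.472s"),
        "User": ("382m 24.966s", "380m 41.716s"),
        "Sys": ("218m 26.192s", "218m 44.908s"),
    }
    for name, (mainline, patched) in rows.items():
        diff, pct = delta(mainline, patched)
        print(f"{name:5s} {diff:+10.3f}s {pct:+7.2f}%")
```

For the wall-clock row this works out to a change of roughly -0.3%, far
smaller than the 12-14% reductions reported for the Xeon at low worker
counts.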