Message-ID: <81418e43-22d6-9046-0179-b77e85234f4d@os.amperecomputing.com>
Date: Thu, 5 Feb 2026 10:52:42 -0800 (PST)
From: Shubhang Kaushik <shubhang@...amperecomputing.com>
To: Vincent Guittot <vincent.guittot@...aro.org>
cc: Christian Loehle <christian.loehle@....com>, linux-kernel@...r.kernel.org,
peterz@...radead.org, mingo@...hat.com, juri.lelli@...hat.com,
dietmar.eggemann@....com, kprateek.nayak@....com, pierre.gondois@....com
Subject: Re: [PATCHv2] sched/fair: Skip SCHED_IDLE rq for SCHED_IDLE task
On Thu, 5 Feb 2026, Vincent Guittot wrote:
> On Thu, 5 Feb 2026 at 01:00, Shubhang Kaushik
> <shubhang@...amperecomputing.com> wrote:
>>
>> On Tue, 3 Feb 2026, Christian Loehle wrote:
>>
>>> CPUs whose rq only have SCHED_IDLE tasks running are considered to be
>>> equivalent to truly idle CPUs during wakeup path. For fork and exec
>>> SCHED_IDLE is even preferred.
>>> This is based on the assumption that the SCHED_IDLE CPU is not in an
>>> idle state and might be in a higher P-state, allowing the task/wakee
>>> to run immediately without sharing the rq.
>>>
>>> However this assumption doesn't hold if the wakee has SCHED_IDLE policy
>>> itself, as it will share the rq with existing SCHED_IDLE tasks. In this
>>> case, we are better off continuing to look for a truly idle CPU.
>>>
>>> On an Intel Xeon 2-socket with 64 logical cores in total this yields
>>> for kernel compilation using SCHED_IDLE:
>>>
>>> +---------+----------------------+----------------------+--------+
>>> | workers | mainline (seconds) | patch (seconds) | delta% |
>>> +=========+======================+======================+========+
>>> | 1 | 4384.728 ± 21.085 | 3843.250 ± 16.235 | -12.35 |
>>> | 2 | 2242.513 ± 2.099 | 1971.696 ± 2.842 | -12.08 |
>>> | 4 | 1199.324 ± 1.823 | 1033.744 ± 1.803 | -13.81 |
>>> | 8 | 649.083 ± 1.959 | 559.123 ± 4.301 | -13.86 |
>>> | 16 | 370.425 ± 0.915 | 325.906 ± 4.623 | -12.02 |
>>> | 32 | 234.651 ± 2.255 | 217.266 ± 0.253 | -7.41 |
>>> | 64 | 202.286 ± 1.452 | 197.977 ± 2.275 | -2.13 |
>>> | 128 | 217.092 ± 1.687 | 212.164 ± 1.138 | -2.27 |
>>> +---------+----------------------+----------------------+--------+
>>>
>>> Signed-off-by: Christian Loehle <christian.loehle@....com>
>>
>> I’ve been testing this patch on an 80-core Ampere Altra (Neoverse-N1) and
>> the results look very solid. On these high-core-count ARM systems, we
>> definitely see the benefit of being pickier about where we place
>> SCHED_IDLE tasks.
>>
>> Treating an occupied SCHED_IDLE rq as idle seems to cause
>> unnecessary packing that shows up in the tail latency. By spreading these
>> background tasks to truly idle cores, I'm seeing a nice boost in both
>> background compilation and AI inference throughput.
>>
>> The reduction in sys time confirms that the domain balancing remains
>> stable despite the refactor to sched_idle_rq(rq) as you and Prateek
>> mentioned.
>>
>> 1. Background Kernel Compilation:
>>
>> I ran `time nice -n 19 make -j$nproc` to see how it handles a heavy
>
> nice -n 19 uses sched_other with prio 19 and not sched_idle so I'm
> curious how you can see a difference ?
> Or something is missing in your test description
> Or we have a bug somewhere
>
Okay, I realized I had used nice -n 19 (SCHED_OTHER) for the initial
build, so that run would not have exercised the SCHED_IDLE logic directly.
I did use chrt for the schbench runs, which is why the tail-latency wins
there were so consistent.

I've re-run the kernel build under the correct policy, using chrt --idle 0.
On Ampere Altra, the build time is essentially the same as mainline:

Metric   Mainline       Patched        Delta
Real       9m20.120s      9m18.472s    -1.6s
User     382m24.966s    380m41.716s    -1m43s
Sys      218m26.192s    218m44.908s    +18.7s

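For anyone reproducing this, the policy a task actually runs under can be
checked directly rather than inferred from the command line. Below is a
minimal sketch in C, assuming Linux with glibc; policy_name() and
make_self_sched_idle() are hypothetical helper names used only for
illustration and are not part of any existing tool.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE          /* glibc exposes SCHED_IDLE only with this */
#endif
#include <sched.h>

#ifndef SCHED_IDLE
#define SCHED_IDLE 5         /* Linux UAPI value; fallback if the header hides it */
#endif

/* Hypothetical helper: readable name for a scheduling policy number. */
const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_OTHER: return "SCHED_OTHER"; /* what nice -n 19 still uses */
    case SCHED_IDLE:  return "SCHED_IDLE";  /* what chrt --idle 0 selects */
    case SCHED_FIFO:  return "SCHED_FIFO";
    case SCHED_RR:    return "SCHED_RR";
    default:          return "other";
    }
}

/*
 * Hypothetical helper: move the calling thread to SCHED_IDLE.
 * Returns 0 on success, -1 with errno set on failure.
 */
int make_self_sched_idle(void)
{
    struct sched_param sp = { .sched_priority = 0 }; /* must be 0 for SCHED_IDLE */

    return sched_setscheduler(0, SCHED_IDLE, &sp);
}
```

Printing policy_name(sched_getscheduler(0)) before and after calling
make_self_sched_idle() shows the switch. Changing only the nice value, as
nice -n 19 does, leaves the policy at SCHED_OTHER.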
>> background load. We saved nearly 3 minutes of 'sys' time showing
>> lower scheduler overhead.
>>
>> Mainline (6.19.0-rc8):
>> real 9m28.403s
>> sys 219m21.591s
>>
>> Patched:
>> real 9m16.167s (-12.2s)
>> sys 216m28.323s (-2m53s)
>>
>> I was initially concerned about the impact on domain balancing, but the
>> significant reduction in 'sys' time during the kernel build confirms that
>> we aren't seeing any regressive balancing overhead.
>>
>> 2. AI Inference (llama-batched-bench):
>>
>> For background LLM inference, the patch consistently delivered about 8.7%
>> more throughput when we're running near core saturation.
>>
>> 51 Threads: 30.03 t/s (vs 27.62 on Mainline) -> +8.7%
>> 80 Threads: 27.20 t/s (vs 25.01 on Mainline) -> +8.7%
>>
>> 3. Scheduler Latency using schbench:
>>
>> The biggest win was in the p99.9 tail latency. Under a locked workload,
>> the latency spikes dropped significantly.
>> 4 Threads (Locking): 10085 us (vs 12421 us) -> -18.8%
>> 8 Threads (Locking): 9563 us (vs 11589 us) -> -17.5%
>>
>> The patch really helps clean up the noise for background tasks on these
>> large ARM platforms. Nice work.
>>
>> Tested-by: Shubhang Kaushik <shubhang@...amperecomputing.com>
>>
>> Regards,
>> Shubhang Kaushik
>>
>>> int cpu = rq->cpu;
>>> - int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
>>> + int busy = idle != CPU_IDLE && !sched_idle_rq(rq);
>>> unsigned long interval;
>>> struct sched_domain *sd;
>>> /* Earliest time when we have to do rebalance again */
>>> @@ -12299,7 +12305,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
>>> * state even if we migrated tasks. Update it.
>>> */
>>> idle = idle_cpu(cpu);
>>> - busy = !idle && !sched_idle_cpu(cpu);
>>> + busy = !idle && !sched_idle_rq(rq);
>>> }
>>> sd->last_balance = jiffies;
>>> interval = get_sd_balance_interval(sd, busy);
>>> --
>>> 2.34.1
>>>
>>>
>
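
For readers skimming the thread, the placement rule described in the commit
message can be modelled in user space roughly as follows. This is only a toy
sketch: struct toy_rq and the toy_* functions are hypothetical names, not
kernel code, and the real wakeup path weighs many more factors than this.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for a runqueue; not the kernel's struct rq. */
struct toy_rq {
    unsigned int nr_running;     /* all runnable tasks on this CPU */
    unsigned int nr_idle_policy; /* how many of those are SCHED_IDLE */
};

/* Truly idle: nothing runnable at all. */
bool toy_rq_is_idle(const struct toy_rq *rq)
{
    return rq->nr_running == 0;
}

/* Busy, but only with SCHED_IDLE tasks. */
bool toy_rq_only_sched_idle(const struct toy_rq *rq)
{
    return rq->nr_running > 0 && rq->nr_running == rq->nr_idle_policy;
}

/* Can this rq be treated as an "idle" target for the wakee? */
bool toy_rq_usable_for(const struct toy_rq *rq, bool wakee_is_sched_idle)
{
    if (toy_rq_is_idle(rq))
        return true;
    /* A SCHED_IDLE wakee would have to share the rq, so keep looking. */
    if (wakee_is_sched_idle)
        return false;
    /* Any other wakee can run right away on a SCHED_IDLE-only rq. */
    return toy_rq_only_sched_idle(rq);
}

/* Return the first usable CPU index, or -1 if none qualifies. */
int toy_pick_cpu(const struct toy_rq *rqs, int nr_cpus, bool wakee_is_sched_idle)
{
    for (int cpu = 0; cpu < nr_cpus; cpu++) {
        if (toy_rq_usable_for(&rqs[cpu], wakee_is_sched_idle))
            return cpu;
    }
    return -1;
}
```

With three CPUs where CPU 0 runs two SCHED_IDLE tasks, CPU 1 runs one normal
task and CPU 2 is empty, a normal wakee picks CPU 0 in this model while a
SCHED_IDLE wakee skips it and picks CPU 2.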