Message-ID: <CAKfTPtDVn=VwhfNSsws5BtBe9x98Y0N6m3MfVtMd=+5NPVUrMA@mail.gmail.com>
Date: Thu, 5 Feb 2026 08:25:55 +0100
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Shubhang Kaushik <shubhang@...amperecomputing.com>
Cc: Christian Loehle <christian.loehle@....com>, linux-kernel@...r.kernel.org, 
	peterz@...radead.org, mingo@...hat.com, juri.lelli@...hat.com, 
	dietmar.eggemann@....com, kprateek.nayak@....com, pierre.gondois@....com
Subject: Re: [PATCHv2] sched/fair: Skip SCHED_IDLE rq for SCHED_IDLE task

On Thu, 5 Feb 2026 at 01:00, Shubhang Kaushik
<shubhang@...amperecomputing.com> wrote:
>
> On Tue, 3 Feb 2026, Christian Loehle wrote:
>
> > CPUs whose rq only have SCHED_IDLE tasks running are considered to be
> > equivalent to truly idle CPUs during wakeup path. For fork and exec
> > SCHED_IDLE is even preferred.
> > This is based on the assumption that the SCHED_IDLE CPU is not in an
> > idle state and might be in a higher P-state, allowing the task/wakee
> > to run immediately without sharing the rq.
> >
> > However, this assumption doesn't hold if the wakee has SCHED_IDLE policy
> > itself, as it will share the rq with existing SCHED_IDLE tasks. In this
> > case, we are better off continuing to look for a truly idle CPU.
> >
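
A hedged sketch of the idea above, for readers following along; the
helper name idle_enough_for() is invented for illustration and this is
not the actual patch hunk, but available_idle_cpu(), sched_idle_cpu()
and task_has_idle_policy() are existing kernel/sched helpers:

/*
 * A CPU counts as "idle enough" for wakee @p when it is truly idle,
 * or when its rq runs only SCHED_IDLE tasks and @p itself is not
 * SCHED_IDLE; otherwise @p would just share the rq with the existing
 * SCHED_IDLE tasks and is better served by a truly idle CPU.
 */
static inline bool idle_enough_for(struct task_struct *p, int cpu)
{
	if (available_idle_cpu(cpu))
		return true;

	return sched_idle_cpu(cpu) && !task_has_idle_policy(p);
}
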
> > On an Intel Xeon 2-socket system with 64 logical cores in total,
> > this yields the following for kernel compilation using SCHED_IDLE:
> >
> > +---------+----------------------+----------------------+--------+
> > | workers | mainline (seconds)   | patch (seconds)      | delta% |
> > +=========+======================+======================+========+
> > |       1 | 4384.728 ± 21.085    | 3843.250 ± 16.235    | -12.35 |
> > |       2 | 2242.513 ± 2.099     | 1971.696 ± 2.842     | -12.08 |
> > |       4 | 1199.324 ± 1.823     | 1033.744 ± 1.803     | -13.81 |
> > |       8 |  649.083 ± 1.959     |  559.123 ± 4.301     | -13.86 |
> > |      16 |  370.425 ± 0.915     |  325.906 ± 4.623     | -12.02 |
> > |      32 |  234.651 ± 2.255     |  217.266 ± 0.253     |  -7.41 |
> > |      64 |  202.286 ± 1.452     |  197.977 ± 2.275     |  -2.13 |
> > |     128 |  217.092 ± 1.687     |  212.164 ± 1.138     |  -2.27 |
> > +---------+----------------------+----------------------+--------+
> >
> > Signed-off-by: Christian Loehle <christian.loehle@....com>
>
> I’ve been testing this patch on an 80-core Ampere Altra (Neoverse-N1) and
> the results look very solid. On these high-core-count ARM systems, we
> definitely see the benefit of being pickier about where we place
> SCHED_IDLE tasks.
>
> Treating an occupied SCHED_IDLE rq as idle seems to cause
> unnecessary packing that shows up in the tail latency. By spreading these
> background tasks to truly idle cores, I'm seeing a nice boost in both
> background compilation and AI inference throughput.
>
> The reduction in sys time confirms that domain balancing remains
> stable despite the refactor to sched_idle_rq(rq) that you and Prateek
> mentioned.
>
> 1. Background Kernel Compilation:
>
> I ran `time nice -n 19 make -j$(nproc)` to see how it handles a heavy

nice -n 19 uses SCHED_OTHER with nice 19, not SCHED_IDLE, so I'm
curious how you can see a difference.
Either something is missing in your test description,
or we have a bug somewhere.
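
For reference, SCHED_IDLE has to be requested explicitly, e.g. with
chrt -i 0 <cmd> or via sched_setscheduler(). A minimal wrapper sketch
(the file name sched-idle-wrap.c is hypothetical):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	/* SCHED_IDLE requires a static priority of 0. */
	struct sched_param sp = { .sched_priority = 0 };

	if (argc < 2) {
		fprintf(stderr, "usage: %s <cmd> [args...]\n", argv[0]);
		return 1;
	}
	/* pid 0 means the calling thread. */
	if (sched_setscheduler(0, SCHED_IDLE, &sp)) {
		perror("sched_setscheduler");
		return 1;
	}
	execvp(argv[1], argv + 1); /* e.g. make -j$(nproc) */
	perror("execvp");
	return 1;
}

The policy is inherited across fork(), so a whole build can be run as
./sched-idle-wrap make -j$(nproc).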

> background load. We saved nearly 3 minutes of 'sys' time, indicating
> lower scheduler overhead.
>
> Mainline (6.19.0-rc8):
> real 9m28.403s
> sys 219m21.591s
>
> Patched:
> real 9m16.167s (-12.2s)
> sys 216m28.323s (-2m53s)
>
> I was initially concerned about the impact on domain balancing, but the
> significant reduction in 'sys' time during the kernel build confirms that
> we aren't seeing any regressive balancing overhead.
>
> 2. AI Inference (llama-batched-bench):
>
> For background LLM inference, the patch consistently delivered about 8.7%
> more throughput when running near core saturation.
>
> 51 Threads: 30.03 t/s (vs 27.62 on Mainline) -> +8.7%
> 80 Threads: 27.20 t/s (vs 25.01 on Mainline) -> +8.7%
>
> 3. Scheduler Latency using schbench:
>
> The biggest win was in the p99.9 tail latency. Under the locking
> workload, the latency spikes dropped significantly.
> 4 Threads (Locking): 10085 us (vs 12421 us) -> -18.8%
> 8 Threads (Locking): 9563 us (vs 11589 us) -> -17.5%
>
> The patch really helps clean up the noise for background tasks on these
> large ARM platforms. Nice work.
>
> Tested-by: Shubhang Kaushik <shubhang@...amperecomputing.com>
>
> Regards,
> Shubhang Kaushik
>
> >       int cpu = rq->cpu;
> > -     int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
> > +     int busy = idle != CPU_IDLE && !sched_idle_rq(rq);
> >       unsigned long interval;
> >       struct sched_domain *sd;
> >       /* Earliest time when we have to do rebalance again */
> > @@ -12299,7 +12305,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> >                                * state even if we migrated tasks. Update it.
> >                                */
> >                               idle = idle_cpu(cpu);
> > -                             busy = !idle && !sched_idle_cpu(cpu);
> > +                             busy = !idle && !sched_idle_rq(rq);
> >                       }
> >                       sd->last_balance = jiffies;
> >                       interval = get_sd_balance_interval(sd, busy);
> > --
> > 2.34.1
> >
> >
