Message-ID: <7313b320-779a-f5ad-418c-9c15f0cc6986@os.amperecomputing.com>
Date: Wed, 4 Feb 2026 16:00:51 -0800 (PST)
From: Shubhang Kaushik <shubhang@...amperecomputing.com>
To: Christian Loehle <christian.loehle@....com>, linux-kernel@...r.kernel.org, 
    peterz@...radead.org, mingo@...hat.com, vincent.guittot@...aro.org
cc: juri.lelli@...hat.com, dietmar.eggemann@....com, kprateek.nayak@....com, 
    pierre.gondois@....com
Subject: Re: [PATCHv2] sched/fair: Skip SCHED_IDLE rq for SCHED_IDLE task

On Tue, 3 Feb 2026, Christian Loehle wrote:

> CPUs whose rq only have SCHED_IDLE tasks running are considered to be
> equivalent to truly idle CPUs during wakeup path. For fork and exec
> SCHED_IDLE is even preferred.
> This is based on the assumption that the SCHED_IDLE CPU is not in an
> idle state and might be in a higher P-state, allowing the task/wakee
> to run immediately without sharing the rq.
>
> However this assumption doesn't hold if the wakee has SCHED_IDLE policy
> itself, as it will share the rq with existing SCHED_IDLE tasks. In this
> case, we are better off continuing to look for a truly idle CPU.
>
> On an Intel Xeon 2-socket with 64 logical cores in total this yields
> for kernel compilation using SCHED_IDLE:
>
> +---------+----------------------+----------------------+--------+
> | workers | mainline (seconds)   | patch (seconds)      | delta% |
> +=========+======================+======================+========+
> |       1 | 4384.728 ± 21.085    | 3843.250 ± 16.235    | -12.35 |
> |       2 | 2242.513 ± 2.099     | 1971.696 ± 2.842     | -12.08 |
> |       4 | 1199.324 ± 1.823     | 1033.744 ± 1.803     | -13.81 |
> |       8 |  649.083 ± 1.959     |  559.123 ± 4.301     | -13.86 |
> |      16 |  370.425 ± 0.915     |  325.906 ± 4.623     | -12.02 |
> |      32 |  234.651 ± 2.255     |  217.266 ± 0.253     |  -7.41 |
> |      64 |  202.286 ± 1.452     |  197.977 ± 2.275     |  -2.13 |
> |     128 |  217.092 ± 1.687     |  212.164 ± 1.138     |  -2.27 |
> +---------+----------------------+----------------------+--------+
>
> Signed-off-by: Christian Loehle <christian.loehle@....com>

I’ve been testing this patch on an 80-core Ampere Altra (Neoverse-N1) and 
the results look very solid. On these high-core-count ARM systems, we 
definitely see the benefit of being pickier about where we place 
SCHED_IDLE tasks.

Treating an rq occupied only by SCHED_IDLE tasks as idle seems to cause 
unnecessary packing that shows up in tail latency. By spreading these 
background tasks to truly idle cores, I'm seeing a nice boost in both 
background compilation and AI inference throughput.
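
To restate my understanding of the behavioral change in C terms (an
illustrative sketch only, not the actual hunk; wakeup_idle_cpu() is a
name I made up here): a SCHED_IDLE-only rq should only count as idle
for wakeup placement when the wakee is not itself SCHED_IDLE, since
only then can it preempt and run immediately:

/*
 * Illustrative sketch, not the patch hunk: treat a CPU whose rq holds
 * only SCHED_IDLE tasks as "idle" for wakeup placement only if the
 * wakee can actually preempt them, i.e. is not SCHED_IDLE itself.
 */
static inline bool wakeup_idle_cpu(struct task_struct *p, int cpu)
{
	if (idle_cpu(cpu))
		return true;

	/* A SCHED_IDLE wakee would just share the rq; keep searching. */
	return !task_has_idle_policy(p) && sched_idle_cpu(cpu);
}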

The reduction in sys time also confirms that domain balancing remains 
stable despite the refactor from sched_idle_cpu(cpu) to sched_idle_rq(rq) 
that you and Prateek mentioned.
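
For anyone following along, the two helpers are equivalent as far as I
can tell; from my reading of kernel/sched/fair.c they are roughly as
below (exact field names may have shifted in recent releases):

static int sched_idle_rq(struct rq *rq)
{
	return unlikely(rq->nr_running == rq->cfs.idle_h_nr_running &&
			rq->nr_running);
}

static int sched_idle_cpu(int cpu)
{
	return sched_idle_rq(cpu_rq(cpu));
}

so passing the rq directly just skips the cpu_rq() lookup when the
caller already has the pointer; the semantics are unchanged.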

1. Background Kernel Compilation:

I ran `time nice -n 19 make -j$(nproc)` to see how it handles a heavy 
background load. We saved nearly 3 minutes of 'sys' time, pointing to 
lower scheduler overhead.

Mainline (6.19.0-rc8):
real 9m28.403s
sys 219m21.591s

Patched:
real 9m16.167s (-12.2s)
sys 216m28.323s (-2m53s)

I was initially concerned about the impact on domain balancing, but the 
significant reduction in 'sys' time during the kernel build confirms that 
we aren't seeing any regressive balancing overhead.
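
For context on why I expected 'busy' to matter at all: as far as I can
tell the flag computed from sched_idle_rq() in the hunks quoted below
only feeds get_sd_balance_interval(), which stretches the interval
between rebalances for busy CPUs, roughly (simplified from my reading
of fair.c):

static unsigned long get_sd_balance_interval(struct sched_domain *sd,
					     int cpu_busy)
{
	unsigned long interval = sd->balance_interval;

	/* Busy CPUs rebalance less often. */
	if (cpu_busy)
		interval *= sd->busy_factor;

	return clamp(msecs_to_jiffies(interval), 1UL,
		     max_load_balance_interval);
}

so the worst case there is rebalancing a bit more or less often, not
balancing wrongly, which is consistent with the stable sys time above.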

2. AI Inference (llama-batched-bench):

For background LLM inference, the patch consistently delivered about 8.7% 
more throughput when running near core saturation.

51 Threads: 30.03 t/s (vs 27.62 on Mainline) -> +8.7%
80 Threads: 27.20 t/s (vs 25.01 on Mainline) -> +8.7%

3. Scheduler Latency using schbench:

The biggest win was in the p99.9 tail latency. Under the locking 
workload, the latency spikes dropped significantly:

4 Threads (Locking): 10085 us (vs 12421 us) -> -18.8%
8 Threads (Locking): 9563 us (vs 11589 us) -> -17.5%

The patch really helps clean up the noise for background tasks on these 
large ARM platforms. Nice work.

Tested-by: Shubhang Kaushik <shubhang@...amperecomputing.com>

Regards,
Shubhang Kaushik

> 	int cpu = rq->cpu;
> -	int busy = idle != CPU_IDLE && !sched_idle_cpu(cpu);
> +	int busy = idle != CPU_IDLE && !sched_idle_rq(rq);
> 	unsigned long interval;
> 	struct sched_domain *sd;
> 	/* Earliest time when we have to do rebalance again */
> @@ -12299,7 +12305,7 @@ static void sched_balance_domains(struct rq *rq, enum cpu_idle_type idle)
> 				 * state even if we migrated tasks. Update it.
> 				 */
> 				idle = idle_cpu(cpu);
> -				busy = !idle && !sched_idle_cpu(cpu);
> +				busy = !idle && !sched_idle_rq(rq);
> 			}
> 			sd->last_balance = jiffies;
> 			interval = get_sd_balance_interval(sd, busy);
> -- 
> 2.34.1
>
>
