Message-ID: <28d5f45e-dff0-4073-a806-f8cc6f9fd0aa@arm.com>
Date: Thu, 13 Nov 2025 15:54:59 +0100
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: Shubhang Kaushik OS <Shubhang@...amperecomputing.com>,
 Vincent Guittot <vincent.guittot@...aro.org>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
 Juri Lelli <juri.lelli@...hat.com>, Steven Rostedt <rostedt@...dmis.org>,
 Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
 Valentin Schneider <vschneid@...hat.com>, Shubhang Kaushik <sh@...two.org>,
 Shijie Huang <Shijie.Huang@...erecomputing.com>,
 Frank Wang <zwang@...erecomputing.com>, Christopher Lameter <cl@...two.org>,
 Adam Li <adam.li@...erecomputing.com>,
 "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] sched/fair: Prefer cache locality for EAS wakeup

On 13.11.25 01:26, Shubhang Kaushik OS wrote:
>> From your previous answer on v1, I don't think that you use a
>> heterogeneous system, so EAS will not be enabled in your case, and even
>> when used find_energy_efficient_cpu() will be called before
> 
> I agree that the EAS-centric approach in the current patch is misplaced for our homogeneous systems.
> 
>> Otherwise you might want to check in wake_affine(), where we decide
>> between the local CPU and the previous CPU which one should be the target.
>> This can have an impact especially if they are not in the same LLC.
> 
> While wake_affine() modifications seem logical, I see that they cause performance regressions across the board due to the inherent trade-offs in altering that critical initial decision point.

Which testcases are you running on your Altra box? I assume it's a
single NUMA node (80 CPUs).

For us, `perf bench sched messaging` w/o CONFIG_SCHED_CLUSTER, so only a
PKG SD (i.e. sis() only returns prev or this CPU), gives better results
than w/ CONFIG_SCHED_CLUSTER.

> We might need to solve the non-idle fallback within `select_idle_sibling` to ring-fence the impact and preserve locality effectively.

IMHO, the scheduler only cares about shared LLC (and shared L2 with
CONFIG_SCHED_CLUSTER). Can you check:

$ cat /sys/devices/system/cpu/cpu0/cache/index*/{type,shared_cpu_map}
Data
Instruction
Unified
Unified                                                 <-- (1)
00000000,00000000,00000000,00000000,00000001
00000000,00000000,00000000,00000000,00000001
00000000,00000000,00000000,00000000,00000001
CPU mask > 00000000,00000000,00000000,00000000,00000001 <-- (1)

Does (1) exist? IMHO it doesn't.
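If it helps, here is a small POSIX-shell helper (purely illustrative, not
from the kernel tree; the function name is made up) to count the CPUs set
in a sysfs-style cpumask such as the shared_cpu_map output above. A truly
shared LLC mask would cover more than one CPU:

```shell
# Count set bits in a sysfs cpumask string like
# "00000000,00000000,00000000,00000000,00000001".
# On a live system you would feed it e.g.
#   count_cpus_in_mask "$(cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_map)"
count_cpus_in_mask() {
    hex=$(printf '%s' "$1" | tr -d ',' | tr 'A-F' 'a-f')
    n=0
    while [ -n "$hex" ]; do
        d=${hex%"${hex#?}"}   # first hex digit
        hex=${hex#?}          # rest of the string
        case $d in            # per-digit popcount
            0) b=0 ;;
            1|2|4|8) b=1 ;;
            3|5|6|9|a|c) b=2 ;;
            7|b|d|e) b=3 ;;
            f) b=4 ;;
        esac
        n=$((n + b))
    done
    echo "$n"
}

count_cpus_in_mask 00000000,00000000,00000000,00000000,00000001   # -> 1
```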

I assume your machine is quite unique here. IIRC, you configure 2-CPU
groups in your ACPI PPTT, which then form a 2-CPU cluster_cpumask, and
since your core_mask (in cpu_coregroup_mask()) has only 1 CPU, it gets
set to the cluster_cpumask, so in the end you have a 2-CPU MC SD and no
CLS SD, plus an 80-CPU PKG SD.

This CLS->MC propagation is somewhat important since only then do you get
a valid 'sd = rcu_dereference(per_cpu(sd_llc, target))' in sis(), so you
don't just return target (prev or this CPU).
But I can imagine that your MC cpumask is way too small for the
SIS_UTIL-based selection of an idle CPU.
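To illustrate that point with a toy model (this is NOT the kernel's actual
SIS_UTIL formula, which derives its scan depth in update_idle_cpu_scan()
from LLC util_avg; the helper and its linear formula are made up): if the
number of CPUs scanned for an idle candidate shrinks as LLC utilization
grows, a 2-CPU MC domain leaves almost no scan window compared to an
80-CPU LLC:

```shell
# Toy sketch of utilization-bounded idle-CPU scanning: scan roughly the
# idle fraction of the LLC, but at least 1 CPU. Hypothetical helper,
# for illustration only.
sis_util_scan_depth() {
    llc_weight=$1   # nr of CPUs in the LLC domain
    util_pct=$2     # LLC utilization in percent
    nr=$((llc_weight * (100 - util_pct) / 100))
    [ "$nr" -lt 1 ] && nr=1
    echo "$nr"
}

sis_util_scan_depth 80 50   # 80-CPU LLC, 50% utilized -> 40
sis_util_scan_depth 2 50    # 2-CPU MC domain          -> 1
```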

[...]
