Message-ID:
 <MW6PR01MB83686A7A6D627A7A056CF0E1F5CAA@MW6PR01MB8368.prod.exchangelabs.com>
Date: Fri, 14 Nov 2025 18:27:13 +0000
From: Shubhang Kaushik OS <Shubhang@...amperecomputing.com>
To: Dietmar Eggemann <dietmar.eggemann@....com>, Vincent Guittot
	<vincent.guittot@...aro.org>
CC: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
	Juri Lelli <juri.lelli@...hat.com>, Steven Rostedt <rostedt@...dmis.org>, Ben
 Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>, Valentin Schneider
	<vschneid@...hat.com>, Shubhang Kaushik <sh@...two.org>, Shijie Huang
	<Shijie.Huang@...erecomputing.com>, Frank Wang <zwang@...erecomputing.com>,
	Christopher Lameter <cl@...two.org>, Adam Li <adam.li@...erecomputing.com>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] sched/fair: Prefer cache locality for EAS wakeup

Our current kernel has CONFIG_SCHED_CLUSTER enabled. While the topology reports 2 NUMA nodes, only node0 actually contains CPUs (0-79); node1 is empty.

NUMA:
  NUMA node(s):              2
  NUMA node0 CPU(s):         0-79
  NUMA node1 CPU(s):

I run similar perf test cases, along with MySQL and AI workloads.

> IMHO, the scheduler only cares about shared LLC (and shared L2 with
> CONFIG_SCHED_CLUSTER). Can you check:

$ cat /sys/devices/system/cpu/cpu0/cache/index*/{type,shared_cpu_map}
Data
Instruction
Unified
0000,00000000,00000001
0000,00000000,00000001
0000,00000000,00000001

The output confirms that the extra Unified cache entry (1) does not exist in our sysfs view.
Correct, this Altra machine has only a 2-CPU MC SD, which results in the small MC cpumask; see the topology dump and the sketch below.

[
    "cpu78",
    {
        "MC": "['78-79']",
        "PKG": "['0-79']"
    }
][
    "cpu79",
    {
        "MC": "['78-79']",
        "PKG": "['0-79']"
    }
]
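
For reference, this is roughly what the scheduler's cache-sharing check,
cpus_share_cache(), boils down to (a simplified sketch based on the
per-CPU LLC domain id, not the literal kernel code):

/*
 * Two CPUs "share cache" for the scheduler iff they carry the same LLC
 * domain id (sd_llc_id). Simplified sketch, not the exact kernel code.
 */
bool cpus_share_cache(int this_cpu, int that_cpu)
{
	if (this_cpu == that_cpu)
		return true;

	/*
	 * With the 2-CPU MC SD above, only CPUs 78 and 79 end up with
	 * the same sd_llc_id, so e.g. CPU 0 and CPU 79 are treated as
	 * not cache-sharing even though they sit in the same package.
	 */
	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}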

Thanks,
Shubhang Kaushik

________________________________________
From: Dietmar Eggemann <dietmar.eggemann@....com>
Sent: Thursday, November 13, 2025 6:54 AM
To: Shubhang Kaushik OS; Vincent Guittot
Cc: Ingo Molnar; Peter Zijlstra; Juri Lelli; Steven Rostedt; Ben Segall; Mel Gorman; Valentin Schneider; Shubhang Kaushik; Shijie Huang; Frank Wang; Christopher Lameter; Adam Li; linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] sched/fair: Prefer cache locality for EAS wakeup

On 13.11.25 01:26, Shubhang Kaushik OS wrote:
>> From your previous answer on v1, I don't think that you use a
>> heterogeneous system, so EAS will not be enabled in your case, and even
>> when used, find_energy_efficient_cpu() will be called before
>
> I agree that the EAS-centric approach in the current patch is misplaced for our homogeneous systems.
>
>> Otherwise you might want to check in wake_affine() where we decide
>> between local cpu and previous cpu which one should be the target.
>> This can have an impact, especially if they are not in the same LLC
>
> While wake_affine() modifications seem logical, I see that they cause performance regressions across the board due to the inherent trade-offs in altering that critical initial decision point.

Which testcases are you running on your Altra box? I assume it's a
single NUMA node (80 CPUs).

For us, `perf bench sched messaging` w/o CONFIG_SCHED_CLUSTER, so only a
PKG SD (i.e. sis() only returns prev or this CPU), gives better results
than w/ CONFIG_SCHED_CLUSTER.

> We might need to solve the non-idle fallback within `select_idle_sibling` to ring-fence the impact and preserve locality effectively.

IMHO, the scheduler only cares about shared LLC (and shared L2 with
CONFIG_SCHED_CLUSTER). Can you check:

$ cat /sys/devices/system/cpu/cpu0/cache/index*/{type,shared_cpu_map}
Data
Instruction
Unified
Unified                                                 <-- (1)
00000000,00000000,00000000,00000000,00000001
00000000,00000000,00000000,00000000,00000001
00000000,00000000,00000000,00000000,00000001
CPU mask > 00000000,00000000,00000000,00000000,00000001 <-- (1)

Does (1) exist? IMHO it doesn't.

I assume your machine is quite unique here. IIRC, you configure 2-CPU
groups in your ACPI PPTT, which then form a 2-CPU cluster_cpumask, and
since your core_mask (in cpu_coregroup_mask()) has only 1 CPU, it gets
set to the cluster_cpumask. So in the end you have a 2-CPU MC SD and no
CLS SD, plus an 80-CPU PKG SD.
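
To illustrate (a rough, simplified sketch of that selection in
cpu_coregroup_mask(); the real code lives in drivers/base/arch_topology.c):

const struct cpumask *cpu_coregroup_mask(int cpu)
{
	/* Start from the NUMA-node / package siblings ... */
	const cpumask_t *core_mask = cpumask_of_node(cpu_to_node(cpu));

	/*
	 * ... which is normally narrowed to core/LLC siblings (omitted
	 * here), leaving just 1 CPU on your box ...
	 */

	/*
	 * ... but with CONFIG_SCHED_CLUSTER, if core_mask is a subset of
	 * the cluster siblings (your 1-CPU core_mask vs. the 2-CPU PPTT
	 * cluster), core_mask is widened to the cluster. That is how you
	 * end up with a 2-CPU MC SD and no separate CLS SD.
	 */
	if (IS_ENABLED(CONFIG_SCHED_CLUSTER) &&
	    cpumask_subset(core_mask, &cpu_topology[cpu].cluster_sibling))
		core_mask = &cpu_topology[cpu].cluster_sibling;

	return core_mask;
}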

This CLS->MC propagation is important since only then do you get a
valid 'sd = rcu_dereference(per_cpu(sd_llc, target))' in sis(), so you
don't just return target (prev or this CPU).
But I can imagine that your MC cpumask is way too small for the
SIS_UTIL-based selection of an idle CPU.
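
I.e., roughly (a simplified sketch of the tail of select_idle_sibling(),
not the exact code):

	/*
	 * Without a valid LLC domain for the target, sis() just returns
	 * target (prev or this CPU) ...
	 */
	sd = rcu_dereference(per_cpu(sd_llc, target));
	if (!sd)
		return target;

	/*
	 * ... otherwise the SIS_UTIL-limited scan for an idle CPU runs
	 * over the LLC domain span, which on your box is only the 2-CPU
	 * MC cpumask.
	 */
	i = select_idle_cpu(p, sd, has_idle_core, target);
	if ((unsigned int)i < nr_cpumask_bits)
		return i;

	return target;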

[...]
