Message-ID: <e0774ea1-27ac-4aa1-9a96-5e27aa8328e6@amd.com>
Date: Thu, 6 Feb 2025 14:40:13 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Peter Zijlstra <peterz@...radead.org>
CC: Naman Jain <namjain@...ux.microsoft.com>, Ingo Molnar <mingo@...hat.com>,
	Juri Lelli <juri.lelli@...hat.com>, Vincent Guittot
	<vincent.guittot@...aro.org>, Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>, Ben Segall <bsegall@...gle.com>, Mel
 Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>,
	<stable@...r.kernel.org>, <linux-kernel@...r.kernel.org>, Steve Wahl
	<steve.wahl@....com>, Saurabh Singh Sengar <ssengar@...ux.microsoft.com>,
	<srivatsa@...il.mit.edu>, Michael Kelley <mhklinux@...look.com>
Subject: Re: [PATCH v3] sched/topology: Enable topology_span_sane check only
 for debug builds

Hello Peter,

On 2/5/2025 3:46 PM, Peter Zijlstra wrote:
> On Wed, Feb 05, 2025 at 03:43:54PM +0530, K Prateek Nayak wrote:
>> Hello Peter,
>>
>> Thank you for the background!
>>
>> On 2/5/2025 3:25 PM, Peter Zijlstra wrote:
>>> On Wed, Feb 05, 2025 at 03:18:24PM +0530, K Prateek Nayak wrote:
>>>
>>>> Have there been any reports on an x86 system / VM where
>>>> topology_span_sane() was tripped?
>>>
>>> At the very least Intel SNC 'feature' tripped it at some point. They
>>> figured it made sense to have the LLC span two nodes.

I'm 99% sure this was the topology_sane() check on the x86 side and
not the topology_span_sane() check in kernel/sched/topology.c.

I believe one of the original changes that did the plumbing for SNC was
commit 2c88d45edbb8 ("x86, sched: Treat Intel SNC topology as default,
COD as exception") from Alison, where they mention seeing the following
splat when running with SNC:

     sched: CPU #3's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.

This comes from the topology_sane() check in arch/x86/kernel/smpboot.c,
and match_llc() on the x86 side was modified to work around it.
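
For reference, topology_sane() is roughly the following (paraphrased
from memory, not a verbatim copy of arch/x86/kernel/smpboot.c):

     static bool topology_sane(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o,
                               const char *name)
     {
             int cpu1 = c->cpu_index, cpu2 = o->cpu_index;

             /* Warn once and refuse the sibling link when the two CPUs
              * sit on different NUMA nodes. */
             return !WARN_ONCE(cpu_to_node(cpu1) != cpu_to_node(cpu2),
                     "sched: CPU #%d's %s-sibling CPU #%d is not on the same node! "
                     "[node: %d != %d]. Ignoring dependency.\n",
                     cpu1, name, cpu2, cpu_to_node(cpu1), cpu_to_node(cpu2));
     }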

>>>
>>> But I think there were some really dodgy VMs too.

For VMs too, it is easy to trip the topology_sane() check on the x86
side. With QEMU, I can run:

     qemu-system-x86_64 -enable-kvm -cpu host \
     -smp cpus=32,sockets=2,cores=8,threads=2 \
     ...
     -numa node,cpus=0-7,cpus=16-23,memdev=m0,nodeid=0 \
     -numa node,cpus=8-15,cpus=24-31,memdev=m1,nodeid=1 \
     ...

and I get:

     sched: CPU #8's llc-sibling CPU #0 is not on the same node! [node: 1 != 0]. Ignoring dependency.

This is because consecutive CPUs (0-1,2-3,...) are SMT siblings and
CPUs 0-15 are on the same socket as a result of how QEMU presents the
MADT to the guest, but then I go ahead and mess things up by declaring
that CPUs 0-7,16-23 are on one NUMA node and the rest are on the other.

I still haven't managed to trip topology_span_sane() though.

>>>
>>> But yeah, its not been often. But basically dodgy BIOS/VM data can mess
>>> up things badly enough for it to trip.
>>
>> Has it ever happened without tripping the topology_sane() check first
>> on the x86 side?

What topology_span_sane() does is iterate over all the CPUs at a given
topology level and make sure that the cpumask of a CPU at that level is
the same as the cpumask of every other CPU set in that mask.

If two CPUs are set in each other's masks, the masks should be
identical. If they are not set in each other's masks, the masks should
be disjoint.
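
In code, the check is essentially the following (paraphrased from
kernel/sched/topology.c, modulo the exact iteration details):

     static bool topology_span_sane(struct sched_domain_topology_level *tl,
                                    const struct cpumask *cpu_map, int cpu)
     {
             int i;

             /* NUMA levels are allowed to overlap */
             if (tl->flags & SDTL_OVERLAP)
                     return true;

             for_each_cpu(i, cpu_map) {
                     if (i == cpu)
                             continue;

                     /* Non-NUMA levels may not partially overlap: either
                      * the two spans are identical or they are disjoint. */
                     if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
                         cpumask_intersects(tl->mask(cpu), tl->mask(i)))
                             return false;
             }

             return true;
     }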

On x86, the way set_cpu_sibling_map() works, CPUs are set on each other's
shared masks iff match_*() returns true:

o For SMT, this means:

   - If X86_FEATURE_TOPOEXT is set:
     - pkg_id must match.
     - die_id must match.
     - amd_node_id must match.
     - llc_id must match.
     - Either core_id or cu_id must match. (*)
     - NUMA nodes must match.

   - If !X86_FEATURE_TOPOEXT:
     - pkg_id must match.
     - die_id must match.
     - core_id must match.
     - NUMA nodes must match.

o For CLUSTER this means:

   - If l2c_id is not populated (== BAD_APICID)
     - Same conditions as SMT.

   - If l2c_id is populated (!= BAD_APICID)
     - l2c_id must match.
     - NUMA nodes must match.

o For MC it means:

   - llc_id must be populated (!= BAD_APICID) and must match.
   - If INTEL_SNC: pkg_id must match.
   - If !INTEL_SNC: NUMA nodes must match.

o For PKG domain:
   
   - Inserted only if !x86_has_numa_in_package.
   - CPUs should be in same NUMA node.

All in all, other than the one (*) decision point, everything else has
to strictly match for CPUs to be set in each other's CPU masks. And if
they match with one CPU, they should match with all other CPUs in the
mask, and if they mismatch with one, they should mismatch with all,
leading to link_mask() never being called.
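
To make that concrete, the relevant part of set_cpu_sibling_map() looks
roughly like this (heavily simplified, from memory):

     #define link_mask(mfunc, c1, c2)                    \
     do {                                                \
             cpumask_set_cpu((c1), mfunc(c2));           \
             cpumask_set_cpu((c2), mfunc(c1));           \
     } while (0)

     void set_cpu_sibling_map(int cpu)
     {
             struct cpuinfo_x86 *c = &cpu_data(cpu), *o;
             int i;

             ...

             for_each_cpu(i, cpu_sibling_setup_mask) {
                     o = &cpu_data(i);

                     /* CPUs land in each other's masks iff match_*()
                      * agrees, and match_*() internally runs through
                      * the checks listed above. */
                     if (match_smt(c, o))
                             link_mask(topology_sibling_cpumask, cpu, i);

                     if (match_llc(c, o))
                             link_mask(cpu_llc_shared_mask, cpu, i);

                     ...
             }
     }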

This is why I think the topology_span_sane() check is redundant when
the x86 bits have already ensured masks cannot overlap in all cases
except, potentially, the (*) case.

So circling back to my original question around "SDTL_ARCH_VERIFIED",
would folks be okay with an early bailout from topology_span_sane() on:

     if (!sched_debug() && (tl->flags & SDTL_ARCH_VERIFIED))
     	return;

and, more importantly, do folks care enough about topology_span_sane()
to have it run on other architectures, rather than having it guarded
behind just "sched_debug()", which starts off as false by default?
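
For illustration, a purely hypothetical sketch of what I have in mind
(SDTL_ARCH_VERIFIED does not exist today, and the initializers below
are abbreviated) would be x86 tagging the levels it already verifies
when registering its topology:

     static struct sched_domain_topology_level x86_topology[] = {
             /* Hypothetical: mark levels whose spans the arch already
              * guarantees to be non-overlapping via set_cpu_sibling_map(). */
             { cpu_smt_mask, cpu_smt_flags, SDTL_ARCH_VERIFIED, SD_INIT_NAME(SMT) },
             { cpu_coregroup_mask, cpu_core_flags, SDTL_ARCH_VERIFIED, SD_INIT_NAME(MC) },
             { cpu_cpu_mask, NULL, SDTL_ARCH_VERIFIED, SD_INIT_NAME(PKG) },
             { NULL, },
     };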

(Sorry for the long answer explaining my thought process.)

> 
> That I can't remember, sorry :/

-- 
Thanks and Regards,
Prateek

