Message-ID: <20140916032920.GH2840@worktop.localdomain>
Date: Tue, 16 Sep 2014 05:29:20 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Dave Hansen <dave@...1.net>
Cc: linux-kernel@...r.kernel.org, borislav.petkov@....com,
andreas.herrmann3@....com, mingo@...nel.org, hpa@...ux.intel.com,
ak@...ux.intel.com
Subject: Re: [PATCH] x86: Consider multiple nodes in a single socket to be
"sane"
On Mon, Sep 15, 2014 at 03:26:41PM -0700, Dave Hansen wrote:
>
> I'm getting the spew below when booting with Haswell (Xeon
> E5-2699) CPUs and the "Cluster-on-Die" (CoD) feature enabled in
> the BIOS.
What is this Cluster-on-Die thing? I've heard of it before but could
never find anything on it.
> This also fixes sysfs because CPUs with the same 'physical_package_id'
> in /sys/devices/system/cpu/cpu*/topology/ are not listed together
> in the same 'core_siblings_list'. This violates a statement from
> Documentation/ABI/testing/sysfs-devices-system-cpu:
>
> core_siblings: internal kernel map of cpu#'s hardware threads
> within the same physical_package_id.
>
> core_siblings_list: human-readable list of the logical CPU
> numbers within the same physical_package_id as cpu#.
No, that statement is wrong; it assumes physical_package_id is a good
identifier for a node, and clearly that is no longer true.
The idea is that core_siblings (or rather cpu_core_mask) is a mask of
all cores on a node.
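To illustrate with made-up numbers (one socket split into two CoD nodes
of four cores each, no HT), you would see something like:

  $ cat /sys/devices/system/cpu/cpu0/topology/physical_package_id
  0
  $ cat /sys/devices/system/cpu/cpu0/topology/core_siblings_list
  0-3

That is, core_siblings_list covers cpu0's node rather than the whole
package, which is exactly the case the documentation text above never
anticipated.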
> The sysfs effects here cause an issue with the hwloc tool where
> it gets confused and thinks there are more sockets than are
> physically present.
Meh, so then we need another mask.
The important bit you didn't show was the scheduler domain setup. I
suspect it all works by accident, not by design.
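For reference: with CONFIG_SCHED_DEBUG=y and the sched_debug boot
parameter the domain construction is dumped at boot, so something like

  $ dmesg | grep -A4 'attaching sched-domain'

(illustrative invocation; the exact message text may differ by kernel
version) would show which levels survived and what spans they ended up
with.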
> diff -puN arch/x86/kernel/smpboot.c~hsw-cod-is-sane arch/x86/kernel/smpboot.c
> --- a/arch/x86/kernel/smpboot.c~hsw-cod-is-sane 2014-09-15 14:56:20.012314468 -0700
> +++ b/arch/x86/kernel/smpboot.c 2014-09-15 14:58:58.837506644 -0700
> @@ -344,10 +344,13 @@ static bool match_llc(struct cpuinfo_x86
> static bool match_mc(struct cpuinfo_x86 *c, struct cpuinfo_x86 *o)
> {
> if (c->phys_proc_id == o->phys_proc_id) {
> - if (cpu_has(c, X86_FEATURE_AMD_DCM))
> - return true;
> -
> - return topology_sane(c, o, "mc");
> + /*
> + * We used to enforce that 'c' and 'o' be on the
> + * same node, but AMD's DCM and Intel's Cluster-
> + * on-Die (CoD) support both have physical
> + * processors that span NUMA nodes.
> + */
> + return true;
> }
> return false;
> }
This is wrong (and I suppose the AMD case was already wrong). That
function is supposed to match a multi-core group, which is very much
meant to be smaller than or equal to a node, not to span nodes.
The scheduler assumes SMT <= LLC <= MC <= NODE; if setting the MC mask
to cover multiple nodes works, it's by accident.
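To make that nesting concrete, here is a minimal user-space sketch
(illustrative masks, not taken from any real topology) that checks the
SMT <= LLC <= MC <= NODE ordering with plain bitmasks; an MC mask that
covers both nodes of a CoD socket fails the last check:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Made-up masks for one CoD socket: 8 cpus, two nodes of 4, all seen
 * from cpu0's point of view.  Bit n set == cpu n is in the mask.
 */
static const uint64_t smt  = 0x01;	/* cpu0 alone (no HT here)         */
static const uint64_t llc  = 0x0f;	/* cpus sharing cpu0's LLC slice   */
static const uint64_t node = 0x0f;	/* cpus in cpu0's NUMA node        */
static const uint64_t mc   = 0xff;	/* whole package: spans both nodes */

static bool subset(uint64_t a, uint64_t b)
{
	return (a & ~b) == 0;		/* every cpu in a is also in b */
}

int main(void)
{
	printf("SMT <= LLC : %d\n", subset(smt, llc));	/* 1, fine   */
	printf("LLC <= MC  : %d\n", subset(llc, mc));	/* 1, fine   */
	printf("MC <= NODE : %d\n", subset(mc, node));	/* 0, broken */
	return 0;
}

With a per-node MC mask (0x0f here) all three checks hold; making
match_mc() say "true" for the whole package is precisely what breaks
the last one.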