linux-kernel - Re: [patch V3 27/40] x86/cpu: Provide a sane leaf 0xb/0x1f parser

Message-ID: <8e5bbbc91ff9f74244efe916a4113999abc52213.camel@intel.com>
Date:   Sat, 12 Aug 2023 08:21:55 +0000
From:   "Zhang, Rui" <rui.zhang@...el.com>
To:     "tglx@...utronix.de" <tglx@...utronix.de>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
CC:     "Gross, Jurgen" <jgross@...e.com>,
        "mikelley@...rosoft.com" <mikelley@...rosoft.com>,
        "arjan@...ux.intel.com" <arjan@...ux.intel.com>,
        "x86@...nel.org" <x86@...nel.org>,
        "thomas.lendacky@....com" <thomas.lendacky@....com>,
        "ray.huang@....com" <ray.huang@....com>,
        "andrew.cooper3@...rix.com" <andrew.cooper3@...rix.com>,
        "Sivanich, Dimitri" <dimitri.sivanich@....com>,
        "wei.liu@...nel.org" <wei.liu@...nel.org>
Subject: Re: [patch V3 27/40] x86/cpu: Provide a sane leaf 0xb/0x1f parser

Hi, Thomas,

On Wed, 2023-08-02 at 12:21 +0200, Thomas Gleixner wrote:
> detect_extended_topology() along with its early() variant is a classic
> example for duct tape engineering:
> 
>   - It evaluates an array of subleafs with a boatload of local variables
>     for the relevant topology levels instead of using an array to save the
>     enumerated information and propagate it to the right level
> 
>   - It has no boundary checks for subleafs
> 
>   - It prevents updating the die_id with a crude workaround instead of
>     checking for leaf 0xb which does not provide die information.
> 
>   - It's broken vs. the number of dies evaluation as it uses:
> 
>       num_processors[DIE_LEVEL] / num_processors[CORE_LEVEL]
> 
>     which "works" only correctly if there is none of the intermediate
>     topology levels (MODULE/TILE) enumerated.
> 
> There is zero value in trying to "fix" that code as the only proper fix is
> to rewrite it from scratch.
> 
> Implement a sane parser with proper code documentation, which will be used
> for the consolidated topology evaluation in the next step.
> 
> Signed-off-by: Thomas Gleixner <tglx@...utronix.de>
> ---
> V2: Fixed up the comment alignment for registers - Peterz
> ---
>  arch/x86/kernel/cpu/Makefile       |    2 
>  arch/x86/kernel/cpu/topology.h     |   12 +++
>  arch/x86/kernel/cpu/topology_ext.c |  136 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 149 insertions(+), 1 deletion(-)
> 
> --- a/arch/x86/kernel/cpu/Makefile
> +++ b/arch/x86/kernel/cpu/Makefile
> @@ -18,7 +18,7 @@ KMSAN_SANITIZE_common.o := n
>  KCSAN_SANITIZE_common.o := n
>  
>  obj-y                  := cacheinfo.o scattered.o
> -obj-y                  += topology_common.o topology.o
> +obj-y                  += topology_common.o topology_ext.o topology.o
>  obj-y                  += common.o
>  obj-y                  += rdrand.o
>  obj-y                  += match.o
> --- a/arch/x86/kernel/cpu/topology.h
> +++ b/arch/x86/kernel/cpu/topology.h
> @@ -16,6 +16,7 @@ void cpu_init_topology(struct cpuinfo_x8
>  void cpu_parse_topology(struct cpuinfo_x86 *c);
>  void topology_set_dom(struct topo_scan *tscan, enum x86_topology_domains dom,
>                       unsigned int shift, unsigned int ncpus);
> +bool cpu_parse_topology_ext(struct topo_scan *tscan);
>  
>  static inline u32 topo_shift_apicid(u32 apicid, enum x86_topology_domains dom)
>  {
> @@ -31,4 +32,15 @@ static inline u32 topo_relative_domain_i
>         return apicid & (x86_topo_system.dom_size[dom] - 1);
>  }
>  
> +/*
> + * Update a domain level after the fact without propagating. Used to fixup
> + * broken CPUID enumerations.
> + */
> +static inline void topology_update_dom(struct topo_scan *tscan, enum x86_topology_domains dom,
> +                                      unsigned int shift, unsigned int ncpus)
> +{
> +       tscan->dom_shifts[dom] = shift;
> +       tscan->dom_ncpus[dom] = ncpus;
> +}
> +
>  #endif /* ARCH_X86_TOPOLOGY_H */
> --- /dev/null
> +++ b/arch/x86/kernel/cpu/topology_ext.c
> @@ -0,0 +1,136 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/cpu.h>
> +
> +#include <asm/apic.h>
> +#include <asm/memtype.h>
> +#include <asm/processor.h>
> +
> +#include "cpu.h"
> +
> +enum topo_types {
> +       INVALID_TYPE    = 0,
> +       SMT_TYPE        = 1,
> +       CORE_TYPE       = 2,
> +       MODULE_TYPE     = 3,
> +       TILE_TYPE       = 4,
> +       DIE_TYPE        = 5,
> +       DIEGRP_TYPE     = 6,
> +       MAX_TYPE        = 7,
> +};
> +
> +/*
> + * Use a lookup table for the case that there are future types > 6 which
> + * describe an intermediate domain level which does not exist today.
> + *
> + * A table will also be handy to parse the new AMD 0x80000026 leaf which
> + * has defined different domain types, but otherwise uses the same layout
> + * with some of the reserved bits used for new information.
> + */
> +static const unsigned int topo_domain_map[MAX_TYPE] = {
> +       [SMT_TYPE]      = TOPO_SMT_DOMAIN,
> +       [CORE_TYPE]     = TOPO_CORE_DOMAIN,
> +       [MODULE_TYPE]   = TOPO_MODULE_DOMAIN,
> +       [TILE_TYPE]     = TOPO_TILE_DOMAIN,
> +       [DIE_TYPE]      = TOPO_DIE_DOMAIN,
> +       [DIEGRP_TYPE]   = TOPO_PKG_DOMAIN,

May I know why DIEGRP_TYPE is mapped to TOPO_PKG_DOMAIN?

> +};
> +
> +static inline bool topo_subleaf(struct topo_scan *tscan, u32 leaf, u32 subleaf)
> +{
> +       unsigned int dom, maxtype = leaf == 0xb ? CORE_TYPE + 1 : MAX_TYPE;
> +       struct {
> +               // eax
> +               u32     x2apic_shift    :  5, // Number of bits to shift APIC ID right
> +                                             // for the topology ID at the next level
> +                       __rsvd0         : 27; // Reserved
> +               // ebx
> +               u32     num_processors  : 16, // Number of processors at current level
> +                       __rsvd1         : 16; // Reserved
> +               // ecx
> +               u32     level           :  8, // Current topology level. Same as sub leaf number
> +                       type            :  8, // Level type. If 0, invalid
> +                       __rsvd2         : 16; // Reserved
> +               // edx
> +               u32     x2apic_id       : 32; // X2APIC ID of the current logical processor
> +       } sl;
> +
> +       cpuid_subleaf(leaf, subleaf, &sl);
> +
> +       if (!sl.num_processors || sl.type == INVALID_TYPE)
> +               return false;
> +
> +       if (sl.type >= maxtype) {

Sparse type values may legitimately show up in the future, and this
check would then break.
IMO, it is better to use a function to convert the type to a domain and
check for an unknown domain here, something like:
diff --git a/arch/x86/kernel/cpu/topology_ext.c b/arch/x86/kernel/cpu/topology_ext.c
index 5ddc5d24435e..7720a7bc7478 100644
--- a/arch/x86/kernel/cpu/topology_ext.c
+++ b/arch/x86/kernel/cpu/topology_ext.c
@@ -26,14 +26,27 @@ enum topo_types {
  * has defined different domain types, but otherwise uses the same layout
  * with some of the reserved bits used for new information.
  */
-static const unsigned int topo_domain_map[MAX_TYPE] = {
-	[SMT_TYPE]	= TOPO_SMT_DOMAIN,
-	[CORE_TYPE]	= TOPO_CORE_DOMAIN,
-	[MODULE_TYPE]	= TOPO_MODULE_DOMAIN,
-	[TILE_TYPE]	= TOPO_TILE_DOMAIN,
-	[DIE_TYPE]	= TOPO_DIE_DOMAIN,
-	[DIEGRP_TYPE]	= TOPO_PKG_DOMAIN,
-};
+
+static enum x86_topology_domains topo_type_to_domain(int type)
+{
+	switch (type) {
+	case SMT_TYPE:
+		return TOPO_SMT_DOMAIN;
+	case CORE_TYPE:
+		return TOPO_CORE_DOMAIN;
+	case MODULE_TYPE:
+		return TOPO_MODULE_DOMAIN;
+	case TILE_TYPE:
+		return TOPO_TILE_DOMAIN;
+	case DIE_TYPE:
+		return TOPO_DIE_DOMAIN;
+	case DIEGRP_TYPE:
+		return TOPO_PKG_DOMAIN;
+	default:
+		return TOPO_MAX_DOMAIN;
+	}
+
+}
 
 static inline bool topo_subleaf(struct topo_scan *tscan, u32 leaf, u32 subleaf)
 {
@@ -59,7 +72,8 @@ static inline bool topo_subleaf(struct topo_scan *tscan, u32 leaf, u32 subleaf)
 	if (!sl.num_processors || sl.type == INVALID_TYPE)
 		return false;
 
-	if (sl.type >= maxtype) {
+	dom = topo_type_to_domain(sl.type);
+	if (dom == TOPO_MAX_DOMAIN) {
 		/*
 		 * As the subleafs are ordered in domain level order, this
 		 * could be recovered in theory by propagating the
@@ -84,7 +98,6 @@ static inline bool topo_subleaf(struct topo_scan *tscan, u32 leaf, u32 subleaf)
 		return true;
 	}
 
-	dom = topo_domain_map[sl.type];
 	if (!dom) {
 		tscan->c->topo.initial_apicid = sl.x2apic_id;
 	} else if (tscan->c->topo.initial_apicid != sl.x2apic_id) {

> +               /*
> +                * As the subleafs are ordered in domain level order, this
> +                * could be recovered in theory by propagating the
> +                * information at the last parsed level.
> +                *
> +                * But if the infinite wisdom of hardware folks decides to
> +                * create a new domain type between CORE and MODULE or DIE
> +                * and DIEGRP, then that would overwrite the CORE or DIE
> +                * information.

Sorry, I'm confused here.

Say we have CORE, FOO, MODULE. The subleaf of FOO must then be higher
than that of CORE but lower than that of MODULE.
So we parse CORE first and propagate the info to FOO/MODULE, then
parse FOO and propagate to MODULE, and parse MODULE last.
How could we overwrite the info of a lower level?
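
To make my understanding concrete, here is a rough user-space sketch (not
kernel code). The enum, the struct and set_dom() are hypothetical names, FOO
stands for an imaginary future level, and I am assuming that setting a level
only propagates upwards, which is how I read topology_set_dom():

```c
#include <assert.h>

/* Hypothetical domain order; DOM_FOO is an imaginary future level. */
enum dom { DOM_SMT, DOM_CORE, DOM_FOO, DOM_MODULE, DOM_MAX };

struct scan {
	unsigned int shift[DOM_MAX];
};

/*
 * Store the shift for domain d and propagate it to all higher domains.
 * Lower domains are never touched (assumption about the real helper).
 */
static void set_dom(struct scan *s, enum dom d, unsigned int shift)
{
	assert(d < DOM_MAX);

	for (unsigned int i = d; i < DOM_MAX; i++)
		s->shift[i] = shift;
}
```

Calling set_dom() in subleaf order (SMT, CORE, FOO, MODULE) leaves the CORE
slot with its own shift after FOO and MODULE have been set. Note that the
sketch assumes FOO has a slot of its own, so I may be missing the case you
have in mind.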

> +                *
> +                * It really would have been too obvious to make the domain
> +                * type space sparse and leave a few reserved types between
> +                * the points which might change instead of forcing
> +                * software to either create a monstrosity of workarounds
> +                * or just being up the creek without a paddle.

Agreed.
With a sparse type space, we would know the relationship between different
types without knowing what each type really means.

> +                *
> +                * Refuse to implement monstrosity, emit an error and try
> +                * to survive.
> +                */
> +               pr_err_once("Topology: leaf 0x%x:%d Unknown domain type %u\n",
> +                           leaf, subleaf, sl.type);
> +               return true;

To keep it short: I can think of a couple of cases that break Linux
in different ways if we ignore the CPU topology info of an unknown
level.

So I just want to understand the strategy here. Does this mean that
we're not looking for a future-proof solution, and instead plan to
take future updates (patching enum topo_types/enum
x86_topology_domains/topo_domain_map) whenever a new level is invented?


TBH, I'm still thinking of a future-proof proposal here.
Currently, Linux only cares about pkg_id/core_id/die_id and the
relationship between these three levels.
1. For package id: pkg_id_low = FOO.x2apic_shift (FOO is the highest
   enumerated level, whether or not its type is known).
2. For core_id: as the SMT level is always enumerated, core_id_low =
   SMT.x2apic_shift, core_id_high = pkg_id_low - 1.
3. For die_id: make the Linux die *OPTIONAL*.
   When DIE is enumerated via CPUID.1F, die_id_low = FOO.x2apic_shift
   (FOO is the next lower enumerated level below DIE, whether or not its
   type is known), die_id_high = pkg_id_low - 1.
   When DIE is not enumerated via CPUID.1F, the Linux die does not
   exist: adjust the die-related topology information (say, die_id = -1,
   topology_max_dies_per_package = 0, etc.) and don't expose the die
   sysfs I/F.

With this, we can guarantee that all the available topology information
is always valid, even when running on future platforms.
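
The three rules above can be sketched roughly as follows. This is a
user-space illustration only: struct level, struct id_bits and
derive_id_bits() are made-up names, the levels are assumed to be passed in
subleaf order with SMT first, and only the SMT (1) and DIE (5) type values
are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

#define TYPE_SMT	1	/* same value as SMT_TYPE in the patch */
#define TYPE_DIE	5	/* same value as DIE_TYPE in the patch */

/* One enumerated subleaf: its level type and its x2apic_shift. */
struct level {
	unsigned int type;
	unsigned int shift;
};

/* Bit positions inside the APIC ID; die_* is -1 when DIE is absent. */
struct id_bits {
	int pkg_low;
	int core_low, core_high;
	int die_low, die_high;
};

static struct id_bits derive_id_bits(const struct level *lv, size_t n)
{
	struct id_bits b = { .die_low = -1, .die_high = -1 };

	/* Rule 2 relies on SMT always being enumerated as the first level. */
	assert(n > 0 && lv[0].type == TYPE_SMT);

	/* Rule 1: the package starts above the highest enumerated level. */
	b.pkg_low = (int)lv[n - 1].shift;

	/* Rule 2: the core ID spans everything between SMT and package. */
	b.core_low = (int)lv[0].shift;
	b.core_high = b.pkg_low - 1;

	/* Rule 3: die is optional and starts above the level below DIE. */
	for (size_t i = 1; i < n; i++) {
		if (lv[i].type == TYPE_DIE) {
			b.die_low = (int)lv[i - 1].shift;
			b.die_high = b.pkg_low - 1;
		}
	}
	return b;
}
```

An unknown type sitting between CORE and DIE does not need to be understood
here; only its shift is used, as the lower boundary of the die ID.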

what do you think?

thanks,
rui