lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230725201614.GB97682@hpe.com>
Date:   Tue, 25 Jul 2023 15:17:33 -0500
From:   Dimitri Sivanich <sivanich@....com>
To:     Thomas Gleixner <tglx@...utronix.de>
Cc:     LKML <linux-kernel@...r.kernel.org>, x86@...nel.org,
        Tom Lendacky <thomas.lendacky@....com>,
        Andrew Cooper <andrew.cooper3@...rix.com>,
        Arjan van de Ven <arjan@...ux.intel.com>,
        "James E.J. Bottomley" <jejb@...ux.ibm.com>,
        Dick Kennedy <dick.kennedy@...adcom.com>,
        James Smart <james.smart@...adcom.com>,
        "Martin K. Petersen" <martin.petersen@...cle.com>,
        linux-scsi@...r.kernel.org, linux-hwmon@...r.kernel.org,
        Jean Delvare <jdelvare@...e.com>,
        Huang Rui <ray.huang@....com>,
        Guenter Roeck <linux@...ck-us.net>,
        Steve Wahl <steve.wahl@....com>,
        Dimitri Sivanich <dimitri.sivanich@....com>,
        Russ Anderson <russ.anderson@....com>
Subject: Re: [patch 00/29] x86/cpu: Rework the topology evaluation

We have successfully booted a 32-socket, 3840 cpu HPE Sapphire Rapids system
with your patchset applied, as found in:
  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git topo-cpuid-v1

# cat /sys/kernel/debug/x86/topo/cpus/3839
initial_apicid: ff7
apicid: ff7
pkg_id: 31
die_id: 31
cu_id: 255
core_id: 59
logical_pkg_id: 31
logical_die_id: 31
llc_id: 3968
l2c_id: 4086
amd_node_id: 0
amd_nodes_per_pkg: 0
max_cores: 60
max_die_per_pkg: 1
smp_num_siblings: 2


We were also able to boot a 60-socket, 2160 cpu HPE SuperDome Flex (SkyLake)
with the patchset.

# cat /sys/kernel/debug/x86/topo/cpus/2159
initial_apicid: ef7
apicid: ef7
pkg_id: 59
die_id: 59
cu_id: 255
core_id: 27
logical_pkg_id: 59
logical_die_id: 59
llc_id: 3776
l2c_id: 3830
amd_node_id: 0
amd_nodes_per_pkg: 0
max_cores: 18
max_die_per_pkg: 1
smp_num_siblings: 2


On Mon, Jul 24, 2023 at 07:43:50PM +0200, Thomas Gleixner wrote:
> Hi!
> 
> A recent commit to the CPUID leaf 0xb/0x1f parser made me look deeper at
> the way how topology is evaluated. That "fix" is just yet another cure the
> sypmtom hack which completely ignores the underlying disaster.
> 
> The way how topology evaluation works is to overwrite the relevant
> variables as often as possible. E.g. smp_num_siblings gets overwritten a
> gazillion times, which is wrong to begin with. The boot CPU writes it 3
> times, each AP two times.
> 
> What's worse is that this just works by chance on hybrid systems due to the
> fact that the existing ones all seem to boot on a P-Core which has
> SMT. Would it boot on a E-Core which has no SMT, then parts of the early
> topology evaluation including the primary thread mask which is required for
> parallel CPU bringup would be completely wrong. Overwriting it later on
> with the correct value does not help at all.
> 
> What's wrong today with hybrid already is the number of cores per package.
> On an ADL with 8 P-Cores and 8 E-cores the resulting number of cores per
> package is evaluated to be 12. Which is not further surprising because the
> CPUID 0xb/0x1f parser looks at the number of logical processors at core
> level and divides them by the number of SMP siblings.
> 
>    24 / 2 = 12
> 
> Just that this CPU has obviously 16 cores not 12.
> 
> It's is even clearly documented in the SDM that this is wrong.
> 
>  "Bits 15-00: The number of logical processors across all instances of this
>   domain within the next higher- scoped domain relative to this current
>   logical processor. (For example, in a processor socket/package comprising
>   "M" dies of "N" cores each, where each core has "L" processors, the
>   "die" domain subleaf value of this field would be M*N*L. In an asymmetric
>   topology this would be the summation of the value across the lower domain
>   level instances to create each upper domain level instance.) This number
>   reflects configuration as shipped by Intel. Note, software must not use
>   this field to enumerate processor topology.
> 
>   Software must not use the value of EBX[15:0] to enumerate processor topology
>   of the system. The value is only intended for display and diagnostic purposes.
>   The actual number of logical processors available to BIOS/OS/Applications
>   may be different from the value of EBX[15:0], depending on software and
>   platform hardware configurations."
> 
> This "_NOT_ to use for topology evaluation" sentence existed even before
> hybrid came along and got ignored. The code worked by chance, but with
> hybrid all bets are off. The code completely falls apart once CPUID leaf
> 0x1f enumerates any topology level between CORE and DIE, but that's not a
> suprise.
> 
> The proper thing to do is to actually evaluate the full topology including
> the non-present (hotpluggable) CPUs based on the APICIDs which are provided
> by the firmware and a proper topology domain level parser. This can exactly
> tell the number of physical packages, logical packages etc. _before_ even
> booting a single AP. All of that can be evaluated upfront.
> 
> Aside of that there are too many places which do their own topology
> evaluation, but there is absolutely no central point which can actually
> provide all of that information in a consistent way. This needs to change.
> 
> This series implements another piece towards this: sane CPUID evaluation,
> which is done at _one_ place in a proper well defined order instead of
> having it sprinkled all over the CPUID evaluation code.
> 
> At the end of this series this is pretty much bug compatible with the
> current CPUID evaluation code in respect to the cores per package
> evaluation, but it gets rid of overwriting things like smp_num_siblings,
> which is now written once, but is still not capable to work correctly on a
> hybrid machine which boots from a non SMT core. These things can only be
> fixed up in the next step(s).
> 
> When I tried to go further with this I ran into yet another pile of
> historical layers of duct tape and haywire with a gazillion of random
> variables sprinkled all over the place. That's still work in progress.  It
> actually works, but the last step which switches over is not yet in a shape
> that can be easily reviewed. Stay tuned.
> 
> The series is based on the APIC cleanup series:
> 
>   https://lore.kernel.org/lkml/20230724131206.500814398@linutronix.de
> 
> and also available on top of that from git:
> 
>  git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git topo-cpuid-v1
> 
> Thanks,
> 
> 	tglx
> ---
>  arch/x86/kernel/cpu/topology.c              |  168 --------------------
>  b/Documentation/arch/x86/topology.rst       |   12 -
>  b/arch/x86/events/amd/core.c                |    2 
>  b/arch/x86/events/amd/uncore.c              |    2 
>  b/arch/x86/events/intel/uncore.c            |    2 
>  b/arch/x86/include/asm/apic.h               |    1 
>  b/arch/x86/include/asm/cacheinfo.h          |    3 
>  b/arch/x86/include/asm/cpuid.h              |   32 +++
>  b/arch/x86/include/asm/processor.h          |   58 ++++--
>  b/arch/x86/include/asm/smp.h                |    2 
>  b/arch/x86/include/asm/topology.h           |   51 +++++-
>  b/arch/x86/include/asm/x86_init.h           |    2 
>  b/arch/x86/kernel/amd_nb.c                  |    8 
>  b/arch/x86/kernel/apic/apic_flat_64.c       |    7 
>  b/arch/x86/kernel/apic/apic_noop.c          |    3 
>  b/arch/x86/kernel/apic/apic_numachip.c      |   11 -
>  b/arch/x86/kernel/apic/bigsmp_32.c          |    6 
>  b/arch/x86/kernel/apic/local.h              |    1 
>  b/arch/x86/kernel/apic/probe_32.c           |    6 
>  b/arch/x86/kernel/apic/x2apic_cluster.c     |    1 
>  b/arch/x86/kernel/apic/x2apic_phys.c        |    6 
>  b/arch/x86/kernel/apic/x2apic_uv_x.c        |   63 +------
>  b/arch/x86/kernel/cpu/Makefile              |    5 
>  b/arch/x86/kernel/cpu/amd.c                 |  156 ------------------
>  b/arch/x86/kernel/cpu/cacheinfo.c           |   51 ++----
>  b/arch/x86/kernel/cpu/centaur.c             |    4 
>  b/arch/x86/kernel/cpu/common.c              |  108 +-----------
>  b/arch/x86/kernel/cpu/cpu.h                 |   14 +
>  b/arch/x86/kernel/cpu/debugfs.c             |   98 +++++++++++
>  b/arch/x86/kernel/cpu/hygon.c               |  133 ---------------
>  b/arch/x86/kernel/cpu/intel.c               |   38 ----
>  b/arch/x86/kernel/cpu/mce/amd.c             |    4 
>  b/arch/x86/kernel/cpu/mce/apei.c            |    4 
>  b/arch/x86/kernel/cpu/mce/core.c            |    4 
>  b/arch/x86/kernel/cpu/mce/inject.c          |    7 
>  b/arch/x86/kernel/cpu/proc.c                |    8 
>  b/arch/x86/kernel/cpu/topology.h            |   51 ++++++
>  b/arch/x86/kernel/cpu/topology_amd.c        |  179 +++++++++++++++++++++
>  b/arch/x86/kernel/cpu/topology_common.c     |  233 ++++++++++++++++++++++++++++
>  b/arch/x86/kernel/cpu/topology_ext.c        |  136 ++++++++++++++++
>  b/arch/x86/kernel/cpu/zhaoxin.c             |   18 --
>  b/arch/x86/kernel/smpboot.c                 |   64 ++++---
>  b/arch/x86/kernel/vsmp_64.c                 |   13 -
>  b/arch/x86/mm/amdtopology.c                 |   35 +---
>  b/arch/x86/xen/apic.c                       |    8 
>  b/drivers/edac/amd64_edac.c                 |    4 
>  b/drivers/edac/mce_amd.c                    |    4 
>  b/drivers/gpu/drm/amd/amdkfd/kfd_topology.c |    2 
>  b/drivers/hwmon/fam15h_power.c              |    7 
>  b/drivers/scsi/lpfc/lpfc_init.c             |    8 
>  b/drivers/virt/acrn/hsm.c                   |    2 
>  51 files changed, 964 insertions(+), 881 deletions(-)
> 
> 
> 

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ