Open Source and information security mailing list archives
 
Message-ID: <a0242c82631b19f3de7a223d8dd38f21308cd3cc.camel@intel.com>
Date:   Sat, 18 Feb 2023 16:11:33 +0000
From:   "Zhang, Rui" <rui.zhang@...el.com>
To:     "tglx@...utronix.de" <tglx@...utronix.de>,
        "hpa@...or.com" <hpa@...or.com>,
        "mingo@...hat.com" <mingo@...hat.com>,
        "Hansen, Dave" <dave.hansen@...el.com>,
        "bp@...en8.de" <bp@...en8.de>,
        "dave.hansen@...ux.intel.com" <dave.hansen@...ux.intel.com>
CC:     "Brown, Len" <len.brown@...el.com>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "zhang.jia@...ux.alibaba.com" <zhang.jia@...ux.alibaba.com>,
        "x86@...nel.org" <x86@...nel.org>
Subject: Re: [PATCH 1/1] x86/topology: fix erroneous smp_num_siblings on Intel Hybrid platform

Hi, Dave,

Thanks for reviewing this.

On Fri, 2023-02-17 at 10:03 -0800, Dave Hansen wrote:
> On 2/17/23 08:37, Zhang Rui wrote:
> > The SMT siblings value returned by CPUID.1F SMT level EBX differs
> > among CPUs on Intel Hybrid platforms like AlderLake and MeteorLake.
> > It returns 2 for Pcore CPUs which have SMT siblings and returns 1 for
> > Ecore CPUs which do not have SMT siblings.
> > 
> > Today, the CPU boot code sets the global variable smp_num_siblings
> > when every CPU thread is brought up. The last thread to boot will
> > overwrite it with the number of siblings of *that* thread. That last
> > thread to boot will "win". If the thread is a Pcore,
> > smp_num_siblings == 2. If it is an Ecore, smp_num_siblings == 1.
> > 
> > smp_num_siblings describes whether the *system* supports SMT. It
> > should specify the maximum number of SMT threads among all cores.
> 
> I was with you until here, but I'm having a hard time parsing this:
> 
> > On AlderLake-P/S platforms, this has not caused any functional issues
> > so far. But on the MeteorLake-P platform, when probing an Ecore CPU:
> > a) smp_num_siblings varies as on AlderLake and is set to 1 for the
> > Ecore.
> > b) x86_max_cores is totally broken and is set to 1 for the boot cpu.
> > Together, these two issues cause the system to be treated as a UP
> > system in set_cpu_sibling_map() when probing Ecore CPUs, and the Ecore
> > CPUs are erroneously not added to any cpu sibling maps.
> 
> Let's try and focus this changelog on the problem at hand which is a
> broken smp_num_siblings on MeteorLake.  Right?

Yes, I totally agree with this.

But when showing the problem below (in the CPU topology info and the
lscpu output), I want to make two things clear:
1. there are two bugs, and *both* of them are required to trigger the
   problem
2. this patch fixes only one of them

Do you mean that I don't need to mention the x86_max_cores issue here?

> 
> > The following shows part of the CPU topology information before and
> > after the fix, for both a Pcore and an Ecore CPU (cpu0 is a Pcore,
> > cpu12 is an Ecore).
> > ...
> > -/sys/devices/system/cpu/cpu0/topology/package_cpus:000fff
> > -/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-11
> > +/sys/devices/system/cpu/cpu0/topology/package_cpus:3fffff
> > +/sys/devices/system/cpu/cpu0/topology/package_cpus_list:0-21
> > ...
> > -/sys/devices/system/cpu/cpu12/topology/package_cpus:001000
> > -/sys/devices/system/cpu/cpu12/topology/package_cpus_list:12
> > +/sys/devices/system/cpu/cpu12/topology/package_cpus:3fffff
> > +/sys/devices/system/cpu/cpu12/topology/package_cpus_list:0-21
> > 
> > This also breaks userspace tools like lscpu:
> > -Core(s) per socket:  1
> > -Socket(s):           11
> > +Core(s) per socket:  16
> > +Socket(s):           1
> 
> Heh, yeah, 11 sockets is a tiny bug.
> 
> > To fix the first issue, ensure that smp_num_siblings represents the
> > system-wide maximum number of siblings by only ever increasing its
> > value. Never allow it to decrease.
> > 
> > Note that this fix is sufficient to make set_cpu_sibling_map() work
> > correctly. Fixing the bogus cpuinfo_x86.x86_max_cores will be
> > addressed separately.
> 
> Having this note here is probably OK.  But I'm not sure even mentioning
> x86_max_cores is worth it.  Doesn't it just add confusion?

Even without the CPU topology problem we are seeing, changing
smp_num_siblings from a larger value to a smaller one is wrong. So I
can remove the x86_max_cores note from the changelog.

thanks,
rui
