lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <BYAPR21MB168857E160C70BC76665FE6FD705A@BYAPR21MB1688.namprd21.prod.outlook.com>
Date:   Mon, 31 Jul 2023 16:25:41 +0000
From:   "Michael Kelley (LINUX)" <mikelley@...rosoft.com>
To:     Thomas Gleixner <tglx@...utronix.de>,
        LKML <linux-kernel@...r.kernel.org>
CC:     "x86@...nel.org" <x86@...nel.org>,
        Tom Lendacky <thomas.lendacky@....com>,
        Andrew Cooper <andrew.cooper3@...rix.com>,
        Arjan van de Ven <arjan@...ux.intel.com>,
        "James E.J. Bottomley" <jejb@...ux.ibm.com>,
        Dick Kennedy <dick.kennedy@...adcom.com>,
        James Smart <james.smart@...adcom.com>,
        "Martin K. Petersen" <martin.petersen@...cle.com>,
        "linux-scsi@...r.kernel.org" <linux-scsi@...r.kernel.org>,
        Guenter Roeck <linux@...ck-us.net>,
        "linux-hwmon@...r.kernel.org" <linux-hwmon@...r.kernel.org>,
        Jean Delvare <jdelvare@...e.com>,
        Huang Rui <ray.huang@....com>, Juergen Gross <jgross@...e.com>,
        Steve Wahl <steve.wahl@....com>,
        Mike Travis <mike.travis@....com>,
        Dimitri Sivanich <dimitri.sivanich@....com>,
        Russ Anderson <russ.anderson@....com>
Subject: RE: [patch v2 21/38] x86/cpu: Provide cpu_init/parse_topology()

From: Thomas Gleixner <tglx@...utronix.de> Sent: Monday, July 31, 2023 5:35 AM
> 
> On Mon, Jul 31 2023 at 04:05, Michael Kelley wrote:
> >> +	/*
> >> +	 * The initial invocation from early_identify_cpu() happens before
> >> +	 * the APIC is mapped or X2APIC enabled. For establishing the
> >> +	 * topology, that's not required. Use the initial APIC ID.
> >> +	 */
> >> +	if (early)
> >> +		c->topo.apicid = c->topo.initial_apicid;
> >> +	else
> >> +		c->topo.apicid = read_apic_id();
> >
> > Using the value from the local APIC ID reg turns out to cause a problem in
> > some Hyper-V VM configurations.  If a VM has multiple L3 caches (probably
> > due to multiple NUMA nodes) and the # of CPUs in the span of the L3 cache
> > is not a power of 2, the APIC IDs for the CPUs in the span of the 1st L3 cache
> > are sequential starting with 0.  But then there is a gap before starting the
> > APIC IDs for the CPUs in the span of the 2nd L3 cache.  The gap is
> > repeated if there are additional L3 caches.
> >
> > The CPUID instruction executed on a guest vCPU correctly reports the APIC
> > IDs.  However, the ACPI MADT assigns the APIC IDs sequentially with no
> > gaps, and the guest firmware sets the APIC_ID register for each local APIC
> > to match the MADT.  When parse_topology() sets the apicid field based on
> > reading the local APIC ID register, the value it sets is different from the
> > initial_apicid value for CPUs in the span of the 2nd and subsequent L3
> > caches, because there's no gap in the APIC IDs read from the local APIC.
> > Linux boots and runs, but the topology is set up with the wrong span for
> > the L3 cache and for the associated scheduling domains.
> 
> TBH. That's an insanity. MADT and the actual APIC ID determine the
> topology. So the gaps should be reflected in MADT and the actual APIC
> IDs should be set correctly if the intent is to provide topology
> information.
> 
> Just for the record. This hack works only on Intel today, because AMD
> init sets topo.apicid = read_apic_id() unconditionally. So this is
> inconsistent already, no?
> 

Correct.  But given that the L3 cache span in the AMD Zen1 and Zen2
processors is only 8 CPUs, there's much less reason to configure a VM
that only uses some of the CPUs in an L3 cache span.  Hyper-V does
the APIC ID numbering correctly for Zen3 with its 16 CPUs in the L3
cache span.

> > The old code derives the apicid from the initial_apicid via the
> > phys_pkg_id() callback, so these bad Hyper-V VM configs skate by.  The
> > wrong value in the local APIC ID register and MADT does not affect
> > anything, except that the check in validate_apic_and_package_id() fails
> > during boot, and a set of "Firmware bug:" messages is correctly
> > output.
> 
> So instead of fixing the firmware bugs, hyper-v just moves on and
> pretends that everything works fine, right?

What can I say. :-(

> 
> > Three thoughts:
> >
> > 1)  Are Hyper-V VMs the only place where the local APIC ID register might
> > have a bogus value?  Probably so, but you never know what might crawl
> > out.
> 
> Define bogus. MADT is the primary source of information because that's
> how we know how many CPUs (APICs) are there and what their APIC ID is
> which we can use to wake them up. So there is a reasonable expectation
> that this information is consistent with the rest of the system.

Commit d49597fd3bc7 "x86/cpu: Deal with broken firmware (VMWare/Xen)"
mentions VMware and XEN implementations that violate the spec.  The
commit is from late 2016.  Have these bad systems aged out and no longer
need accommodation?

> 
> The Intel SDM clearly says in Vol 3A section 9.4.5 Identifying Logical
> Processors in an MP System:
> 
>   "After the BIOS has completed the MP initialization protocol, each
>    logical processor can be uniquely identified by its local APIC
>    ID. Software can access these APIC IDs in either of the following
>    ways:"
> 
> These ways include read from APIC, read MADT, read CPUID and implies
> that this must be consistent. For X2APIC it's actually written out:
> 
>   "If the local APIC unit supports x2APIC and is operating in x2APIC
>    mode, 32-bit APIC ID can be read by executing a RDMSR instruction to
>    read the processor’s x2APIC ID register. This method is equivalent to
>    executing CPUID leaf 0BH described below."
> 
> AMD has not been following that in the early 64bit systems as they moved
> the APIC ID space to start at 32 for the first CPU in the first socket
> for whatever reasons. But since then the kernel reads back the APIC ID
> on AMD systems into topo.apicid. But that was long ago and can easily be
> dealt with because at least the real APIC ID and the MADT/MPTABLE
> entries are consistent.
> 
> Hypervisors have their own CPUID space to override functionality with
> their own magic stuff, but imposing their nutbolt ideas on the
> architectural part of the system is not only wrong, it's disrespectful
> against the OS developers who try to keep their system sane.
> 
> > 2) The natural response is "Well, fix Hyper-V!"  I first had this conversation
> > with the Hyper-V team about 5 years ago.  Some cases of the problem were
> > fixed, but some cases remain unfixed.  It's a long story.
> >
> > 3)  Since Hyper-V code in Linux already has an override for the apic->read()
> > function, it's possible to do a hack in that override so that apicid gets set to
> > the same value as initial_apicid, which matches the old code.  Here's the diff:
> 
> This collides massively with the other work I'm doing, which uses the
> MADT provided information to actually evaluate various topology related
> things upfront and later during bringup. Thats badly needed because lots
> of todays infrastructure is based on heuristics and guesswork.

Fair enough.  And I've re-raised the issue with the Hyper-V team.

> 
> But it seems I wasted a month on reworking all of this just to be
> stopped cold in the tracks by completely undocumented and unnecessary
> hyper-v abuse.
> 
> So if Hyper-V insists on abusing the initial APIC ID as read from CPUID
> for topology information related to L3, then hyper-v should override the
> cache topology mechanism and not impose this insanity on the basic
> topology evaluation infrastructure.
> 
> Yours seriously grumpy
> 
>       tglx

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ