[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20251103143851.GA8673@yaz-khff2.amd.com>
Date: Mon, 3 Nov 2025 09:38:51 -0500
From: Yazen Ghannam <yazen.ghannam@....com>
To: Michal Pecio <michal.pecio@...il.com>
Cc: x86@...nel.org, regressions@...ts.linux.dev,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>, Borislav Petkov <bp@...en8.de>,
Mario Limonciello <mario.limonciello@....com>,
Eric DeVolder <eric.devolder@...cle.com>,
linux-kernel@...r.kernel.org
Subject: Re: AMD topology broken on various 754/AM2+/AM3/AM3+ systems causes
NB/EDAC/GART regression since 6.14
On Mon, Nov 03, 2025 at 08:40:21AM +0100, Michal Pecio wrote:
> On Fri, 24 Oct 2025 17:32:04 -0400, Yazen Ghannam wrote:
> > So far, I think the way to go is add explicit quirk for known issues.
> >
> > Please see the patch below.
> >
> > Thanks,
> > Yazen
> >
> >
> > From eeb0745e973055d8840b536cfa842d6f2bf4ac52 Mon Sep 17 00:00:00 2001
> > From: Yazen Ghannam <yazen.ghannam@....com>
> > Date: Fri, 24 Oct 2025 21:19:26 +0000
> > Subject: [PATCH] x86/topology: Add helper to ignore bogus MADT entries
> >
> > Some older Intel and AMD systems include bogus ACPI MADT entries. These
> > entries show as "disabled". And it's not clear if they are physically
> > present but offline, i.e halted. Or if they are not physically present
> > at all.
> >
> > Ideally, if they are not physically present, then they should not be
> > listed in MADT. There doesn't seem to be any explicit x86 topology info
> > that can be used to verify if the entries are bogus or not.
> >
> > Add a helper function to collect vendor-specific checks to ignore bogus
> > APIC IDs. Start with known quirks for an Intel SNB model and older AMD
> > K10 models.
> >
> > Fixes: f0551af02130 ("x86/topology: Ignore non-present APIC IDs in a present package")
> > Signed-off-by: Yazen Ghannam <yazen.ghannam@....com>
> > ---
> > arch/x86/kernel/cpu/topology.c | 52 ++++++++++++++++++++++++++--------
> > 1 file changed, 40 insertions(+), 12 deletions(-)
> >
> > diff --git a/arch/x86/kernel/cpu/topology.c b/arch/x86/kernel/cpu/topology.c
> > index 6073a16628f9..704788b92395 100644
> > --- a/arch/x86/kernel/cpu/topology.c
> > +++ b/arch/x86/kernel/cpu/topology.c
> > @@ -219,6 +219,45 @@ static unsigned int topo_unit_count(u32 lvlid, enum x86_topology_domains at_leve
> > return cnt;
> > }
> >
> > +/*
> > + * Some older BIOSes include extra entries in MADT.
> > + * Do some vendor-specific checks to ignore them.
> > + */
> > +static bool ignore_extra_apic_entry(u32 apic_id)
> > +{
> > + u32 pkgid = topo_apicid(apic_id, TOPO_PKG_DOMAIN);
> > + struct cpuinfo_x86 *c = &boot_cpu_data;
> > +
> > + /* Allow "physically not possible" cases if in a guest. */
> > + if (!hypervisor_is_type(X86_HYPER_NATIVE))
> > + return false;
> > +
> > + /* This model only supports 8 threads in a package. */
> > + if (c->x86_vendor == X86_VENDOR_INTEL &&
> > + c->x86 == 0x6 && c->x86_model == 0x2d) {
> > + if (topo_unit_count(pkgid, TOPO_PKG_DOMAIN, phys_cpu_present_map) >= 8)
>
> Looks like possible functional change compared to the code it replaces,
> perhaps it should be a separate patch?
>
Yes, that's right. I just wanted to put everything together as a quick
check. We can clean it up if things work.
> > + goto reject;
> > + }
> > +
> > + /*
> > + * Various older models have extra entries. A common trait is that the
> > + * package ID derived from the APIC ID would be more than was ever supported.
> > + */
> > + if (c->x86_vendor == X86_VENDOR_AMD && c->x86 < 0x17) {
> > + pkgid >>= x86_topo_system.dom_shifts[TOPO_PKG_DOMAIN - 1];
> > +
> > + if (pkgid >= 8)
> > + goto reject;
>
> Yes, this works. The excess entries are counted as "rejected" rather
> than "hotpluggable", package count is correct and EDAC/GART are back.
>
Cool, thanks for confirming.
> But to be exact, the known case is (apic_id >= 0x80) rather than
> (pkgid >= 8). The latter assumes not only that there are no more than
> 8 packages, but also that their IDs start from zero.
>
Yes, that's right. I see these checks as quirks for old systems only. We
just need catch the bogus cases. These shouldn't be generally
applicable. Though it'd be good to catch as many cases in as few
conditions as possible. My thinking is we start with more narrow
conditions and expand them if there are other reported cases.
> I have this AM4 system with some proprietary HP BIOS:
>
> [02Fh 0047 001h] Local Apic ID : 10
> [037h 0055 001h] Local Apic ID : 11
> [03Fh 0063 001h] Local Apic ID : 12
> [047h 0071 001h] Local Apic ID : 13
>
> domain: Thread shift: 0 dom_size: 1 max_threads: 1
> domain: Core shift: 4 dom_size: 16 max_threads: 16
> domain: Module shift: 4 dom_size: 1 max_threads: 16
> domain: Tile shift: 4 dom_size: 1 max_threads: 16
> domain: Die shift: 4 dom_size: 1 max_threads: 16
> domain: DieGrp shift: 4 dom_size: 1 max_threads: 16
> domain: Package shift: 4 dom_size: 1 max_threads: 16
>
> It seems that pkgid is 0x1 here, which is not a problem because
> it's single socket, but dunno if HP or somebody else couldn't do
> similar things in an 8-socket system and end up with pkgid > 8.
>
So is this another bogus case?
And, if so, it works because of the "ignore hotpluggable in present
package" check?
>
> By the way, do you know what's the reason I don't have /sys directories
> for those phantom CPUs (before this patch) and should I have them if
> they were legitimate hotplug CPUs?
>
> Maybe there is already some check performed in other part of the kernel
> which rejects those CPUs and which could be replicated here.
>
Sorry, I don't know. Maybe others can comment if they have any pointers.
Thanks,
Yazen
Powered by blists - more mailing lists