[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ac85fe73b3d7f46ce96d5033f8cf58a1b6b001f3.camel@intel.com>
Date: Tue, 15 Oct 2024 03:23:27 +0000
From: "Zhang, Rui" <rui.zhang@...el.com>
To: "jmattson@...gle.com" <jmattson@...gle.com>
CC: "ajorgens@...gle.com" <ajorgens@...gle.com>, "myrade@...gle.com"
<myrade@...gle.com>, "bp@...en8.de" <bp@...en8.de>, "x86@...nel.org"
<x86@...nel.org>, "peterz@...radead.org" <peterz@...radead.org>, "Tang, Feng"
<feng.tang@...el.com>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "tglx@...utronix.de" <tglx@...utronix.de>,
"Wysocki, Rafael J" <rafael.j.wysocki@...el.com>,
"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>, "jay.chen@....com"
<jay.chen@....com>, "vladteodor@...gle.com" <vladteodor@...gle.com>,
"jon.grimm@....com" <jon.grimm@....com>
Subject: Re: [RFC PATCH] x86/acpi: Ignore invalid x2APIC entries
On Mon, 2024-10-14 at 11:00 -0700, Jim Mattson wrote:
> On Mon, Oct 14, 2024 at 6:05 AM Zhang, Rui <rui.zhang@...el.com>
> wrote:
> >
> > > > >
> > > > > TBH, I'm not sure that there is actually anything wrong with
> > > > > the
> > > > > new
> > > > > numbering scheme.
> > > > > The topology is reported correctly (e.g. in
> > > > > /sys/devices/system/cpu/cpu0/topology/thread_siblings_list).
> > > > > Yet,
> > > > > the
> > > > > new enumeration does seem to contradict user expectations.
> > > > >
> > > >
> > > > Well, we can say this is a violation of the ACPI spec.
> > > > "OSPM should initialize processors in the order that they
> > > > appear in
> > > > the
> > > > MADT." even for interleaved LAPIC and X2APIC entries.
> > >
> > > Ah. Thanks. I didn't know that.
> > >
> > > > Maybe we need two steps for LAPIC/X2APIC parsing.
> > > > 1. check if there is valid LAPIC entry by going through all
> > > > LAPIC
> > > > entries first
> > > > 2. parse LAPIC/X2APIC strictly following the order in MADT.
> > > > (like
> > > > we do
> > > > before)
> > >
> > > That makes sense to me.
> > >
> > > Thanks,
> > >
> > > --jim
> >
> > Hi, Jim,
> >
> > Please check if below patch restores the CPU IDs or not.
> >
> > thanks,
> > rui
> >
> > From ec786dfe693cad2810b54b0d8afbfc7e4c4b3f8a Mon Sep 17 00:00:00
> > 2001
> > From: Zhang Rui <rui.zhang@...el.com>
> > Date: Mon, 14 Oct 2024 13:26:55 +0800
> > Subject: [PATCH] x86/acpi: Fix LAPIC/x2APIC parsing order
> >
> > On some systems, the same CPU (with same APIC ID) is assigned with
> > a
> > different logical CPU id after commit ec9aedb2aa1a ("x86/acpi:
> > Ignore
> > invalid x2APIC entries").
> >
> > This means Linux enumerates the CPUs in a different order and it is
> > a
> > violation of
> > https://uefi.org/specs/ACPI/6.5/05_ACPI_Software_Programming_Model.html#madt-processor-local-apic-sapic-structure-entry-order
> > ,
> >
> > "OSPM should initialize processors in the order that they appear
> > in
> > the MADT"
> >
> > The offending commit wants to ignore x2APIC entries with APIC ID <
> > 255
> > when valid LAPIC entries exist, so it parses all LAPIC entries
> > before
> > parsing any x2APIC entries. This breaks the CPU enumeration order
> > for
> > systems that have x2APIC entries listed before LAPIC entries in
> > MADT.
> >
> > Fix the problem by checking the valid LAPIC entries separately,
> > before
> > parsing any LAPIC/x2APIC entries.
> >
> > Cc: stable@...r.kernel.org
> > Reported-by: Jim Mattson <jmattson@...gle.com>
> > Closes:
> > https://lore.kernel.org/all/20241010213136.668672-1-jmattson@google.com/
> > Fixes: ec9aedb2aa1a ("x86/acpi: Ignore invalid x2APIC entries")
> > Signed-off-by: Zhang Rui <rui.zhang@...el.com>
> > ---
> > arch/x86/kernel/acpi/boot.c | 50
> > +++++++++++++++++++++++++++++++++----
> > 1 file changed, 45 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/x86/kernel/acpi/boot.c
> > b/arch/x86/kernel/acpi/boot.c
> > index 4efecac49863..c70b86f1f295 100644
> > --- a/arch/x86/kernel/acpi/boot.c
> > +++ b/arch/x86/kernel/acpi/boot.c
> > @@ -226,6 +226,28 @@ acpi_parse_x2apic(union acpi_subtable_headers
> > *header, const unsigned long end)
> > return 0;
> > }
> >
> > +static int __init
> > +acpi_check_lapic(union acpi_subtable_headers *header, const
> > unsigned long end)
> > +{
> > + struct acpi_madt_local_apic *processor = NULL;
> > +
> > + processor = (struct acpi_madt_local_apic *)header;
> > +
> > + if (BAD_MADT_ENTRY(processor, end))
> > + return -EINVAL;
> > +
> > + /* Ignore invalid ID */
> > + if (processor->id == 0xff)
> > + return 0;
> > +
> > + /* Ignore processors that can not be onlined */
> > + if (!acpi_is_processor_usable(processor->lapic_flags))
> > + return 0;
> > +
> > + has_lapic_cpus = true;
> > + return 0;
> > +}
> > +
> > static int __init
> > acpi_parse_lapic(union acpi_subtable_headers * header, const
> > unsigned long end)
> > {
> > @@ -257,7 +279,6 @@ acpi_parse_lapic(union acpi_subtable_headers *
> > header, const unsigned long end)
> > processor->processor_id, /* ACPI ID
> > */
> > processor->lapic_flags &
> > ACPI_MADT_ENABLED);
> >
> > - has_lapic_cpus = true;
> > return 0;
> > }
> >
> > @@ -1029,6 +1050,8 @@ static int __init
> > early_acpi_parse_madt_lapic_addr_ovr(void)
> > static int __init acpi_parse_madt_lapic_entries(void)
> > {
> > int count, x2count = 0;
> > + struct acpi_subtable_proc madt_proc[2];
> > + int ret;
> >
> > if (!boot_cpu_has(X86_FEATURE_APIC))
> > return -ENODEV;
> > @@ -1037,10 +1060,27 @@ static int __init
> > acpi_parse_madt_lapic_entries(void)
> > acpi_parse_sapic,
> > MAX_LOCAL_APIC);
> >
> > if (!count) {
> > - count =
> > acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC,
> > - acpi_parse_lapic,
> > MAX_LOCAL_APIC);
> > - x2count =
> > acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_X2APIC,
> > - acpi_parse_x2apic,
> > MAX_LOCAL_APIC);
>
> The point is moot now, but I don't think the previous code did the
> right thing when acpi_table_parse_madt() returned a negative value
> (for errors).
Previous and current code checks for the negative value later after
parsing both LAPIC and x2APIC.
so what is the problem you're referring to?
Do you mean we should error out immediately when parsing LAPIC fails?
>
> > + /* Check if there are valid LAPIC entries */
> > + acpi_table_parse_madt(ACPI_MADT_TYPE_LOCAL_APIC,
> > acpi_check_lapic, MAX_LOCAL_APIC);
>
> Two comments:
>
> 1) Should we check for a return value < 0 here, or just wait for one
> of the later walks to error out?
I'm okay with both.
> 2) It seems unfortunate to walk the entire table when the first entry
> may give you the answer, but perhaps modern systems have only X2APIC
> entries, so we will typically have to walk the entire table anyway.
yeah. There are systems with invalid LAPIC entries first, and
acpi_parse_entries_array() doesn't support graceful early termination,
so we have to check all the entries.
>
> Reviewed-and-tested-by: Jim Mattson <jmattson@...gle.com>
Thanks. I will submit the current version to keep your tags.
-rui
Powered by blists - more mailing lists