[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <e0d10606-4472-4cde-b55d-34180efad42b@amd.com>
Date: Thu, 25 Apr 2024 10:12:44 -0400
From: Yazen Ghannam <yazen.ghannam@....com>
To: Borislav Petkov <bp@...en8.de>, robert.richter@....com
Cc: yazen.ghannam@....com, linux-edac@...r.kernel.org,
linux-kernel@...r.kernel.org, tony.luck@...el.com, x86@...nel.org,
Avadhut.Naik@....com, John.Allen@....com
Subject: Re: [PATCH v2 07/16] x86/mce/amd: Simplify DFR handler setup
On 4/24/2024 3:06 PM, Borislav Petkov wrote:
> On Thu, Apr 04, 2024 at 10:13:50AM -0500, Yazen Ghannam wrote:
>> AMD systems with the SUCCOR feature can send an APIC LVT interrupt for
>> deferred errors. The LVT offset is 0x2 by convention, i.e. this is the
>> default as listed in hardware documentation.
>>
>> However, the MCA registers may list a different LVT offset for this
>> interrupt. The kernel should honor the value from the hardware.
>
> There's this "may" thing again.
>
Right, I should say "the microarchitecture allows it". :)
> Is this enablement for some future hw too or do you really trust the
> value in MSR_CU_DEF_ERR is programmed correctly in all cases?
>
I trust the value from hardware.
The intention here is to simplify the code for general maintenance and to make
later patches easier.
>> Simplify the enable flow by using the hardware-provided value. Any
>> conflicts will be caught by setup_APIC_eilvt(). Conflicts on production
>> systems can be handled as quirks, if needed.
>
> Well, which systems support succor?
>
> I'd like to test this on them before we face all the quirkery. :)
>
All Zen/SMCA systems. I don't recall any issues in this area.
Some later Family 15h systems (Carrizo?) had it. But I don't know if it was
used in production. It was slightly before my time.
> That area has been plagued by hw snafus if you look at
> setup_APIC_eilvt() and talk to uncle Robert. :-P
>
Right, I found this:
27afdf2008da ("apic, x86: Use BIOS settings for IBS and MCE threshold
interrupt LVT offsets")
Which is basically the same idea: use what is in the register.
But it looks there was an issue with IBS on Family 10h.
Is this the only case of a real issue? If so, then why apply this method to
the THR and DFR interrupts?
Robert, were there any other issues?
>> @@ -595,17 +584,15 @@ static void deferred_error_interrupt_enable(struct cpuinfo_x86 *c)
>> if (rdmsr_safe(MSR_CU_DEF_ERR, &low, &high))
>> return;
>>
>> + /*
>> + * Trust the value from hardware.
>> + * If there's a conflict, then setup_APIC_eilvt() will throw an error.
>> + */
>> def_new = (low & MASK_DEF_LVTOFF) >> 4;
>> - if (!(low & MASK_DEF_LVTOFF)) {
>> - pr_err(FW_BUG "Your BIOS is not setting up LVT offset 0x2 for deferred error IRQs correctly.\n");
>> - def_new = DEF_LVT_OFF;
>> - low = (low & ~MASK_DEF_LVTOFF) | (DEF_LVT_OFF << 4);
>> - }
>> + if (setup_APIC_eilvt(def_new, DEFERRED_ERROR_VECTOR, APIC_EILVT_MSG_FIX, 0))
>> + return;
>>
>> - def_offset = setup_APIC_deferred_error(def_offset, def_new);
>> - if ((def_offset == def_new) &&
>> - (deferred_error_int_vector != amd_deferred_error_interrupt))
>> - deferred_error_int_vector = amd_deferred_error_interrupt;
>
> There was a reason for that - deferred_error_int_vector is a global var
> and you're calling enable_deferred_error_interrupt() on each CPU.
>
Right, and all CPUs should use the same APIC LVT offset. If they differ, then
setup_APIC_eilvt() will fail above and return.
Why check "if X != Y, then X = Y"? Why not just unconditionally do "X = Y"?
>> + deferred_error_int_vector = amd_deferred_error_interrupt;
>>
>> if (!mce_flags.smca)
>> low = (low & ~MASK_DEF_INT_TYPE) | DEF_INT_TYPE_APIC;
>
> Thx.
>
Thanks,
Yazen
Powered by blists - more mailing lists