[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <BN6PR1201MB0131DADDCB1DE3C789EF2173F80A0@BN6PR1201MB0131.namprd12.prod.outlook.com>
Date: Wed, 5 Apr 2017 14:52:08 +0000
From: "Ghannam, Yazen" <Yazen.Ghannam@....com>
To: Borislav Petkov <bp@...en8.de>
CC: "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
Tony Luck <tony.luck@...el.com>,
"x86@...nel.org" <x86@...nel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v2 1/2] x86/mce/AMD: Redo use of SMCA MCA_DE{STAT,ADDR}
registers
> -----Original Message-----
> From: Borislav Petkov [mailto:bp@...en8.de]
> Sent: Wednesday, April 05, 2017 9:40 AM
> To: Ghannam, Yazen <Yazen.Ghannam@....com>
> Cc: linux-edac@...r.kernel.org; Tony Luck <tony.luck@...el.com>;
> x86@...nel.org; linux-kernel@...r.kernel.org
> Subject: Re: [PATCH v2 1/2] x86/mce/AMD: Redo use of SMCA
> MCA_DE{STAT,ADDR} registers
>
> On Tue, Apr 04, 2017 at 12:24:31PM -0500, Yazen Ghannam wrote:
> > From: Yazen Ghannam <yazen.ghannam@....com>
> >
> > We have support for the new SMCA MCA_DE{STAT,ADDR} registers in Linux.
> > So we've used these registers in place of MCA_{STATUS,ADDR} on SMCA
> systems.
> > However, the guidance for current implementations of SMCA is to
> > continue using MCA_{STATUS,ADDR} and to use MCA_DE{STAT,ADDR} only if
> > a Deferred error was not found in the former registers. This also
> > means we shouldn't clear MCA_CONFIG[LogDeferredInMcaStat].
>
> Does that mean we should trust what BIOS has set it to?
>
Yes, BIOS does and should set this bit. If it doesn't for some reason then our
handling in the Kernel will still be functional. We just won't find Deferred errors
in MCA_STATUS.
> >
> > Redo the AMD Deferred error interrupt handler to follow the guidance
> > for current SMCA systems. Also, don't break after finding the first error.
> >
> > Rework __log_error() for clarity.
> >
> > Don't clear MCA_CONFIG[LogDeferredInMcaStat] during AMD mcheck init.
> >
> > Signed-off-by: Yazen Ghannam <yazen.ghannam@....com>
> > ---
>
> ...
>
> > @@ -832,25 +821,29 @@ asmlinkage __visible void __irq_entry
> smp_trace_deferred_error_interrupt(void)
> > exiting_ack_irq();
> > }
> >
> > +static inline bool is_deferred_error(u64 status) {
> > + return ((status & MCI_STATUS_VAL) && (status &
> > +MCI_STATUS_DEFERRED)); }
> > +
> > /* APIC interrupt handler for deferred errors */ static void
> > amd_deferred_error_interrupt(void)
> > {
> > unsigned int bank;
> > - u32 msr_status;
> > u64 status;
> >
> > for (bank = 0; bank < mca_cfg.banks; ++bank) {
> > - msr_status = (mce_flags.smca) ?
> MSR_AMD64_SMCA_MCx_DESTAT(bank)
> > - : msr_ops.status(bank);
> > + rdmsrl(msr_ops.status(bank), status);
> >
> > - rdmsrl(msr_status, status);
> > + if (is_deferred_error(status)) {
> > + __log_error_deferred(bank, false);
> >
> > - if (!(status & MCI_STATUS_VAL) ||
> > - !(status & MCI_STATUS_DEFERRED))
> > - continue;
> > + } else if (mce_flags.smca) {
> > + rdmsrl(MSR_AMD64_SMCA_MCx_DESTAT(bank),
> status);
> >
> > - __log_error(bank, true, false, 0);
> > - break;
> > + if (is_deferred_error(status))
> > + __log_error_deferred(bank, true);
> > + }
>
> The diff is very hard to parse, lemme paste the whole resulting function after
> the patch:
>
> > /* APIC interrupt handler for deferred errors */ static void
> > amd_deferred_error_interrupt(void)
> > {
> > unsigned int bank;
> > u64 status;
> >
> > for (bank = 0; bank < mca_cfg.banks; ++bank) {
> > rdmsrl(msr_ops.status(bank), status);
> >
> > if (is_deferred_error(status)) {
> > __log_error_deferred(bank, false);
> >
> > } else if (mce_flags.smca) {
> > rdmsrl(MSR_AMD64_SMCA_MCx_DESTAT(bank),
> > status);
> >
> > if (is_deferred_error(status))
> > __log_error_deferred(bank, true);
>
> So this doesn't sound like the guidance: if we should fallback to the
> DE* versions only if a deferred error wasn't found in the usual regs, the logic
> should be:
>
> static void __log_error_deferred(unsigned int bank) {
> u32 msr_stat = msr_ops.status(bank);
> u32 msr_addr = msr_ops.addr(bank);
>
> /*
> * __log_error() needs to return true to say it has logged successfully
> or false
> * if it hasn't.
> */
> if (__log_error(bank, msr_stat, msr_addr, 0))
> goto out;
>
I'd rather we keep as many checks as possible out of __log_error().
> if (!mce_flags.smca)
> return;
>
> /*
> * Couldn't log a deferred error from the usual regs, fallback to DE*
> * variants.
> */
> msr_stat = MSR_AMD64_SMCA_MCx_DESTAT(bank);
> msr_addr = MSR_AMD64_SMCA_MCx_DEADDR(bank);
>
> __log_error(bank, msr_stat, msr_addr, 0);
>
> out:
> /* We should still clear MCA_DESTAT even if we used MCA_STATUS. */
> if (mce_flags.smca)
> wrmsrl(MSR_AMD64_SMCA_MCx_DESTAT(bank), 0); }
>
> and this way you don't need to wiggle around that use_smca_destat thing.
>
Your suggestion gave me an idea. Let's drop __log_error_deferred() and just
select the correct registers in the deferred error interrupt handler.
/*
* APIC interrupt handler for deferred errors
*
* We have three scenarios for checking for Deferred errors.
* 1) Non-SMCA systems check MCA_STATUS and log error if found.
* 2) SMCA systems check MCA_STATUS. If error is found then log it and also
* clear MCA_DESTAT.
* 3) SMCA systems check MCA_DESTAT, if error was not found in MCA_STATUS, and
* log it.
*/
static void amd_deferred_error_interrupt(void)
{
unsigned int bank;
u64 status;
for (bank = 0; bank < mca_cfg.banks; ++bank) {
rdmsrl(msr_ops.status(bank), status);
if (is_deferred_error(status)) {
__log_error(bank, msr_ops.status(bank), msr_ops.addr(bank), 0);
/* Clear MCA_DESTAT even if we used MCA_STATUS. */
if (mce_flags.smca)
wrmsrl(MSR_AMD64_SMCA_MCx_DESTAT(bank), 0);
} else if (mce_flags.smca) {
rdmsrl(MSR_AMD64_SMCA_MCx_DESTAT(bank), status);
if (is_deferred_error(status))
__log_error(bank, MSR_AMD64_SMCA_MCx_DESTAT(bank), MSR_AMD64_SMCA_MCx_DEADDR(bank), 0);
}
}
}
What do you think?
Thanks,
Yazen
Powered by blists - more mailing lists