[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <BN6PR1201MB01317CEF66FCA08CF2843663F80C0@BN6PR1201MB0131.namprd12.prod.outlook.com>
Date: Fri, 7 Apr 2017 20:37:03 +0000
From: "Ghannam, Yazen" <Yazen.Ghannam@....com>
To: Borislav Petkov <bp@...en8.de>
CC: "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
Tony Luck <tony.luck@...el.com>,
"x86@...nel.org" <x86@...nel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: RE: [PATCH v2 1/2] x86/mce/AMD: Redo use of SMCA MCA_DE{STAT,ADDR}
registers
> -----Original Message-----
> From: Borislav Petkov [mailto:bp@...en8.de]
> Sent: Wednesday, April 05, 2017 4:05 PM
> To: Ghannam, Yazen <Yazen.Ghannam@....com>
> Cc: linux-edac@...r.kernel.org; Tony Luck <tony.luck@...el.com>;
> x86@...nel.org; linux-kernel@...r.kernel.org
> Subject: Re: [PATCH v2 1/2] x86/mce/AMD: Redo use of SMCA
> MCA_DE{STAT,ADDR} registers
>
> On Wed, Apr 05, 2017 at 07:29:48PM +0000, Ghannam, Yazen wrote:
> > If it's set, then I expect a Deferred error in MCA_STATUS since any
> > Correctable Errors will be overwritten. Multiple bank types can
> > generate Deferred errors so there may also be cases where for some
> > types a valid Uncorrectable error happens and overwrites the Deferred
> > error before we can handle it. In which case we lose the Deferred error if
> we don't check MCA_DESTAT.
>
> So if we have an UE, wouldn't that raise an #MC? I guess in such cases we
> should concentrate only on the deferred errors and let the #MC handler deal
> with them. As we do now.
>
Right.
> > If it's not set, then it's possible to have a valid Correctable error
> > in MCA_STATUS while the valid Deferred error is in MCA_DESTAT.
>
> What's logging the CE? We probably should log it too before something
> overwrites it.
>
CEs will get picked up by polling or if we hit a threshold.
> Anyway, ok, I think I know what needs to happen now:
>
> amd_deferred_error_interrupt:
>
> if (__log_error_deferred(bank))
> return;
>
> This one read MC?_STATUS and does the logging for when the deferred error
> is in the normal MSRs. It returns true if it succeeded. It reads and hands down
> both MC?_STATUS and MC?_ADDR to __log_error() so that it doesn't have to
> read MC?_STATUS twice.
>
Okay. But do we need a return value? I'm thinking we can go through all banks
and log any and all Deferred errors rather than just the first one we find. I asked
and this is the preferred method like how we do in the #MC handler. The same
applies to the thresholding interrupt handler.
> If __log_error_deferred() has read a different type of error, we still log it? I'm
> not sure about this. I guess we can ignore that case for now.
>
Yeah, we should ignore other error types. They'll get picked up by other methods.
> Then:
>
> if (mca_flags.smca)
> __log_error_deferred_smca(bank));
>
> which handles the SMCA case. It too reads MSR_AMD64_SMCA_MCx_DESTAT
> and MSR_AMD64_SMCA_MCx_DEADDR and hands them down to
> __log_error() for logging.
>
Can this just be included in the __log_error_deferred()? I'm thinking something
similar to one of your earlier suggestions. Like this:
static void
__log_error_deferred(unsigned int bank)
{
u64 status, addr;
bool logged = false;
rdmsrl(msr_ops.status, status);
if (is_deferred_error(status)) {
rdmsrl(msr_ops.addr, addr);
__log_error(bank, status, addr, 0);
wrmsrl(msr_ops.status, 0);
logged = true;
}
if (!mca_flags.smca)
return;
if (logged) {
wrmsrl(MSR_AMD64_SMCA_MCx_DESTAT, 0);
return;
}
rdmsrl(MSR_AMD64_SMCA_MCx_DESTAT(bank), status);
if (is_deferred_error(status)) {
rdmsrl(MSR_AMD64_SMCA_MCx_DEADDR, addr);
__log_error(bank, status, addr, 0);
wrmsrl(MSR_AMD64_SMCA_MCx_DESTAT, 0);
}
}
> For the __log_error() call in amd_threshold_interrupt(), you define a
> log_error() wrapper which reads the default MSRs and hands them down to
> __log_error().
>
Okay.
> So __log_error() always gets STATUS and ADDR MSR values and it doesn't
> need to read them from the MSRs but only log them.
>
Okay.
Thanks,
Yazen
Powered by blists - more mailing lists