[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150504154652.GF3829@pd.tnic>
Date: Mon, 4 May 2015 17:46:52 +0200
From: Borislav Petkov <bp@...en8.de>
To: Aravind Gopalakrishnan <aravind.gopalakrishnan@....com>
Cc: tglx@...utronix.de, mingo@...hat.com, hpa@...or.com,
tony.luck@...el.com, jiang.liu@...ux.intel.com, yinghai@...nel.org,
x86@...nel.org, dvlasenk@...hat.com, JBeulich@...e.com,
slaoub@...il.com, luto@...capital.net, dave.hansen@...ux.intel.com,
oleg@...hat.com, rostedt@...dmis.org, rusty@...tcorp.com.au,
prarit@...hat.com, linux@...musvillemoes.dk, jroedel@...e.de,
andriy.shevchenko@...ux.intel.com, macro@...ux-mips.org,
wangnan0@...wei.com, linux-kernel@...r.kernel.org,
linux-edac@...r.kernel.org, Robert Richter <rric@...nel.org>
Subject: Re: [PATCH 2/4] x86/mce/amd: Introduce deferred error interrupt
handler
On Mon, May 04, 2015 at 10:29:50AM -0500, Aravind Gopalakrishnan wrote:
> >What's the family check for? for BIOSes which don't set the LVT offset
> >to 2, as they should?
> >
> >If so, we probably should say
> >
> > pr_err(FW_BUG "Your BIOS is not setting up LVT offset 0x2 for deferred error IRQs correctly.\n");
> >
> >or similar...
>
> Yeah. I meant to provide a comment at least for this.
> Forgot to do that.
>
> I'll print out a error message as you suggested (considering we do this in
> other places like threshold setup or IBS setup..)
lvt_off_valid() does that already. Adding Robert.
> Right. I think a __log_error() is a good idea.
> Except, in amd_threshold_interrupt(), we have-
> m.misc = ((u64)high << 32) | low;
>
> which, is actually useless as we don't use m.misc anywhere in
> amd_decode_mce() or anywhere else in the decoding pipeline AFAICT.
> We only print out if 'misc' is valid and we only need status bits for that-
> ((m->status & MCI_STATUS_MISCV) ? "MiscV" : "-"),
>
> But, more importantly, we don't setup 'm.addr' here (in
> amd_threshold_interrupt() or in amd_deferred_error_interrupt())
> Which means anytime we pass an error to be decoded from the interrupt
> handlers, we don't get any info about the error address.
So what are we reporting with a deferred error if it is not a
full-fledged MCE? We better fix that otherwise we probably shouldn't
even report those. I mean, userspace is supposed to do some error
handling based on error info but if that info's missing, we might just
as well panic right then and there, right?
> So, we can do one of these-
> 1. Remove m.misc setup in amd_threshold_interrupt() and
> rdmsrl(MSR_IA32_MCx_ADDR(bank), m.addr) before we call mce_log()
> 2. Since we have mce_read_aux() that reads misc and addr registers, we can
> move the mce_[rd|wr]msrl wrappers and mce_read_aux() into mce.h and use it
> here in mce_amd.c
>
> Thoughts?
Makes sense but you need to first check though, which registers are
valid in the hw when a threshold/deferred error happens and collect
them. Only then we can do proper recovery.
Thanks.
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
--
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists