Message-ID: <20200111131744.GC23583@zn.tnic>
Date: Sat, 11 Jan 2020 14:17:44 +0100
From: Borislav Petkov <bp@...en8.de>
To: "Luck, Tony" <tony.luck@...el.com>
Cc: Jan H. Schönherr <jschoenh@...zon.de>,
Yazen Ghannam <yazen.ghannam@....com>,
linux-kernel@...r.kernel.org, linux-edac@...r.kernel.org,
Thomas Gleixner <tglx@...utronix.de>,
Ingo Molnar <mingo@...hat.com>,
"H. Peter Anvin" <hpa@...or.com>, x86@...nel.org
Subject: Re: [PATCH v2 1/6] x86/mce: Take action on UCNA/Deferred errors again
On Fri, Jan 10, 2020 at 10:45:33AM -0800, Luck, Tony wrote:
> I totally agree that counting notifiers is clumsy. Also less than
> ideal is the concept that any notifier on the chain can declare:
> "I fixed it"
> and prevent any other notifiers from even seeing it. Well the concept
> is good, but it is overused.

But why can't we use it?

Don't get me wrong: I'm simply following my KISS approach to do the
simplest scheme required. So, do you see a use case where the whole
error handling chain would need more sophisticated handling?

> I think we may do better with a field in the "struct mce" that is being
> passed to each where notifiers can wiggle some bits (semantics to be
> defined later) which can tell subsequent notifiers what sort of actions
> have been taken.
> E.g. the SRAO/UCNA notifier can say "I took this page offline"
> the dev_mcelog one can say "I think I handed to a process that has /dev/mcelog open"
> EDAC drivers can say "I decoded the address and printed something"
> CEC can say: "I silently counted this corrected error", or "error exceeded
> threshold and I took the page offline".
>
> The default notifier can print to console if nobody set a bit to say
> that the error had been somehow logged.

That idea is good and I'll gladly take patches for it so if you wanna do
it...
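
To make the quoted proposal concrete, here is a minimal user-space sketch of
a per-error bit field that each notifier can update and that a final default
notifier checks before printing. Everything in it is an assumption for
illustration: the names (struct sketch_mce, the ACT_* bits, the sketch_*
functions) are hypothetical, the set of bits treated as "logged" is a guess,
and the chain is a plain array of function pointers rather than the kernel's
notifier API. The real semantics are still "to be defined later".

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical action bits; none of these names exist in the kernel. */
#define ACT_PAGE_OFFLINED  (1u << 0)  /* "I took this page offline" */
#define ACT_MCELOG_QUEUED  (1u << 1)  /* handed to a /dev/mcelog reader */
#define ACT_EDAC_DECODED   (1u << 2)  /* decoded the address and printed */
#define ACT_CEC_COUNTED    (1u << 3)  /* silently counted a corrected error */

/* Assumption: which bits mean "the error has been logged somewhere". */
#define ACT_LOGGED_MASK (ACT_MCELOG_QUEUED | ACT_EDAC_DECODED | ACT_CEC_COUNTED)

/* Stand-in for struct mce, reduced to what the sketch needs. */
struct sketch_mce {
	unsigned long long addr;
	unsigned int actions;   /* bits set by notifiers along the chain */
};

typedef void (*sketch_notifier_fn)(struct sketch_mce *m);

/* Counts how often the default notifier fell back to the console. */
static int console_prints;

/* Pretends to offline the page and records that it did so. */
static void sketch_uc_notifier(struct sketch_mce *m)
{
	m->actions |= ACT_PAGE_OFFLINED;
}

/* Pretends to decode the error and records that something was printed. */
static void sketch_edac_notifier(struct sketch_mce *m)
{
	printf("EDAC: decoded address 0x%llx\n", m->addr);
	m->actions |= ACT_EDAC_DECODED;
}

/* Last in the chain: prints only if nobody marked the error as logged. */
static void sketch_default_notifier(struct sketch_mce *m)
{
	if (m->actions & ACT_LOGGED_MASK)
		return;
	printf("mce: unlogged error at 0x%llx\n", m->addr);
	console_prints++;
}

/* Every notifier sees the error; none of them can stop the chain. */
static void sketch_run_chain(sketch_notifier_fn *chain, size_t n,
			     struct sketch_mce *m)
{
	assert(chain != NULL && m != NULL);
	for (size_t i = 0; i < n; i++)
		chain[i](m);
}
```

With only sketch_default_notifier on the chain the error reaches the
console; with the EDAC stand-in ahead of it, the default notifier stays
quiet. The point of the design is that each notifier reports what it did
instead of deciding for everyone else that the error is "fixed".
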
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette