Message-ID: <ZBR+GMH0olGoDMGs@yaz-fattaah>
Date: Fri, 17 Mar 2023 14:50:00 +0000
From: Yazen Ghannam <yazen.ghannam@....com>
To: Tony Luck <tony.luck@...el.com>
Cc: Borislav Petkov <bp@...en8.de>,
Smita.KoralahalliChannabasappa@....com,
dave.hansen@...ux.intel.com, hpa@...or.com,
linux-edac@...r.kernel.org, linux-kernel@...r.kernel.org,
x86@...nel.org, patches@...ts.linux.dev
Subject: Re: [PATCH v2 0/5] Handle corrected machine check interrupt storms
On Mon, Jun 27, 2022 at 10:36:00AM -0700, Tony Luck wrote:
> Extend the logic of handling Intel's corrected machine check interrupt
> storms to AMD's threshold interrupts.
>
> The first two patches are from Tony. They clean up the existing storm
> handling for Intel and propose per-CPU, per-bank storm handling.
>
> The third and fourth patches clean up and refactor the CMCI storm
> handling so that a similar workaround can be applied to AMD's threshold
> interrupt storms. These two patches could be merged into Tony's second
> patch (CMCI storm mitigation).
>
> AMD's storm mitigation for threshold interrupts also uses a per-CPU,
> per-bank approach similar to Intel's. But unlike CMCI storm handling, it
> does not set thresholds to reduce the rate of interrupts during a storm.
> Instead, it turns off the interrupt for the current CPU and bank when a
> storm is detected and re-enables it when the storm subsides.
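A minimal user-space sketch of the per-bank begin/end logic described above, assuming a 64-slot history bitmap with one slot per check. The struct, the thresholds (STORM_BEGIN_COUNT, STORM_QUIET_SLOTS) and the irq_off flag are hypothetical illustrations, not the names or values used in the patches; irq_off only stands in for turning the AMD threshold interrupt off and back on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sizes and thresholds, chosen only for illustration. */
#define SKETCH_NUM_BANKS   8
#define STORM_BEGIN_COUNT  5   /* errors within the 64-slot window that start a storm */
#define STORM_QUIET_SLOTS  30  /* consecutive error-free checks that end a storm */

struct bank_storm {
	uint64_t history;  /* bit 0 is the most recent slot */
	bool in_storm;     /* storm in progress: the bank is polled instead */
	bool irq_off;      /* stand-in for the per-bank threshold interrupt being off */
};

/* One array per CPU in the real design; a single CPU is modelled here. */
static struct bank_storm banks[SKETCH_NUM_BANKS];

/* Portable population count (number of set bits). */
static unsigned int popcount64(uint64_t x)
{
	unsigned int n = 0;

	while (x) {
		x &= x - 1;  /* clear the lowest set bit */
		n++;
	}
	return n;
}

/* Record one check of @bank and begin or end a storm as needed. */
static void track_storm(unsigned int bank, bool error_seen)
{
	struct bank_storm *b;

	assert(bank < SKETCH_NUM_BANKS);
	b = &banks[bank];

	/* Advance by one slot per check; real code shifts by elapsed time. */
	b->history = (b->history << 1) | (uint64_t)error_seen;

	if (!b->in_storm) {
		if (popcount64(b->history) >= STORM_BEGIN_COUNT) {
			b->in_storm = true;
			b->irq_off = true;   /* storm: turn the interrupt off, poll instead */
		}
	} else if ((b->history & ((1ULL << STORM_QUIET_SLOTS) - 1)) == 0) {
		b->in_storm = false;
		b->irq_off = false;      /* storm subsided: re-enable the interrupt */
		b->history = 0;          /* forget old errors so they cannot retrigger a storm */
	}
}
```

In this model a storm starts on the fifth error in the window and ends only after 30 consecutive clean checks, so a bank that keeps reporting errors stays in polling mode with its interrupt off.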
>
> It is okay to turn off threshold interrupts on AMD systems because other
> error severities continue to be handled while they are off. Uncorrected
> errors generate a #MC, and deferred errors have their own separate
> deferred error interrupt. The final patch adds support for handling
> threshold interrupt storms on AMD systems.
>
> Changes since v1:
>
> 1) Fixed the shift computation used to track bank history. The shift
> should be "1" while a storm is in progress (because the bank is polled
> once per second). When no storm is in progress, the shift should be
> based on the number of seconds since the bank was last checked.
>
> 2) Changed Smita's code in patch 0003 to avoid using a function pointer
> (the kernel avoids indirect branches that might be trainable for
> Spectre-like attacks).
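The shift rule in item 1 can be sketched as follows. SKETCH_HZ stands in for the kernel's HZ/jiffies, and history_shift()/update_history() are hypothetical helpers, not functions from the patches. The clamp matters because shifting a 64-bit value by 64 or more is undefined in C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tick rate; the kernel measures time in jiffies with HZ ticks/s. */
#define SKETCH_HZ 1000ULL

/*
 * How many one-second history slots to advance. During a storm the bank
 * is polled once per second, so each check is exactly one slot. Otherwise
 * round the elapsed ticks up to whole seconds, capped at 64.
 */
static unsigned int history_shift(bool in_storm, uint64_t now, uint64_t last)
{
	uint64_t secs;

	if (in_storm)
		return 1;
	secs = (now - last + SKETCH_HZ - 1) / SKETCH_HZ;
	return secs > 64 ? 64 : (unsigned int)secs;
}

/* Age the history by @shift slots and record this check's result in bit 0. */
static uint64_t update_history(uint64_t history, unsigned int shift, bool error_seen)
{
	assert(shift <= 64);
	/* Shifting a uint64_t by 64 or more is undefined in C, so handle it explicitly. */
	history = shift < 64 ? history << shift : 0;
	return history | (uint64_t)error_seen;
}
```

For instance, outside a storm with 2500 ticks (2.5 s) since the last check, history_shift() returns 3, so the skipped seconds are recorded as error-free slots rather than being collapsed into one.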
>
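Item 2 replaces an indirect call with explicit dispatch. A hedged sketch of that pattern, using a hypothetical vendor enum and placeholder hooks that only count calls; the actual patch's names and structure may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical vendor tag; the kernel has its own vendor identification. */
enum sketch_vendor { SKETCH_VENDOR_INTEL, SKETCH_VENDOR_AMD };

/* Counters so the effect of each hook is observable in this sketch. */
static int intel_calls, amd_calls;

/* Hypothetical per-vendor hooks, standing in for real CMCI/threshold code. */
static void intel_storm_hook(unsigned int bank, bool on)
{
	(void)bank;
	(void)on;
	intel_calls++;
}

static void amd_storm_hook(unsigned int bank, bool on)
{
	(void)bank;
	(void)on;
	amd_calls++;
}

/*
 * Dispatch with a switch instead of a stored function pointer: each case
 * makes a direct call, so no pointer is loaded and called indirectly.
 */
static void handle_storm(enum sketch_vendor v, unsigned int bank, bool on)
{
	switch (v) {
	case SKETCH_VENDOR_INTEL:
		intel_storm_hook(bank, on);
		break;
	case SKETCH_VENDOR_AMD:
		amd_storm_hook(bank, on);
		break;
	default:
		assert(0 && "unknown vendor");
		break;
	}
}
```

The cost is that adding a vendor means touching the switch, but with only two vendors that is a small price for keeping every call site direct.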
> Smita Koralahalli (3):
> x86/mce: Introduce mce_handle_storm() to deal with begin/end of storms
> x86/mce: Move storm handling to core.
> x86/mce: Handle AMD threshold interrupt storms
>
> Tony Luck (2):
> x86/mce: Remove old CMCI storm mitigation code
> x86/mce: Add per-bank CMCI storm mitigation
>
> arch/x86/kernel/cpu/mce/amd.c | 49 ++++++++
> arch/x86/kernel/cpu/mce/core.c | 139 +++++++++++++++++-----
> arch/x86/kernel/cpu/mce/intel.c | 179 +++++++----------------------
> arch/x86/kernel/cpu/mce/internal.h | 33 ++++--
> 4 files changed, 230 insertions(+), 170 deletions(-)
>
> --
Hi Tony,
Is there an updated version of this set? I can help review and test. Smita is
focusing on other items at the moment.
Thanks!
-Yazen