Message-ID: <20230929181626.210782-1-tony.luck@intel.com>
Date: Fri, 29 Sep 2023 11:16:23 -0700
From: Tony Luck <tony.luck@...el.com>
To: Borislav Petkov <bp@...en8.de>
Cc: Yazen Ghannam <yazen.ghannam@....com>,
Smita.KoralahalliChannabasappa@....com,
dave.hansen@...ux.intel.com, x86@...nel.org,
linux-edac@...r.kernel.org, linux-kernel@...r.kernel.org,
patches@...ts.linux.dev, Tony Luck <tony.luck@...el.com>
Subject: [PATCH v8 0/3] Handle corrected machine check interrupt storms
Linux CMCI storm mitigation is a big hammer that just disables the CMCI
interrupt globally and switches to polling all banks.
There are two problems with this:
1) It really is a big hammer. It means that errors reported in other
banks from different functional units are all subject to the same
polling delay before being processed.
2) Intel systems signal some uncorrected errors using CMCI (e.g.
memory controller patrol scrub on Icelake Xeon and newer). Delaying
the processing of these error reports negates some of the benefit of the patrol
scrubber providing early notice of errors before they are consumed and
cause a machine check.
This series throws away the old storm implementation and replaces it
with one that keeps track of the weather on each separate machine check
bank. When a storm is detected from a bank on Intel, the storm is
mitigated by setting a very high threshold for corrected errors to
signal CMCI. This threshold does not affect signaling CMCI for
uncorrected errors.
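
To make the per-bank idea concrete, here is a rough userspace sketch of how
the "weather" of each bank could be tracked independently. It is not the code
in this series: the names (bank_weather, track_bank), the bank count and both
thresholds are assumptions picked for illustration, and the real mitigation
step (raising the corrected error threshold for the affected bank) is only
marked by a comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NUM_BANKS        32  /* hypothetical number of machine check banks */
#define STORM_BEGIN_ERRS 5   /* assumed: errors in the history window that start a storm */
#define STORM_END_QUIET  30  /* assumed: consecutive clean polls that end a storm */

/* Per-bank state: one history bit per poll, newest in bit 0 (1 = error seen). */
struct bank_weather {
	uint64_t history;
	bool in_storm;
};

static struct bank_weather weather[NUM_BANKS];

/* Count the set bits in a 64-bit value. */
static int popcount64(uint64_t v)
{
	int n = 0;

	while (v) {
		v &= v - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * Record the result of one poll of @bank and update its storm state.
 * Returns true if the bank is in a storm after the update.
 * Other banks are never touched, so they keep their normal handling.
 */
static bool track_bank(unsigned int bank, bool error_seen)
{
	struct bank_weather *w;

	assert(bank < NUM_BANKS);
	w = &weather[bank];

	w->history = (w->history << 1) | (error_seen ? 1 : 0);

	if (!w->in_storm) {
		if (popcount64(w->history) >= STORM_BEGIN_ERRS) {
			/* Mitigation for this bank only would start here. */
			w->in_storm = true;
		}
	} else {
		uint64_t recent = w->history & ((UINT64_C(1) << STORM_END_QUIET) - 1);

		if (recent == 0) {
			/* Quiet for long enough: undo the mitigation, start afresh. */
			w->in_storm = false;
			w->history = 0;
		}
	}
	return w->in_storm;
}
```

With these assumed thresholds, five errors logged against one bank put that
bank (and only that bank) into storm mode, and thirty consecutive clean polls
take it back out.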
Signed-off-by: Tony Luck <tony.luck@...el.com>
---
Changes since v7:
Applied all the suggestions from Yazen's review of v7
Link: https://lore.kernel.org/all/c76723df-f2f1-4888-9e05-61917145503c@amd.com/
Link: https://lore.kernel.org/all/6ae4df67-ba0b-4b50-8c1d-a5d382105ad2@amd.com/
Including placing most of the storm tracking code into threshold.c
instead of bloating core.c.
Tony Luck (3):
  x86/mce: Remove old CMCI storm mitigation code
  x86/mce: Add per-bank CMCI storm mitigation
  x86/mce: Handle Intel threshold interrupt storms

 arch/x86/kernel/cpu/mce/internal.h  |  47 +++-
 arch/x86/kernel/cpu/mce/core.c      |  45 ++--
 arch/x86/kernel/cpu/mce/intel.c     | 338 ++++++++++++----------------
 arch/x86/kernel/cpu/mce/threshold.c |  86 +++++++
 4 files changed, 293 insertions(+), 223 deletions(-)

base-commit: 6465e260f48790807eef06b583b38ca9789b6072
--
2.41.0