lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20231115195450.12963-1-tony.luck@intel.com>
Date:   Wed, 15 Nov 2023 11:54:47 -0800
From:   Tony Luck <tony.luck@...el.com>
To:     Borislav Petkov <bp@...en8.de>
Cc:     Yazen Ghannam <yazen.ghannam@....com>,
        Smita.KoralahalliChannabasappa@....com,
        dave.hansen@...ux.intel.com, x86@...nel.org,
        linux-edac@...r.kernel.org, linux-kernel@...r.kernel.org,
        patches@...ts.linux.dev, Tony Luck <tony.luck@...el.com>
Subject: [PATCH v10 0/3] Handle corrected machine check interrupt storms

Linux CMCI storm mitigation is a big hammer that just disables the CMCI
interrupt globally and switches to polling all banks.

There are two problems with this:
1) It really is a big hammer. It means that errors reported in other
banks from different functional units are all subject to the same
polling delay before being processed.
2) Intel systems signal some uncorrected errors using CMCI (e.g.
memory controller patrol scrub on Icelake Xeon and newer). Delaying
processing these error reports negates some of the benefit of the patrol
scrubber providing early notice of errors before they are consumed and
cause a machine check.

This series throws away the old storm implementation and replaces it
with one that keeps track of the weather on each separate machine check
bank. When a storm is detected from a bank. On Intel the storm is
mitigated by setting a very high threshold for corrected errors to
signal CMCI. This threshold does not affect signaling CMCI for
uncorrected errors.

Signed-off-by: Tony Luck <tony.luck@...el.com>

---
Changes since v9 (based on Boris reviews)

#1 Better commit comment on flow. Added detail that both timer poll
   and CMCI feed results of scanning each bank into the history
   calculation. Also added comment in code where mce_trac_storm()
   is called.
#2 Set a flag for banks that don't support CMCI so they can be
   excluded from history processing
#3 Skip history processing if CMCI globally disabled with boot
   argument mce=cmci_disable
#4 Move struct mca_storm_desc definition to internal.h (I had argued
   against the need for this, but the new "poll_mode" flag added in
   change #2 needs to be set in intel.c).
#5 Add #define NUM_HISTORY_BITS instead of hard-coded "64".
#6 Rebase to v6.7-rc1
  

Tony Luck (3):
  x86/mce: Remove old CMCI storm mitigation code
  x86/mce: Add per-bank CMCI storm mitigation
  x86/mce: Handle Intel threshold interrupt storms

 arch/x86/kernel/cpu/mce/internal.h  |  66 +++++-
 arch/x86/kernel/cpu/mce/core.c      |  53 +++--
 arch/x86/kernel/cpu/mce/intel.c     | 304 ++++++++++++----------------
 arch/x86/kernel/cpu/mce/threshold.c | 115 +++++++++++
 4 files changed, 332 insertions(+), 206 deletions(-)


base-commit: b85ea95d086471afb4ad062012a4d73cd328fa86
-- 
2.41.0

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ