Date:	Mon, 22 Dec 2014 14:10:09 -0600
From:	Aravind Gopalakrishnan <Aravind.Gopalakrishnan@....com>
To:	<tglx@...utronix.de>, <mingo@...hat.com>, <hpa@...or.com>,
	<tony.luck@...el.com>, <bp@...en8.de>, <dougthompson@...ssion.com>,
	<mchehab@....samsung.com>, <x86@...nel.org>,
	<linux-kernel@...r.kernel.org>, <linux-edac@...r.kernel.org>
CC:	<dave.hansen@...ux.intel.com>, <mgorman@...e.de>, <bp@...e.de>,
	<riel@...hat.com>, <jacob.w.shin@...il.com>,
	Aravind Gopalakrishnan <Aravind.Gopalakrishnan@....com>
Subject: [PATCH 0/3] Fix MCE handling for AMD multi-node processors

When an MCE that is to be logged in bank 4 occurs on an AMD multi-node
processor, it is reported only to the node base core (NBC) corresponding
to the cpu on which the error occurred.

Refer to D18F3x44[NbMcaToMstCpuEn] in the BKDGs for Fam10h and later for
details on the reporting of MC4 errors only to NBC MSRs.

We currently don't have the exception handler wired up to handle this.
As a consequence, do_machine_check only runs on the core on which the
error occurred and (since, according to the BKDGs, reads of the
MC4_STATUS MSR on a non-NBC core will simply RAZ) the exception is
ignored for that core.

This is a problem as we now have dropped MCEs.
I tested this on Fam10h and Fam15h using mce_amd_inj and by triggering
a real HW MCE using Boris' new interface, and can confirm the behavior.

This patch set fixes the issue by looking at the NBC MSRs when bank 4
errors happen on AMD multi-node processors.
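
As a rough illustration of the idea only, and not the code from these
patches, the lookup can be modelled in user space. Everything below is
hypothetical: the 2-node x 4-core topology, the sim_status array standing
in for MSR reads, and the helper names node_base_core(), read_status(),
status_source_cpu() and bank_has_error(). The sketch assumes the NBC is
the lowest-numbered core of a node.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS        8   /* hypothetical: 2 nodes x 4 cores */
#define CORES_PER_NODE 4   /* hypothetical topology */
#define NR_BANKS       6   /* hypothetical bank count */
#define NB_BANK        4   /* bank 4 holds the northbridge errors */
#define MCI_STATUS_VAL (1ULL << 63)  /* MCi_STATUS valid bit */

/* Simulated per-CPU MCi_STATUS registers (stand-in for rdmsr). */
static uint64_t sim_status[NR_CPUS][NR_BANKS];

/* Hypothetical helper: assume the NBC is the first core of the node. */
static int node_base_core(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return cpu - (cpu % CORES_PER_NODE);
}

/* Model RAZ: a bank 4 read on a non-NBC core returns zero. */
static uint64_t read_status(int cpu, int bank)
{
	if (bank == NB_BANK && cpu != node_base_core(cpu))
		return 0;
	return sim_status[cpu][bank];
}

/* Pick the cpu whose registers should be consulted for this bank. */
static int status_source_cpu(int cpu, int bank)
{
	return bank == NB_BANK ? node_base_core(cpu) : cpu;
}

/* True if a valid error is logged for (cpu, bank), via the NBC for bank 4. */
static bool bank_has_error(int cpu, int bank)
{
	return read_status(status_source_cpu(cpu, bank), bank) & MCI_STATUS_VAL;
}
```

In this model, an error logged in bank 4 of cpu 4 (the NBC of the second
node) is invisible to a direct read from cpu 6, but bank_has_error(6,
NB_BANK) still finds it because the read is redirected to the NBC.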

Patch 1: Refactor AMD cpu topology functions so that we can get some
	 relevant info that we need in the EDAC and MC handler routines
Patch 2: The fix to our problem
Patch 3: Modify mce_amd_inj interfaces to write only to the NBC for
	 bank 4 errors. Only then will they be picked up for error handling.

Aravind Gopalakrishnan (3):
  x86,amd: Refactor amd cpu topology functions for multi-node processors
  x86, mce: Handle AMD MCE on bank4 on NBC for multi-node processors
  edac, mce_amd_inj: Inject errors only on NBC for bank 4 errors

 arch/x86/include/asm/processor.h |   1 +
 arch/x86/kernel/cpu/amd.c        |  78 ++++++++++++++----
 arch/x86/kernel/cpu/mcheck/mce.c | 167 +++++++++++++++++++++++++++++++++++----
 drivers/edac/mce_amd_inj.c       |  21 ++++-
 4 files changed, 235 insertions(+), 32 deletions(-)

-- 
2.0.2
