Date:   Mon, 11 Mar 2019 18:52:05 +0000
From:   "Ghannam, Yazen" <Yazen.Ghannam@....com>
To:     Borislav Petkov <bp@...en8.de>
CC:     "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
        Borislav Petkov <bp@...e.de>, Tony Luck <tony.luck@...el.com>,
        "x86@...nel.org" <x86@...nel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "rafal@...ecki.pl" <rafal@...ecki.pl>,
        "clemej@...il.com" <clemej@...il.com>
Subject: RE: [PATCH 2/2] x86/MCE/AMD, EDAC/mce_amd: Don't report L1 BTB MCA
 errors on some Family 17h models

> -----Original Message-----
> From: linux-edac-owner@...r.kernel.org <linux-edac-owner@...r.kernel.org> On Behalf Of Borislav Petkov
> Sent: Monday, March 11, 2019 1:21 PM
> To: Ghannam, Yazen <Yazen.Ghannam@....com>
> Cc: linux-edac@...r.kernel.org; Borislav Petkov <bp@...e.de>; Tony Luck <tony.luck@...el.com>; x86@...nel.org;
> linux-kernel@...r.kernel.org; rafal@...ecki.pl; clemej@...il.com
> Subject: Re: [PATCH 2/2] x86/MCE/AMD, EDAC/mce_amd: Don't report L1 BTB MCA errors on some Family 17h models
> 
> On Thu, Mar 07, 2019 at 09:26:04PM +0000, Ghannam, Yazen wrote:
> > +static bool smca_filter_mce(struct mce *m)
> > +{
> > +	enum smca_bank_types bank_type = smca_get_bank_type(m->bank);
> > +	struct cpuinfo_x86 *c = &boot_cpu_data;
> > +	u8 xec = XEC(m->status, xec_mask);
> > +
> > +	/*
> > +	 * Spurious errors of this type may be reported.
> > +	 * See Family 17h Models 10h-2Fh Erratum #1114.
> > +	 */
> > +	if (c->x86 == 0x17 &&
> > +	    (c->x86_model >= 0x10 && c->x86_model <= 0x2F) &&
> > +	    bank_type == SMCA_IF && xec == 10)
> > +		return true;
> 
> This is happening too late and we need it much earlier, from Rafal's dmesg:
> 
> [    1.070855] mce: [Hardware Error]: Machine check events logged
> [    1.070860] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 1: d8200000000a0151
> [    1.070863] mce: [Hardware Error]: TSC 73fa0765c MISC d01b0fff00000000 SYND 4a000000 IPID 100b000000000
> [    1.071065] mce: [Hardware Error]: PROCESSOR 2:810f10 TIME 1543481411 SOCKET 0 APIC 2 microcode 810100b
> 
> that's __print_mce() from the notifier.
> 
> So we'd need a filter function which is called in do_machine_check() and
> machine_check_poll() right after we've collected enough info to be able
> to filter out the MCE based on the signature. In this case the extended
> error code and SMCA bank type suffice but we should put those functions
> late enough so that they can be used for other filtering later.
> 

Okay, understood.

Should I keep the filter in edac_mce_amd? I guess it's not necessary if filtered out earlier.
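
For illustration only, here is a minimal user-space sketch of the signature check that such an early filter would perform. It is not the kernel code: the sketch_* types and helpers are hypothetical stand-ins, and it assumes the extended error code sits in MCA_STATUS bits [21:16], matching an XEC mask of 0x3f.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's types; not the real definitions. */
enum sketch_bank_type { SKETCH_BANK_LS, SKETCH_BANK_IF, SKETCH_BANK_OTHER };

struct sketch_mce {
	uint64_t status;                 /* raw MCA_STATUS value */
	uint8_t family;                  /* CPU family, e.g. 0x17 */
	uint8_t model;                   /* CPU model */
	enum sketch_bank_type bank_type; /* decoded SMCA bank type */
};

/* Extended error code: assumed to be MCA_STATUS bits [21:16]. */
static uint8_t sketch_xec(uint64_t status)
{
	return (uint8_t)((status >> 16) & 0x3f);
}

/*
 * Return true if the error matches the spurious L1 BTB signature
 * (Family 17h Models 10h-2Fh, IF bank, extended error code 10)
 * and should be dropped before it is logged.
 */
static bool sketch_filter_mce(const struct sketch_mce *m)
{
	return m->family == 0x17 &&
	       m->model >= 0x10 && m->model <= 0x2f &&
	       m->bank_type == SKETCH_BANK_IF &&
	       sketch_xec(m->status) == 10;
}
```

With the status value from the dmesg above (d8200000000a0151), sketch_xec() yields 0x0a, so the check only needs the status word plus the bank type and CPU family/model, all of which are available at the point where the error is first read.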

> Alternatively, if this error type has a special bit in the mask registers so
> that you can disable it there ala
> 
>         if (c->x86_vendor == X86_VENDOR_AMD) {
>                 if (c->x86 == 15 && cfg->banks > 4) {
>                         /*
>                          * disable GART TBL walk error reporting, which
>                          * trips off incorrectly with the IOMMU & 3ware
>                          * & Cerberus:
>                          */
>                         clear_bit(10, (unsigned long *)&mce_banks[4].ctl);
> 
> 
> that would be even better but I'd guess it doesn't have a special bit...
> 

Yes, that's right. Clearing a bit in MCA_CTL is not recommended in this case.
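
For contrast, the mask-based approach in the quoted snippet amounts to clearing one reporting bit in a bank's control word before it is written out. A user-space sketch with a hypothetical helper name (the kernel snippet uses clear_bit() on the bank's ctl field instead):

```c
#include <stdint.h>

/* Hypothetical helper: return ctl with reporting bit `bit` (0-63) cleared. */
static uint64_t sketch_ctl_clear_bit(uint64_t ctl, unsigned int bit)
{
	return ctl & ~(UINT64_C(1) << bit);
}
```

That works when the hardware has a dedicated enable bit for the error type, as in the quoted GART example where bit 10 of bank 4 is cleared.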

Thanks,
Yazen
