Message-ID: <SN6PR12MB263931263441531964542EB4F8480@SN6PR12MB2639.namprd12.prod.outlook.com>
Date: Mon, 11 Mar 2019 18:52:05 +0000
From: "Ghannam, Yazen" <Yazen.Ghannam@....com>
To: Borislav Petkov <bp@...en8.de>
CC: "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
Borislav Petkov <bp@...e.de>, Tony Luck <tony.luck@...el.com>,
"x86@...nel.org" <x86@...nel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"rafal@...ecki.pl" <rafal@...ecki.pl>,
"clemej@...il.com" <clemej@...il.com>
Subject: RE: [PATCH 2/2] x86/MCE/AMD, EDAC/mce_amd: Don't report L1 BTB MCA
errors on some Family 17h models
> -----Original Message-----
> From: linux-edac-owner@...r.kernel.org <linux-edac-owner@...r.kernel.org> On Behalf Of Borislav Petkov
> Sent: Monday, March 11, 2019 1:21 PM
> To: Ghannam, Yazen <Yazen.Ghannam@....com>
> Cc: linux-edac@...r.kernel.org; Borislav Petkov <bp@...e.de>; Tony Luck <tony.luck@...el.com>; x86@...nel.org; linux-kernel@...r.kernel.org; rafal@...ecki.pl; clemej@...il.com
> Subject: Re: [PATCH 2/2] x86/MCE/AMD, EDAC/mce_amd: Don't report L1 BTB MCA errors on some Family 17h models
>
> On Thu, Mar 07, 2019 at 09:26:04PM +0000, Ghannam, Yazen wrote:
> > +static bool smca_filter_mce(struct mce *m)
> > +{
> > + enum smca_bank_types bank_type = smca_get_bank_type(m->bank);
> > + struct cpuinfo_x86 *c = &boot_cpu_data;
> > + u8 xec = XEC(m->status, xec_mask);
> > +
> > + /*
> > + * Spurious errors of this type may be reported.
> > + * See Family 17h Models 10h-2Fh Erratum #1114.
> > + */
> > + if (c->x86 == 0x17 &&
> > + (c->x86_model >= 0x10 && c->x86_model <= 0x2F) &&
> > + bank_type == SMCA_IF && xec == 10)
> > + return true;
>
> This is happening too late and we need it much earlier, from Rafal's dmesg:
>
> [ 1.070855] mce: [Hardware Error]: Machine check events logged
> [ 1.070860] mce: [Hardware Error]: CPU 2: Machine Check: 0 Bank 1: d8200000000a0151
> [ 1.070863] mce: [Hardware Error]: TSC 73fa0765c MISC d01b0fff00000000 SYND 4a000000 IPID 100b000000000
> [ 1.071065] mce: [Hardware Error]: PROCESSOR 2:810f10 TIME 1543481411 SOCKET 0 APIC 2 microcode 810100b
>
> that's __print_mce() from the notifier.
>
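For reference, the logged values above can be decoded by hand to check that they match the erratum signature. The following is a minimal userspace sketch, not kernel code; the helper names are hypothetical, and the bit positions (extended error code in MCA_STATUS[21:16], HardwareID in MCA_IPID[43:32], McaType in MCA_IPID[63:48]) are my assumptions for Family 17h SMCA systems.

```c
#include <stdint.h>

/* Extended error code: assumed to sit in bits 21:16 of MCA_STATUS. */
static unsigned int status_xec(uint64_t status)
{
	return (unsigned int)((status >> 16) & 0x3f);
}

/* HardwareID: assumed to sit in bits 43:32 of MCA_IPID. */
static unsigned int ipid_hwid(uint64_t ipid)
{
	return (unsigned int)((ipid >> 32) & 0xfff);
}

/* McaType: assumed to sit in bits 63:48 of MCA_IPID. */
static unsigned int ipid_mcatype(uint64_t ipid)
{
	return (unsigned int)((ipid >> 48) & 0xffff);
}

/* Family from CPUID Fn0000_0001 EAX: base family plus extended family. */
static unsigned int cpuid_family(uint32_t eax)
{
	unsigned int family = (eax >> 8) & 0xf;

	if (family == 0xf)
		family += (eax >> 20) & 0xff;
	return family;
}

/* Model from CPUID Fn0000_0001 EAX: extended model forms the high nibble. */
static unsigned int cpuid_model(uint32_t eax)
{
	unsigned int model = (eax >> 4) & 0xf;

	if (((eax >> 8) & 0xf) == 0xf)
		model |= ((eax >> 16) & 0xf) << 4;
	return model;
}
```

Fed the values from the dmesg (status d8200000000a0151, IPID 100b000000000, CPUID 810f10), these helpers give an extended error code of 10, HardwareID 0xb0 with McaType 1, and Family 17h Model 11h, which falls inside the 10h-2Fh range the patch checks.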
> So we'd need a filter function which is called in do_machine_check() and
> machine_check_poll() right after we've collected enough info to be able
> to filter out the MCE based on the signature. In this case the extended
> error code and SMCA bank type suffice but we should put those functions
> late enough so that they can be used for other filtering later.
>
Okay, understood.
Should I keep the filter in edac_mce_amd? I guess it's not necessary if the error is filtered out earlier.
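To make sure I have the ordering right, here is a rough userspace sketch of the flow as I understand it: collect the record, run the filter, and only then count it as logged. The struct, the enum and the function names are hypothetical stand-ins, not the kernel's struct mce or the real polling code, and the bit-63 Valid check and the 0x3f mask are assumptions on my side.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-ins for the kernel's types. */
enum bank_type { BANK_LS, BANK_IF, BANK_OTHER };

struct mce_record {
	uint64_t status;	/* raw MCA_STATUS value */
	enum bank_type type;	/* decoded SMCA bank type */
	unsigned int family;	/* CPU family */
	unsigned int model;	/* CPU model */
};

/* Return true if the record matches the spurious-error signature. */
static bool filter_mce(const struct mce_record *m)
{
	unsigned int xec = (unsigned int)((m->status >> 16) & 0x3f);

	return m->family == 0x17 &&
	       m->model >= 0x10 && m->model <= 0x2f &&
	       m->type == BANK_IF && xec == 10;
}

/*
 * Walk the collected records and return how many would be logged.
 * The filter runs after the record is filled in but before anything
 * is logged or handed to a notifier.
 */
static int poll_banks(const struct mce_record *recs, int n)
{
	int logged = 0;

	for (int i = 0; i < n; i++) {
		if (!(recs[i].status & (1ULL << 63)))	/* Valid bit clear */
			continue;
		if (filter_mce(&recs[i]))		/* dropped early */
			continue;
		logged++;
	}
	return logged;
}
```

With the filter in that position, a record like the one in Rafal's dmesg would never reach __print_mce(), which is why the later check in edac_mce_amd looks redundant to me.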
> Alternatively, if this error type has a special bit in the mask registers so
> that you can disable it there ala
>
> if (c->x86_vendor == X86_VENDOR_AMD) {
> if (c->x86 == 15 && cfg->banks > 4) {
> /*
> * disable GART TBL walk error reporting, which
> * trips off incorrectly with the IOMMU & 3ware
> * & Cerberus:
> */
> clear_bit(10, (unsigned long *)&mce_banks[4].ctl);
>
>
> that would be even better but I'd guess it doesn't have a special bit...
>
Yes, that's right. Clearing a bit in MCA_CTL is not recommended in this case.
Thanks,
Yazen