lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <feea1978-04ef-82f9-933e-4ae280391aea@amd.com>
Date:   Fri, 22 Mar 2019 19:24:01 +0000
From:   "Ghannam, Yazen" <Yazen.Ghannam@....com>
To:     Borislav Petkov <bp@...en8.de>
CC:     "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "tony.luck@...el.com" <tony.luck@...el.com>,
        "x86@...nel.org" <x86@...nel.org>,
        "rafal@...ecki.pl" <rafal@...ecki.pl>,
        "clemej@...il.com" <clemej@...il.com>
Subject: Re: [PATCH v2 2/2] x86/MCE/AMD: Don't report L1 BTB MCA errors on
 some Family 17h models

On 3/22/2019 12:34 PM, Borislav Petkov wrote:
> On Thu, Mar 21, 2019 at 08:25:18PM +0000, Ghannam, Yazen wrote:
>> From: Yazen Ghannam <yazen.ghannam@....com>
>>
>> AMD Family 17h Models 10h-2Fh may report a high number of L1 BTB MCA
>> errors under certain conditions. The errors are benign and can safely be
>> ignored. However, the high error rate may cause the MCA threshold
>> counter to overflow causing a high rate of thresholding interrupts. In
>> addition, users may see the errors reported through the AMD MCE decoder
>> module, even with the interrupt disabled, due to MCA polling.
>>
>> This error is reported through the Instruction Fetch bank.
>>
>> Clear the "Counter Present" bit in the Instruction Fetch bank's
>> MCA_MISC0 register. This will prevent enabling MCA thresholding on this
>> bank which will prevent the high interrupt rate due to this error.
>>
>> Define a function to filter these errors from the MCE event pool.
>> Install this function during AMD vendor init. The MCA banks are enabled
>> after vendor init, so the filter function will be installed before the
>> spurious errors will be reported.
>>
>> Cc: <stable@...r.kernel.org> # 4.14.x: c95b323dcd35: x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models
>> Cc: <stable@...r.kernel.org> # 4.14.x: 30aa3d26edb0: x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk
>> Cc: <stable@...r.kernel.org> # 4.14.x
>> Signed-off-by: Yazen Ghannam <yazen.ghannam@....com>
>> ---
>> Link:
>> https://lkml.kernel.org/r/20190307212552.8865-2-Yazen.Ghannam@amd.com
>>
>> v1->v2:
>> * Filter out the error earlier in MCE code rather than later in EDAC.
>>
>>  arch/x86/kernel/cpu/mce/amd.c | 57 ++++++++++++++++++++++++++++-------
>>  1 file changed, 46 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
>> index e64de5149e50..2db85f65b41e 100644
>> --- a/arch/x86/kernel/cpu/mce/amd.c
>> +++ b/arch/x86/kernel/cpu/mce/amd.c
>> @@ -563,22 +563,55 @@ prepare_threshold_block(unsigned int bank, unsigned int block, u32 addr,
>>  	return offset;
>>  }
>>  
>> +bool filter_mce_rv(struct mce *m)
>> +{
>> +	enum smca_bank_types bank_type = smca_get_bank_type(m->bank);
>> +	u8 xec = (m->status >> 16) & 0x3F;
>> +
>> +	/*
>> +	 * Spurious errors of this type may be reported.
>> +	 * See Family 17h Models 10h-2Fh Erratum #1114.
>> +	 */
>> +	if (bank_type == SMCA_IF && xec == 10)
>> +		return true;
>> +
>> +	return false;
>> +}
>> +
>> +static void filter_mce_check(struct cpuinfo_x86 *c)
>> +{
>> +	if (c->x86 == 0x17 && (c->x86_model >= 0x10 && c->x86_model <= 0x2F))
>> +		filter_mce = filter_mce_rv;
>> +}
> 
> Why all the noodling here with a check function which assigns a
> filter_mce_rv (btw, that "rv" means nothing outside of AMD) and a
> generic default_filter_mce?
> 

Okay. It could be named something else. This model group is Raven Ridge, so I thought "rv" would be a good. Maybe spelling it out would be better? Or should it just be something family/model?

> Why not a simple filter_mce() in mce/core.c which calls amd_filter_mce()
> based on x86_vendor and amd_filter_mce() is defined in mce/amd.c?
> 

Generally, the model groups share the same hardware design and so the same quirks. So I'm thinking that it'd be more efficient to have a filter function that targets a specific group of models rather than one that checks all known quirks on all models.

Most of the quirks are dealt with at init time, but this needs be to done during run time for each MCE that is logged. So I didn't want to add unnecessary checks to the MCE handlers. We have quirk_no_way_out() that does something similar.

Thanks,
Yazen

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ