[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <feea1978-04ef-82f9-933e-4ae280391aea@amd.com>
Date: Fri, 22 Mar 2019 19:24:01 +0000
From: "Ghannam, Yazen" <Yazen.Ghannam@....com>
To: Borislav Petkov <bp@...en8.de>
CC: "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"tony.luck@...el.com" <tony.luck@...el.com>,
"x86@...nel.org" <x86@...nel.org>,
"rafal@...ecki.pl" <rafal@...ecki.pl>,
"clemej@...il.com" <clemej@...il.com>
Subject: Re: [PATCH v2 2/2] x86/MCE/AMD: Don't report L1 BTB MCA errors on
some Family 17h models
On 3/22/2019 12:34 PM, Borislav Petkov wrote:
> On Thu, Mar 21, 2019 at 08:25:18PM +0000, Ghannam, Yazen wrote:
>> From: Yazen Ghannam <yazen.ghannam@....com>
>>
>> AMD Family 17h Models 10h-2Fh may report a high number of L1 BTB MCA
>> errors under certain conditions. The errors are benign and can safely be
>> ignored. However, the high error rate may cause the MCA threshold
>> counter to overflow causing a high rate of thresholding interrupts. In
>> addition, users may see the errors reported through the AMD MCE decoder
>> module, even with the interrupt disabled, due to MCA polling.
>>
>> This error is reported through the Instruction Fetch bank.
>>
>> Clear the "Counter Present" bit in the Instruction Fetch bank's
>> MCA_MISC0 register. This will prevent enabling MCA thresholding on this
>> bank which will prevent the high interrupt rate due to this error.
>>
>> Define a function to filter these errors from the MCE event pool.
>> Install this function during AMD vendor init. The MCA banks are enabled
>> after vendor init, so the filter function will be installed before the
>> spurious errors will be reported.
>>
>> Cc: <stable@...r.kernel.org> # 4.14.x: c95b323dcd35: x86/MCE/AMD: Turn off MC4_MISC thresholding on all family 0x15 models
>> Cc: <stable@...r.kernel.org> # 4.14.x: 30aa3d26edb0: x86/MCE/AMD: Carve out the MC4_MISC thresholding quirk
>> Cc: <stable@...r.kernel.org> # 4.14.x
>> Signed-off-by: Yazen Ghannam <yazen.ghannam@....com>
>> ---
>> Link:
>> https://lkml.kernel.org/r/20190307212552.8865-2-Yazen.Ghannam@amd.com
>>
>> v1->v2:
>> * Filter out the error earlier in MCE code rather than later in EDAC.
>>
>> arch/x86/kernel/cpu/mce/amd.c | 57 ++++++++++++++++++++++++++++-------
>> 1 file changed, 46 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
>> index e64de5149e50..2db85f65b41e 100644
>> --- a/arch/x86/kernel/cpu/mce/amd.c
>> +++ b/arch/x86/kernel/cpu/mce/amd.c
>> @@ -563,22 +563,55 @@ prepare_threshold_block(unsigned int bank, unsigned int block, u32 addr,
>> return offset;
>> }
>>
>> +bool filter_mce_rv(struct mce *m)
>> +{
>> + enum smca_bank_types bank_type = smca_get_bank_type(m->bank);
>> + u8 xec = (m->status >> 16) & 0x3F;
>> +
>> + /*
>> + * Spurious errors of this type may be reported.
>> + * See Family 17h Models 10h-2Fh Erratum #1114.
>> + */
>> + if (bank_type == SMCA_IF && xec == 10)
>> + return true;
>> +
>> + return false;
>> +}
>> +
>> +static void filter_mce_check(struct cpuinfo_x86 *c)
>> +{
>> + if (c->x86 == 0x17 && (c->x86_model >= 0x10 && c->x86_model <= 0x2F))
>> + filter_mce = filter_mce_rv;
>> +}
>
> Why all the noodling here with a check function which assigns a
> filter_mce_rv (btw, that "rv" means nothing outside of AMD) and a
> generic default_filter_mce?
>
Okay. It could be named something else. This model group is Raven Ridge, so I thought "rv" would be a good. Maybe spelling it out would be better? Or should it just be something family/model?
> Why not a simple filter_mce() in mce/core.c which calls amd_filter_mce()
> based on x86_vendor and amd_filter_mce() is defined in mce/amd.c?
>
Generally, the model groups share the same hardware design and so the same quirks. So I'm thinking that it'd be more efficient to have a filter function that targets a specific group of models rather than one that checks all known quirks on all models.
Most of the quirks are dealt with at init time, but this needs be to done during run time for each MCE that is logged. So I didn't want to add unnecessary checks to the MCE handlers. We have quirk_no_way_out() that does something similar.
Thanks,
Yazen
Powered by blists - more mailing lists