Message-ID: <15355297-4ff3-4626-b5d5-ac50aea87589@amd.com>
Date: Fri, 21 Nov 2025 01:04:47 -0600
From: "Naik, Avadhut" <avadnaik@....com>
To: Greg KH <gregkh@...uxfoundation.org>
Cc: stable@...r.kernel.org, sashal@...nel.org, linux-kernel@...r.kernel.org,
Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>,
Tony Luck <tony.luck@...el.com>, Yazen Ghannam <yazen.ghannam@....com>,
Borislav Petkov <bp@...en8.de>, Qiuxu Zhuo <qiuxu.zhuo@...el.com>,
Avadhut Naik <avadhut.naik@....com>
Subject: [PATCH] x86/mce: Handle AMD threshold interrupt storms

On 11/21/2025 00:53, Greg KH wrote:
> On Thu, Nov 20, 2025 at 09:41:24PM +0000, Avadhut Naik wrote:
>> From: Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>
>>
>> Extend the logic of handling CMCI storms to AMD threshold interrupts.
>>
>> Rely on an approach similar to Intel's CMCI to mitigate storms per CPU and
>> per bank. But, unlike CMCI, do not set thresholds and reduce interrupt rate on
>> a storm. Rather, disable the interrupt on the corresponding CPU and bank.
>> Re-enable the interrupts if enough consecutive polls of the bank show no
>> corrected errors (30, as programmed by Intel).
>>
>> Turning off the threshold interrupts would be a better solution on AMD systems
>> as other error severities will still be handled even if the threshold
>> interrupts are disabled.
>>
>> Also, AMD systems currently allow banks to be managed by both polling and
>> interrupts. So don't modify the polling banks set after a storm ends.
>>
>> [Tony: Small tweak because mce_handle_storm() isn't a pointer now]
>> [Yazen: Rebase and simplify]
>>
>> Stable backport notes:
>> 1. Currently, when a Machine check interrupt storm is detected, the bank's
>> corresponding bit in mce_poll_banks per-CPU variable is cleared by
>> cmci_storm_end(). As a result, on AMD's SMCA systems, errors injected or
>> encountered after the storm subsides are not logged since polling on that
>> bank has been disabled. Polling banks set on AMD systems should not be
>> modified when a storm subsides.
>>
>> 2. This patch is a snippet from the CMCI storm handling patch (link below)
>> that has been accepted into tip for v6.19. While backporting the patch
>> would have been the preferred way, the same cannot be undertaken since
>> it's part of a larger set. As such, this fix will be temporary. When the
>> original patch and its set are integrated into stable, this patch should be
>> reverted.
>>
>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@....com>
>> Signed-off-by: Tony Luck <tony.luck@...el.com>
>> Signed-off-by: Yazen Ghannam <yazen.ghannam@....com>
>> Signed-off-by: Borislav Petkov (AMD) <bp@...en8.de>
>> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@...el.com>
>> Link: https://lore.kernel.org/20251104-wip-mca-updates-v8-0-66c8eacf67b9@amd.com
>> Signed-off-by: Avadhut Naik <avadhut.naik@....com>
>> ---
>> This is somewhat of a new scenario for me, and I am not really sure about
>> the procedure. Hence, I have neither modified the commit message nor removed
>> the tags. If required, I will rework both.
>> Also, while this issue can be encountered on AMD systems using v6.8 and
>> later stable kernels, we would specifically prefer for this fix to be
>> backported to v6.12 since it's an LTS kernel.
>
> What is the git commit id of this change in Linus's tree?
I think it has not yet been merged into the mainline master branch.
The commit was accepted into tip recently (5th November).
Its commit ID there is:
a5834a5458aa004866e7da402c6bc2dfe2f3737e
Link: https://lore.kernel.org/all/176243356968.2601451.11559805061162819633.tip-bot2@tip-bot2/
Should I send another version with this commit ID mentioned in the commit message?
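For anyone following along, the behaviour described in the quoted commit message can be sketched roughly as below. This is only an illustration under stated assumptions, not the kernel code: struct bank_state, the function names and STORM_BEGIN_IRQS are made up for this example, and the storm-detection rule is simplified to a plain interrupt count. Only the figure of 30 error-free polls and the rule that the polling set is left alone come from the commit message.

```c
#include <stdbool.h>

/* Consecutive error-free polls that end a storm (30 per the commit message). */
#define STORM_END_POLLS  30
/* Hypothetical: number of threshold interrupts on a bank treated as a storm. */
#define STORM_BEGIN_IRQS 5

/* Hypothetical per-CPU, per-bank bookkeeping for this sketch. */
struct bank_state {
	bool in_storm;            /* storm currently being mitigated */
	bool irq_enabled;         /* threshold interrupt enabled for this bank */
	bool in_poll_set;         /* bank is also handled by regular polling */
	unsigned int irq_count;   /* interrupts seen since the last storm ended */
	unsigned int quiet_polls; /* consecutive polls with no corrected error */
};

void bank_init(struct bank_state *b, bool in_poll_set)
{
	b->in_storm = false;
	b->irq_enabled = true;
	b->in_poll_set = in_poll_set;
	b->irq_count = 0;
	b->quiet_polls = 0;
}

/* Called for each threshold interrupt on this bank. */
void bank_threshold_irq(struct bank_state *b)
{
	if (b->in_storm)
		return;
	if (++b->irq_count >= STORM_BEGIN_IRQS) {
		/* Storm begins: disable the interrupt rather than raising a threshold. */
		b->in_storm = true;
		b->irq_enabled = false;
		b->quiet_polls = 0;
	}
}

/* Called for each poll of this bank while it is being watched. */
void bank_poll(struct bank_state *b, bool corrected_error_seen)
{
	if (!b->in_storm)
		return;
	if (corrected_error_seen) {
		b->quiet_polls = 0; /* storm still active, restart the count */
		return;
	}
	if (++b->quiet_polls >= STORM_END_POLLS) {
		/* Storm ends: re-enable the interrupt. */
		b->in_storm = false;
		b->irq_enabled = true;
		b->irq_count = 0;
		/* in_poll_set is deliberately not modified here. */
	}
}
```

The last comment is the point of the backport: a bank that was in the polling set before the storm is still polled after it, so errors seen once the storm subsides continue to be logged.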
--
Thanks,
Avadhut Naik