Message-ID: <b384d621-6b2d-7aab-adbf-7045f23f4af9@linux.alibaba.com>
Date: Wed, 10 May 2023 10:17:18 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: Yazen Ghannam <yazen.ghannam@....com>, bp@...en8.de,
tony.luck@...el.com
Cc: tglx@...utronix.de, mingo@...hat.com, dave.hansen@...ux.intel.com,
x86@...nel.org, hpa@...or.com, baolin.wang@...ux.alibaba.com,
linux-edac@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86/mce/amd: init mce severity to handle deferred memory
failure
On 2023/5/9 22:25, Yazen Ghannam wrote:
> On 4/25/23 8:18 AM, Shuai Xue wrote:
>> When a deferred UE error is detected, e.g. by a background patrol scrubber, it
>> will be handled in the APIC interrupt handler amd_deferred_error_interrupt().
>> The handler will collect MCA banks, init the mce struct and process it by
>> notifying the registered MCE decode chain.
>>
>> The uc_decode_notifier, one of the notifiers on the MCE decode chain, will
>> process memory failure, but only for MCE_AO_SEVERITY and
>> MCE_DEFERRED_SEVERITY. However, the APIC interrupt handler does not init the
>> mce severity, and the uninitialized severity is 0 (MCE_NO_SEVERITY).
>>
>> To handle the deferred memory failure case, init mce severity when logging
>> MCA banks.
>>
>> Signed-off-by: Shuai Xue <xueshuai@...ux.alibaba.com>
>>
>
> Hi Shuai Xue,
>
> I think this patch is fair to do. But it won't have the intended effect
> in practice.
>
> The value in MCA_ADDR for DRAM ECC errors will be a memory controller
> "normalized address". This is not a system physical address that the OS
> can use to take action.
>
> The mce_usable_address() function needs to be updated to handle this.
> I'll send a patchset this week to do so. Afterwards, the
> uc_decode_notifier will not attempt to handle these errors.

From the experience of other platforms (e.g. ARM64 RAS and Intel MCA),
uc_decode_notifier should handle these errors and hard offline the corrupted
page. If the corrupted page is a free buddy page, we can isolate it and avoid
using it in the future.

In my test case, the error is detected by the patrol scrubber in the memory
controller. The scrubber may lack a system address space perspective and only
report a "normalized address". But we can translate the "normalized address"
to a system address through EDAC (umc_normaddr_to_sysaddr), right?

(I am not very familiar with AMD RAS, so please correct me if I am wrong.)
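
To make the two conditions discussed in this thread concrete, here is a minimal
userspace model of the filter (a sketch only, not kernel code). The names
model_severity, model_mce and model_uc_decode_would_handle are hypothetical and
exist only for illustration. The only numeric value taken from the thread is
MCE_NO_SEVERITY == 0; the other enum values are placeholders, and addr_usable
merely stands in for whatever mce_usable_address() would return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Simplified severity levels for this model. Only "no severity == 0" comes
 * from the thread; the remaining values are placeholders.
 */
enum model_severity {
	MODEL_MCE_NO_SEVERITY = 0,
	MODEL_MCE_DEFERRED_SEVERITY,
	MODEL_MCE_AO_SEVERITY,
};

/* Hypothetical stand-in for the fields of struct mce that matter here. */
struct model_mce {
	enum model_severity severity;
	bool addr_usable;	/* models the result of mce_usable_address() */
};

/* Returns true if the modeled notifier would go on to handle the error. */
static bool model_uc_decode_would_handle(const struct model_mce *m)
{
	assert(m != NULL);

	/* An address the OS cannot act on is skipped regardless of severity. */
	if (!m->addr_usable)
		return false;

	/* Only AO and deferred severities are processed as memory failures. */
	return m->severity == MODEL_MCE_AO_SEVERITY ||
	       m->severity == MODEL_MCE_DEFERRED_SEVERITY;
}
```

In this model, a zero-initialized record (severity left at 0) is skipped even
when its address is usable, which is the case the patch targets. A record whose
address is not usable is skipped whatever its severity, which is the case Yazen
describes for normalized addresses.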
>
> Thanks,
> Yazen
Thank you.
Best Regards,
Shuai