Message-ID: <34fa94f5-359f-f3e7-92ae-fcdc06ff19b8@amd.com>
Date: Wed, 10 May 2023 09:59:36 -0400
From: Yazen Ghannam <yazen.ghannam@....com>
To: Shuai Xue <xueshuai@...ux.alibaba.com>, bp@...en8.de,
tony.luck@...el.com
Cc: yazen.ghannam@....com, tglx@...utronix.de, mingo@...hat.com,
dave.hansen@...ux.intel.com, x86@...nel.org, hpa@...or.com,
baolin.wang@...ux.alibaba.com, linux-edac@...r.kernel.org,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] x86/mce/amd: init mce severity to handle deferred memory
failure
On 5/9/23 10:17 PM, Shuai Xue wrote:
>
>
> On 2023/5/9 22:25, Yazen Ghannam wrote:
>> On 4/25/23 8:18 AM, Shuai Xue wrote:
>>> When a deferred UE error is detected, e.g. by the background patrol scrubber, it
>>> will be handled in the APIC interrupt handler amd_deferred_error_interrupt().
>>> The handler will collect MCA banks, init the mce struct and process it by
>>> notifying the registered MCE decode chain.
>>>
>>> The uc_decode_notifier, one of the notifiers on the MCE decode chain, will
>>> process memory failure, but only for MCE_AO_SEVERITY and MCE_DEFERRED_SEVERITY.
>>> However, the APIC interrupt handler does not init the mce severity, so the
>>> uninitialized severity is 0 (MCE_NO_SEVERITY).
>>>
>>> To handle the deferred memory failure case, init mce severity when logging
>>> MCA banks.
>>>
>>> Signed-off-by: Shuai Xue <xueshuai@...ux.alibaba.com>
>>>
>>
>> Hi Shuai Xue,
>>
>> I think this patch is fair to do. But it won't have the intended effect
>> in practice.
>>
>> The value in MCA_ADDR for DRAM ECC errors will be a memory controller
>> "normalized address". This is not a system physical address that the OS
>> can use to take action.
>>
>> The mce_usable_address() function needs to be updated to handle this.
>> I'll send a patchset this week to do so. Afterwards, the
>> uc_decode_notifier will not attempt to handle these errors.
>
> From experience with other platforms (e.g. ARM64 RAS and Intel MCA),
> uc_decode_notifier should handle these errors and hard offline the corrupted
> page. If the corrupted page is a free buddy page, we can isolate it and avoid
> using the page in the future.
>
> In my test case, the error is detected by the patrol scrubber in the memory
> controller. The scrubber may lack a system address space perspective and only
> report a "normalized address". But we can decode the "normalized address" to a
> system address with EDAC (umc_normaddr_to_sysaddr), right?
>
> (I am not quite familiar with AMD RAS, please correct me if I am wrong)
>

Yes, that's correct.

The address translation requires some updates that are still in review.
Afterwards, we can investigate ways to use the translated address. It
may require some rework in the MCE notifier chain or, more simply,
calling memory_failure() from the EDAC module itself.
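
To make the two conditions discussed in this thread concrete, here is a rough
userspace sketch, not the kernel code. It models a notifier that only acts when
the severity is AO or DEFERRED and when the address is usable as a system
physical address. All sketch_* names, the addr_is_sysaddr flag, and the 4 KiB
page shift are assumptions made up for illustration; the real logic lives in
uc_decode_notifier and mce_usable_address().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the severity levels named in the thread.
 * Only the fact that the "no severity" value is 0 matters here. */
enum sketch_severity {
	SKETCH_NO_SEVERITY = 0,
	SKETCH_DEFERRED_SEVERITY,
	SKETCH_AO_SEVERITY,
};

/* Hypothetical record, loosely modeled on an MCE log entry. */
struct sketch_mce {
	enum sketch_severity severity;
	uint64_t addr;
	bool addr_is_sysaddr;	/* hypothetical: false for a normalized address */
};

#define SKETCH_PAGE_SHIFT 12	/* assumes 4 KiB pages */

/* Stand-in for the "is this address usable by the OS?" check. */
static bool sketch_usable_address(const struct sketch_mce *m)
{
	assert(m != NULL);
	return m->addr_is_sysaddr;
}

/* Returns true and stores the page frame number in *pfn only if both
 * gates pass; otherwise the record is left alone and *pfn is untouched. */
static bool sketch_should_offline(const struct sketch_mce *m, uint64_t *pfn)
{
	if (!sketch_usable_address(m))
		return false;

	if (m->severity != SKETCH_AO_SEVERITY &&
	    m->severity != SKETCH_DEFERRED_SEVERITY)
		return false;

	*pfn = m->addr >> SKETCH_PAGE_SHIFT;
	return true;
}
```

In this model, a record whose severity was never initialized fails the second
gate, and a record carrying a normalized address fails the first gate even
once the severity is set, so both issues have to be addressed before a page
would be offlined.
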
Thanks,
Yazen