Message-ID: <ba6eea97-116a-4678-7800-d24692c65cd6@amd.com>
Date: Fri, 27 Oct 2023 10:35:33 +0530
From: "M K, Muralidhara" <muralimk@....com>
To: Borislav Petkov <bp@...en8.de>,
Yazen Ghannam <yazen.ghannam@....com>
Cc: linux-edac@...r.kernel.org, linux-kernel@...r.kernel.org,
mchehab@...nel.org, Muralidhara M K <muralidhara.mk@....com>,
Avadhut Naik <Avadhut.Naik@....com>
Subject: Re: [PATCH v2 1/4] EDAC/mce_amd: Remove SMCA Extended Error code
descriptions
On 10/26/2023 7:10 PM, Borislav Petkov wrote:
> On Thu, Oct 26, 2023 at 09:05:51AM -0400, Yazen Ghannam wrote:
>> Post-processing is one of the features that Avadhut implemented.
>>
>> https://github.com/mchehab/rasdaemon/commit/932118b04a04104dfac6b8536419803f236e6118
>
Hi Yazen, thanks for pointing to this commit. Yes, I do remember it.
> Yes, now try to decode the error with rasdaemon this way, by supplying
> the fields.
>
> Then explain step-by-step what you've done in the commit message and in
> a documentation file in Documentation/ras/ so that people can find it
> and can actually do the decoding themselves.
>
> It needs to be absolutely easy to decode those errors. Not tell people:
> "go look for the error description in the PPR".
>
Yes, rasdaemon has an offline decoding option.
For example:
$ rasdaemon -p --status 0xdc2040000000011b --ipid 0x0000609600092f00 --smca
2023-10-26 23:51:34 -0500, Unified Memory Controller (bank=0), mca: DRAM
ECC error. Ext Err Code: 0 Memory Error 'mem-tx: generic read, tx:
generic, level: L3/generic', mci: Error_overflow CECC, Locn:
memory_channel=0,csrow=0, Error Msg: Corrected error, no action required.
The resulting error string is "mca: DRAM ECC error. Ext Err Code: 0".
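For reference, the fields behind that string can be read directly out of the
--status value. Below is a minimal Python sketch, not rasdaemon code:
decode_status is a hypothetical helper, and the bit positions (Val=63,
Overflow=62, UC=61, CECC=46, extended error code in bits 21:16) are my
assumptions from the SMCA MCA_STATUS layout, so please check them against
the PPR for the part in question.

```python
# Hypothetical helper, not part of rasdaemon. Bit positions are assumed
# from the AMD SMCA MCA_STATUS layout and should be verified in the PPR.
def decode_status(status: int) -> dict:
    return {
        "valid": bool((status >> 63) & 1),        # Val
        "overflow": bool((status >> 62) & 1),     # Overflow
        "uncorrected": bool((status >> 61) & 1),  # UC
        "cecc": bool((status >> 46) & 1),         # corrected ECC error
        "ext_err_code": (status >> 16) & 0x3F,    # extended error code
        "err_code": status & 0xFFFF,              # base MCA error code
    }


status_fields = decode_status(0xDC2040000000011B)
for name, value in status_fields.items():
    print(f"{name}: {value}")
```

Under those assumptions the overflow and CECC bits are set and the extended
error code is 0, which lines up with "Error_overflow CECC" and
"Ext Err Code: 0" in the rasdaemon output.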
We can also pass a particular family/model to decode against, e.g. for MI300A:
$ rasdaemon -p --status 0xdc2040000000011b --ipid 0x0000609600092f00 --smca --family 0x19 --model 0x90 --bank 19
2023-10-26 23:52:09 -0500, Unified Memory Controller (bank=19), mca:
DRAM On Die ECC error. Ext Err Code: 0 Memory Error 'mem-tx: generic
read, tx: generic, level: L3/generic', mci: Error_overflow CECC, Locn:
memory_die_id=1, Error Msg: Corrected error, no action required.
Here the error string becomes "mca: DRAM On Die ECC error. Ext Err Code: 0".
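Both runs report the bank as "Unified Memory Controller". As far as I
understand, with --smca that comes from the --ipid value. A minimal Python
sketch of the split, again not rasdaemon code: decode_ipid is a hypothetical
helper, and the field positions (HWID in bits 43:32, McaType in bits 63:48)
and the UMC hardware ID of 0x96 are assumptions to be checked against the PPR.

```python
# Hypothetical helper, not part of rasdaemon. Field positions are assumed
# from the AMD SMCA MCA_IPID layout and should be verified in the PPR.
ASSUMED_UMC_HWID = 0x96  # assumption: hardware ID of the UMC bank type


def decode_ipid(ipid: int) -> dict:
    return {
        "hwid": (ipid >> 32) & 0xFFF,       # hardware ID of the IP block
        "mca_type": (ipid >> 48) & 0xFFFF,  # bank type within that block
        "instance_id": ipid & 0xFFFFFFFF,   # low 32 bits
    }


ipid_fields = decode_ipid(0x0000609600092F00)
print({name: hex(value) for name, value in ipid_fields.items()})
print("looks like UMC:", ipid_fields["hwid"] == ASSUMED_UMC_HWID)
```

With this IPID the hardware ID comes out as 0x96 and the type as 0, which is
consistent with the UMC decoding shown in both outputs.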
Thanks for the input. I will add these steps to the commit message and to
the documentation as well.
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
>