Message-ID: <ba6eea97-116a-4678-7800-d24692c65cd6@amd.com>
Date:   Fri, 27 Oct 2023 10:35:33 +0530
From:   "M K, Muralidhara" <muralimk@....com>
To:     Borislav Petkov <bp@...en8.de>,
        Yazen Ghannam <yazen.ghannam@....com>
Cc:     linux-edac@...r.kernel.org, linux-kernel@...r.kernel.org,
        mchehab@...nel.org, Muralidhara M K <muralidhara.mk@....com>,
        Avadhut Naik <Avadhut.Naik@....com>
Subject: Re: [PATCH v2 1/4] EDAC/mce_amd: Remove SMCA Extended Error code
 descriptions



On 10/26/2023 7:10 PM, Borislav Petkov wrote:
> On Thu, Oct 26, 2023 at 09:05:51AM -0400, Yazen Ghannam wrote:
>> Post-processing is one of the features that Avadhut implemented.
>>
>> https://github.com/mchehab/rasdaemon/commit/932118b04a04104dfac6b8536419803f236e6118
> 

Hi Yazen, thanks for pointing to this commit. Yes, I do remember it.


> Yes, now try to decode the error with rasdaemon this way, by supplying
> the fields.
> 
> Then explain step-by-step what you've done in the commit message and in
> a documentation file in Documentation/ras/ so that people can find it
> and can actually do the decoding themselves.
> 
> It needs to be absolutely easy to decode those errors. Not tell people:
> "go look for the error description in the PPR".
> 
Yes, rasdaemon has an offline decoding option.

For example:
$ rasdaemon -p --status 0xdc2040000000011b --ipid 0x0000609600092f00 --smca
2023-10-26 23:51:34 -0500, Unified Memory Controller (bank=0), mca: DRAM 
ECC error. Ext Err Code: 0 Memory Error 'mem-tx: generic read, tx: 
generic, level: L3/generic', mci: Error_overflow CECC, Locn: 
memory_channel=0,csrow=0, Error Msg: Corrected error, no action required.

Note the error string in the output: "mca: DRAM ECC error. Ext Err Code: 0"
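
As a cross-check, the fields that feed this decoding can be pulled out of the
raw register values by hand. The following is only a minimal Python sketch,
not rasdaemon code: the function names are made up for illustration, and the
bit positions are an assumption based on the SMCA layout as I understand it
(extended error code in MCA_STATUS[21:16], CECC in MCA_STATUS[46], HWID in
MCA_IPID[43:32], McaType in MCA_IPID[63:48]). Please verify them against the
PPR for the part in question.

```python
# Illustrative only: helper names are hypothetical, not part of rasdaemon.
# Bit positions are assumptions taken from the SMCA register layout.

def bits(value, hi, lo):
    """Return the inclusive bit field value[hi:lo] as an integer."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_status(status):
    """Extract a few MCA_STATUS fields from a raw 64-bit value."""
    return {
        "valid": bits(status, 63, 63),
        "overflow": bits(status, 62, 62),
        "uncorrected": bits(status, 61, 61),
        "cecc": bits(status, 46, 46),
        "ext_err_code": bits(status, 21, 16),
        "err_code": bits(status, 15, 0),
    }


def decode_ipid(ipid):
    """Extract the bank type identifiers from a raw MCA_IPID value."""
    return {
        "mca_type": bits(ipid, 63, 48),
        "hwid": bits(ipid, 43, 32),
    }


if __name__ == "__main__":
    # Same values as passed to --status and --ipid above.
    for name, value in decode_status(0xdc2040000000011b).items():
        print(f"{name}: {value:#x}")
    for name, value in decode_ipid(0x0000609600092f00).items():
        print(f"{name}: {value:#x}")
```

For the values above, this gives an extended error code of 0 with the
overflow and CECC bits set, which lines up with "Ext Err Code: 0" and
"Error_overflow CECC" in the rasdaemon output, and a HWID of 0x96.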


We can also pass a particular family/model to decode against, e.g. for MI300A:

$ rasdaemon -p --status 0xdc2040000000011b --ipid 0x0000609600092f00 \
    --smca --family 0x19 --model 0x90 --bank 19
2023-10-26 23:52:09 -0500, Unified Memory Controller (bank=19), mca: 
DRAM On Die ECC error. Ext Err Code: 0 Memory Error 'mem-tx: generic 
read, tx: generic, level: L3/generic', mci: Error_overflow CECC, Locn: 
memory_die_id=1, Error Msg: Corrected error, no action required.

Here the error string changes to: "mca: DRAM On Die ECC error. Ext Err Code: 0"

Thanks for the input. I will add these steps to the commit message and to a
file under Documentation/ras/ as well.


> Thx.
> 
> --
> Regards/Gruss,
>      Boris.
> 
> https://people.kernel.org/tglx/notes-about-netiquette
> 
