Message-ID: <b3b21eaa-226f-e78f-14e3-09e2e02e38d6@amd.com>
Date: Thu, 26 Oct 2023 15:12:22 +0530
From: "M K, Muralidhara" <muralimk@....com>
To: Borislav Petkov <bp@...en8.de>
Cc: linux-edac@...r.kernel.org, linux-kernel@...r.kernel.org,
mchehab@...nel.org, Muralidhara M K <muralidhara.mk@....com>,
Yazen Ghannam <yazen.ghannam@....com>
Subject: Re: [PATCH v2 1/4] EDAC/mce_amd: Remove SMCA Extended Error code
descriptions
Hi Boris,
On 10/26/2023 12:38 AM, Borislav Petkov wrote:
> On Wed, Oct 25, 2023 at 05:14:52AM +0000, Muralidhara M K wrote:
>> The SMCA error decoding already exists in rasdaemon and future bank decoding
>> is supported from below patches merged in rasdaemon.
>> https://github.com/mchehab/rasdaemon/commit/1f74a59ee33b7448b00d7ba13d5ecd4918b9853c rasdaemon: Add new MA_LLC, USR_DP, and USR_CP bank types
>> https://github.com/mchehab/rasdaemon/commit/2d15882a0cbfce0b905039bebc811ac8311cd739 rasdaemon: Handle reassigned bit definitions for UMC bank
>>
>
> I'm still missing here the exact steps a user needs to do in order to
> decode such an error.
>
> Please inject an error, catch the error message and show me how one is
> supposed to decode it with rasdaemon in case the daemon is not running
> while the error happens or the error is fatal and the machine doesn't
> even get to run userspace.
>
> If that is not possible with rasdaemon yet, then this patch should not
> remove the error descriptions but limit them only to the families for
> which they're valid.
>
> Bottom line is, I don't want to have the situation mcelog is in where
> decoding errors with it is a total disaster.
>
> IOW, I'd like error decoding on AMD to always work and be trivially easy
> to do.
>
I injected an error; the dmesg log is below:
[ 3991.560180] mce: [Hardware Error]: Machine check events logged
[ 3991.560195] [Hardware Error]: Corrected error, no action required.
[ 3991.567119] [Hardware Error]: CPU:2 (19:90:0) MC25_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|CECC|-|-|-]: 0xdc2040000000011b
[ 3991.579205] [Hardware Error]: Error Addr: 0x0000000000000040
[ 3991.585546] [Hardware Error]: PPIN: 0xabcdef0000000000
[ 3991.591302] [Hardware Error]: IPID: 0x0000009600792f00, Syndrome: 0x000000000a000000
[ 3991.599977] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0
[ 3991.599985] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
In the log above, "Ext. Error Code: 0" shows that the kernel now prints
only the numeric extended error code, because this patch removes the
error strings. The user can look up the error code in the PPR to see
what it means, or use rasdaemon, which prints the corresponding error
string for that error code.
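
For reference, the raw MC25_STATUS value from the log above can also be
decoded by hand. The following is only a minimal sketch, assuming the
usual MCA_STATUS bit layout and the SMCA extended error code in bits
21:16; the function and table names are hypothetical and are not taken
from the kernel or from rasdaemon.

```python
# Subset of MCA_STATUS flag bits (assumed layout; check the PPR for the part).
STATUS_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60, "MiscV": 59, "AddrV": 58,
    "PCC": 57, "TCC": 55, "SyndV": 53, "CECC": 46, "UECC": 45,
    "Deferred": 44, "Poison": 43,
}

# Field names for a memory error code of the form 0000 0001 RRRR TTLL.
RRRR = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]
TT = ["INSN", "DATA", "GEN", "RESV"]
LL = ["RESV", "L1", "L2", "L3/GEN"]


def decode_status(status):
    """Split a 64-bit MCA_STATUS value into flags and error code fields."""
    flags = [name for name, bit in STATUS_BITS.items() if (status >> bit) & 1]
    err = status & 0xFFFF
    result = {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,  # SMCA: bits 21:16
        "error_code": err,
    }
    if (err & 0xFF00) == 0x0100:  # memory error format
        r4 = (err >> 4) & 0xF
        result["mem_tx"] = RRRR[r4] if r4 < len(RRRR) else "unknown"
        result["tx"] = TT[(err >> 2) & 0x3]
        result["level"] = LL[err & 0x3]
    return result


if __name__ == "__main__":
    for key, value in decode_status(0xDC2040000000011B).items():
        print(key, value)
```

For the status value in the log this gives the same flags the kernel
printed (Over, MiscV, AddrV, SyndV, CECC, with UC clear), extended error
code 0, and the "L3/GEN", "GEN", "RD" fields of the last dmesg line.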
Output from running rasdaemon:
rasdaemon: Listening to events for cpus 0 to 191
<...>-1420 [002] .... 0.000399 mce_record 2023-10-26
04:28:37 -0500 Unified Memory Controller (bank=25), status=
dc2040000000011b, Corrected error, no action required.,
mci=Error_overflow CECC, mca=DRAM On Die ECC error. Ext Err Code: 0
Memory Error 'mem-tx: generic read, tx: generic, level: L3/generic',
memory_die_id=0, cpu_type= AMD Scalable MCA, cpu= 2, socketid= 0, misc=
d01a000201000000, addr= 40, synd= a000000, ipid= 9600792f00,
mcgstatus=0, mcgcap= 140, apicid= 4
In the rasdaemon output we can see "DRAM On Die ECC error", which is the
string for Ext Err Code: 0. So the error strings are maintained in
rasdaemon.
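
Conceptually, the lookup that userspace does here is a table keyed by
bank type and extended error code. A rough sketch of that idea follows;
the table and function names are hypothetical, this is not rasdaemon's
actual code, and only the one string confirmed by the output above is
filled in. Any other code would have to be taken from the PPR.

```python
# Hypothetical table: (bank name as printed in dmesg, ext. error code) -> text.
# Only the entry seen in the rasdaemon output above is listed.
EXT_ERR_DESC = {
    ("Unified Memory Controller", 0): "DRAM On Die ECC error",
}


def describe(bank_name, status):
    """Return a description for the extended error code in MCA_STATUS."""
    xec = (status >> 16) & 0x3F  # assumed SMCA layout: bits 21:16
    desc = EXT_ERR_DESC.get((bank_name, xec))
    if desc is None:
        return f"Ext. Error Code: {xec} (not in table, see the PPR)"
    return f"{desc}. Ext Err Code: {xec}"


if __name__ == "__main__":
    print(describe("Unified Memory Controller", 0xDC2040000000011B))
```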
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
>