Message-ID: <a5ab1099-fd08-c708-5532-21dc2a622695@molgen.mpg.de>
Date:   Fri, 14 Apr 2023 11:26:27 +0200
From:   Paul Menzel <pmenzel@...gen.mpg.de>
To:     Borislav Petkov <bp@...en8.de>
Cc:     Thomas Gleixner <tglx@...utronix.de>,
        Ingo Molnar <mingo@...hat.com>,
        Dave Hansen <dave.hansen@...ux.intel.com>, x86@...nel.org,
        LKML <linux-kernel@...r.kernel.org>,
        Yazen Ghannam <yazen.ghannam@....com>
Subject: Re: AMD EPYC 25 (19h): Hardware Error: Machine Check: 0 Bank 17:
 d42040000000011b

Dear Borislav,


Thank you for your quick and helpful reply.

On 12.04.23 at 18:32, Borislav Petkov wrote:
> On Wed, Apr 12, 2023 at 05:11:26PM +0200, Paul Menzel wrote:
>> On a Dell PowerEdge R7525 with AMD EPYC 7763 64-Core Processor, Linux
>> 5.15.94 logs the machine check exceptions (MCE) below:
>>
>> ```
>> [5154053.127240] mce: [Hardware Error]: Machine check events logged
>> [5154053.133711] mce: [Hardware Error]: CPU 3: Machine Check: 0 Bank 17: d42040000000011b
>> [5154053.141948] mce: [Hardware Error]: TSC 0 ADDR b3cbdbbc0 PPIN 2b615bef7f48098 SYND 6bd210000a801002 IPID 9600650f00
> 
> Build the latest kernel with CONFIG_X86_MCE_INJECT and
> CONFIG_EDAC_DECODE_MCE enabled and CONFIG_RAS_CEC *disabled*. Then boot
> it on that machine and do the following.
> 
> The files are in debugfs:
> 
> /sys/kernel/debug/mce-inject/
> ├── addr
> ├── bank
> ├── cpu
> ├── flags
> ├── ipid
> ├── misc
> ├── README
> ├── status
> └── synd
> 
> so you go and do
> 
> echo 0xd42040000000011b > status
> echo 0xb3cbdbbc0 > addr
> echo 3 > cpu
> echo "sw" > flags
> echo 0x6bd210000a801002 > synd
> echo 0x9600650f00 > ipid
> echo 17 > bank
> 
> Remember to keep the bank write last because this one injects the error.
> 
> It should dump the decoded error in dmesg.
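
The injection sequence quoted above can be wrapped in a small shell function. This is only a sketch: the function name `inject_mce` and its optional directory argument are assumptions for illustration and not part of the mce-inject interface. The directory argument makes it possible to dry-run against a scratch directory first. The values are the ones from the logged event, and the `bank` write is kept last because that is the write that triggers the injection.

```shell
#!/bin/sh
# Sketch: replay the logged MCE through the mce-inject debugfs files.
# inject_mce and its directory argument are hypothetical helpers;
# only the file names and values come from the thread.
inject_mce() {
    dir=${1:-/sys/kernel/debug/mce-inject}

    echo 0xd42040000000011b > "$dir/status" || return 1
    echo 0xb3cbdbbc0        > "$dir/addr"   || return 1
    echo 3                  > "$dir/cpu"    || return 1
    echo "sw"               > "$dir/flags"  || return 1
    echo 0x6bd210000a801002 > "$dir/synd"   || return 1
    echo 0x9600650f00       > "$dir/ipid"   || return 1

    # Keep this write last: writing the bank number injects the error.
    echo 17                 > "$dir/bank"   || return 1
}
```

Run as root on a kernel built as described, calling `inject_mce` with no argument writes to the real debugfs files, and the decoded error should then show up in `dmesg`.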

Yes, that worked:

```
[  436.584741] mce: [Hardware Error]: Machine check events logged
[  436.590638] [Hardware Error]: Corrected error, no action required.
[  436.596869] [Hardware Error]: CPU:3 (19:1:1) MC17_STATUS[Over|CE|-|AddrV|-|SyndV|CECC|-|-|-]: 0xd42040000000011b
[  436.607083] [Hardware Error]: Error Addr: 0x0000000b3cbdbbc0
[  436.612763] [Hardware Error]: IPID: 0x0000009600650f00, Syndrome: 0x6bd210000a801002
[  436.620569] [Hardware Error]: Unified Memory Controller Ext. Error Code: 0, DRAM ECC error.
[  436.628942] [Hardware Error]: cache level: L3/GEN, tx: GEN, mem-tx: RD
```
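
The flags in the `MC17_STATUS[...]` line can be cross-checked by hand from the raw status value. The following is a rough sketch, assuming the usual x86 MCA_STATUS bit layout (Val 63, Over 62, UC 61, En 60, MiscV 59, AddrV 58, PCC 57, SyndV 53, CECC 46, UECC 45, Deferred 44, Poison 43, extended error code in bits 21:16, error code in bits 15:0). The helper name `decode_mca_status` is made up for illustration. The value is split into two 32-bit halves so the arithmetic does not depend on how the shell handles 64-bit constants.

```shell
#!/bin/sh
# Sketch: decode selected MCA_STATUS bits from a 64-bit hex value.
# decode_mca_status is a hypothetical helper; the bit positions are
# assumed from the x86 MCA_STATUS layout, not taken from this thread.
decode_mca_status() {
    hex=${1#0x}
    # Left-pad to 16 hex digits, then split into high and low 32 bits.
    hex=$(printf '%16s' "$hex" | tr ' ' '0')
    hi_hex=$(printf '%s' "$hex" | cut -c1-8)
    lo_hex=$(printf '%s' "$hex" | cut -c9-16)
    hi=$((0x$hi_hex))
    lo=$((0x$lo_hex))

    out=""
    # Each entry is <bit within the high half>:<flag name>,
    # i.e. the MCA_STATUS bit number minus 32.
    for entry in 31:Val 30:Over 29:UC 28:En 27:MiscV 26:AddrV 25:PCC \
                 21:SyndV 14:CECC 13:UECC 12:Deferred 11:Poison; do
        bit=${entry%%:*}
        name=${entry#*:}
        if [ $(( ($hi >> $bit) & 1 )) -eq 1 ]; then
            out="$out${out:+|}$name"
        fi
    done

    # Extended error code: bits 21:16; error code: bits 15:0.
    printf '%s ext=%d code=0x%04x\n' "$out" \
        $(( ($lo >> 16) & 0x3f )) $(( $lo & 0xffff ))
}
```

For the value in this report, `decode_mca_status 0xd42040000000011b` prints `Val|Over|En|AddrV|SyndV|CECC ext=0 code=0x011b`. That agrees with the decoded line above: UC is clear (hence "CE"), and Over, AddrV, SyndV and CECC are set.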

It says “no action required”, but of 14 identical servers running the 
same workload, this is the only one that has shown this error three times.

Maybe the DIMM at bank 17 should just be replaced.

[…]


Kind regards,

Paul
