Message-ID: <0a57d695-d671-4382-aa53-6517b1caf4a7@linux.alibaba.com>
Date: Tue, 26 Nov 2024 15:04:53 +0800
From: Shuai Xue <xueshuai@...ux.alibaba.com>
To: Borislav Petkov <bp@...en8.de>
Cc: keescook@...omium.org, tony.luck@...el.com, gpiccoli@...lia.com,
 rafael@...nel.org, lenb@...nel.org, james.morse@....com, tglx@...utronix.de,
 mingo@...hat.com, dave.hansen@...ux.intel.com, x86@...nel.org,
 hpa@...or.com, ardb@...nel.org, robert.moore@...el.com,
 linux-hardening@...r.kernel.org, linux-acpi@...r.kernel.org,
 linux-kernel@...r.kernel.org, linux-edac@...r.kernel.org,
 linux-efi@...r.kernel.org, acpica-devel@...ts.linuxfoundation.org,
 baolin.wang@...ux.alibaba.com
Subject: Re: [RFC PATCH v2 0/9] Use ERST for persistent storage of MCE and
 APEI errors



On 2023/10/26 21:32, Borislav Petkov wrote:

Hi, Borislav,

Sorry for the late reply.

> On Sat, Oct 07, 2023 at 03:15:45PM +0800, Shuai Xue wrote:
>> So, IMHO, it's better to add a way to retrieve MCE records through switching
>> to the new generation rasdaemon solution.
> 
> rasdaemon already collects errors and even saves them in a database of
> sorts. No kernel changes needed.

I could not figure out how rasdaemon *already* collects these errors.

Both rasdaemon and mcelog are designed to collect errors delivered through the
x86_mce_decoder_chain notifier. In MCE context, however, the record is only
queued via mce_irq_work on the current CPU, and the notifier callbacks do not
get a chance to run before the system panics. As a result, neither rasdaemon
nor mcelog can capture errors at this critical point.
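
To illustrate where in-kernel consumers hang off that chain, here is an
untested sketch of a module registering on the decode chain (not from this
patchset; the callback body is purely for demonstration). The point is that
the callback runs from process context after mce_irq_work has fired, so for
a fatal MCE it is never reached:

   /* Untested demo module: hook the x86 MCE decode chain. */
   #include <linux/module.h>
   #include <linux/notifier.h>
   #include <asm/mce.h>

   static int demo_mce_notify(struct notifier_block *nb, unsigned long val,
                              void *data)
   {
           struct mce *m = data;

           /*
            * Runs from process context once mce_irq_work has been processed.
            * On a fatal MCE the machine panics before that happens, so this
            * callback never sees exactly the records discussed here.
            */
           pr_info("demo: MCE bank %d status 0x%llx addr 0x%llx\n",
                   m->bank, m->status, m->addr);
           return NOTIFY_DONE;
   }

   static struct notifier_block demo_mce_nb = {
           .notifier_call = demo_mce_notify,
           .priority      = MCE_PRIO_LOWEST,
   };

   static int __init demo_init(void)
   {
           mce_register_decode_chain(&demo_mce_nb);
           return 0;
   }

   static void __exit demo_exit(void)
   {
           mce_unregister_decode_chain(&demo_mce_nb);
   }

   module_init(demo_init);
   module_exit(demo_exit);
   MODULE_LICENSE("GPL");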

Upon inspection, rasdaemon fails to record any errors, as evidenced by the
output of `ras-mc-ctl --errors`, which shows no memory or PCIe AER errors,
among others.

   # run after a new reboot caused by a fatal memory error
   # ras-mc-ctl --errors
   No Memory errors.
   
   No PCIe AER errors.
   
   No Extlog errors.
   
   No devlink errors.
   
   No disk errors.
   
   No Memory failure errors.
   
   No MCE errors.

Conversely, mcelog is able to retrieve and log detailed MCE error records
post-reboot, providing valuable insights into hardware error events, even in
the case of fatal errors.

   # journalctl -u mcelog --no-pager
   -- Reboot --
   systemd[1]: Started Machine Check Exception Logging Daemon.
   mcelog[2783]: Running trigger `dimm-error-trigger' (reporter: memdb)
   mcelog[2783]: Hardware event. This is not a software error.
   mcelog[2783]: MCE 0
   mcelog[2783]: not finished?
   mcelog[2783]: CPU 0 BANK 16 TSC 2307d829a77
   mcelog[2783]: RIP !INEXACT! 10:ffffffffa9588d6b
   mcelog[2783]: MISC a0001201618f886 ADDR 1715d9880
   mcelog[2783]: TIME 1732588816 Tue Nov 26 10:40:16 2024
   mcelog[2783]: MCG status:RIPV MCIP
   mcelog[2783]: MCi status:
   mcelog[2783]: Uncorrected error
   mcelog[2783]: Error enabled
   mcelog[2783]: MCi_MISC register valid
   mcelog[2783]: MCi_ADDR register valid
   mcelog[2783]: Processor context corrupt
   mcelog[2783]: MCA: MEMORY CONTROLLER RD_CHANNEL1_ERR
   mcelog[2783]: Transaction: Memory read error
   mcelog[2783]: MemCtrl: Uncorrected read error
   mcelog[2783]: bank: 0x2 bankgroup: 0x1 row: 0x402c3 column: 0x1f0
   mcelog[2783]: rank: 0x2 subrank: 0x0
   mcelog[2783]: ecc mode: SDDC
   mcelog[2783]: STATUS be00000200a00091 MCGSTATUS 5
   mcelog[2783]: MCGCAP f000c15 APICID 0 SOCKETID 0
   mcelog[2783]: PPIN 74f8640abf43c587
   mcelog[2783]: MICROCODE 2b000571
   mcelog[2783]: CPUID Vendor Intel Family 6 Model 143 Step 4

This patchset is based on the fact that, if we switch to rasdaemon, we cannot
collect the MCE records which are written to persistent storage. Please
correct me if I missed anything.
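
On the retrieval side, the ERST driver already exposes in-kernel helpers for
walking persisted records (erst_get_record_id_begin()/_next()/_end() and
erst_read() in include/acpi/apei.h). Below is a rough, untested sketch of
draining those records after boot; the buffer size and the module scaffolding
are my own assumptions, and this is not code from the patchset:

   /* Untested sketch: walk the records currently stored in ERST. */
   #include <linux/module.h>
   #include <linux/slab.h>
   #include <linux/cper.h>
   #include <acpi/apei.h>

   #define DEMO_BUF_SIZE 4096      /* assumed big enough for one record */

   static int __init erst_walk_init(void)
   {
           struct cper_record_header *rcd;
           u64 record_id;
           int pos, rc;

           rcd = kmalloc(DEMO_BUF_SIZE, GFP_KERNEL);
           if (!rcd)
                   return -ENOMEM;

           rc = erst_get_record_id_begin(&pos);
           if (rc)
                   goto out;

           while (!erst_get_record_id_next(&pos, &record_id) &&
                  record_id != APEI_ERST_INVALID_RECORD_ID) {
                   ssize_t len = erst_read(record_id, rcd, DEMO_BUF_SIZE);

                   /* Only trust the buffer if the whole record fit. */
                   if (len > 0 && len <= DEMO_BUF_SIZE)
                           pr_info("ERST record %llu: %zd bytes, creator %pUl\n",
                                   record_id, len, &rcd->creator_id);
           }
           erst_get_record_id_end();
   out:
           kfree(rcd);
           return 0;
   }

   module_init(erst_walk_init);
   MODULE_LICENSE("GPL");

So the plumbing to read records back already exists; what is missing is
feeding them to the tracepoints after boot so that rasdaemon can pick them up.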

> 
>> Sorry for the poor cover letter. I hope the following response can clarify
>> the matter.
>>
>> Q1: What is the exact problem?
>>
>> Traditionally, fatal hardware errors will cause Linux print error log to
>> console, e.g. print_mce() or __ghes_print_estatus(), then reboot. With
>> Linux, the primary method for obtaining debugging information of a serious
>> error or fault is via the kdump mechanism.
> 
> Not necessarily - see above.
> 
>> In the public cloud scenario, multiple virtual machines run on a
>> single physical server, and if that server experiences a failure, it can
>> potentially impact multiple tenants. It is crucial for us to thoroughly
>> analyze the root causes of each instance failure in order to:
>>
>> - Provide customers with a detailed explanation of the outage to reassure them.
>> - Collect the characteristics of the failures, such as ECC syndrome, to enable fault prediction.
>> - Explore potential solutions to prevent widespread outages.
> 
> Huh, are you talking about providing customers with error information
> from the *underlying* physical machine which runs the cloud VMs? That
> sounds suspicious, to say the least.
> 
> AFAICT, all you can tell the VM owner is: yah, the hw had an
> uncorrectable error in its memory and crashed. Is that the use case?

Yes. The MCE record is an important piece of evidence for digging out the root
cause of every panic in production and for avoiding potentially widespread
outages, so we want to collect as many error logs as possible.

> 
> To be able to tell the VM owners why it crashed?
> 
>> In short, it is necessary to serialize hardware error information available
>> for post-mortem debugging.
>>
>> Q2: What exactly I wanna do:
>>
>> The MCE handler, do_machine_check(), saves the MCE record to persistent
>> storage and it is retrieved by mcelog. Mcelog has been deprecated when
>> kernel 4.12 released in 2017, and the help of the configuration option
>> CONFIG_X86_MCELOG_LEGACY suggest to consider switching to the new
>> generation rasdaemon solution. The GHES handler does not support APEI error
>> record now.
> 
> I think you're confusing things: MCEs do get reported to userspace
> through the trace_mc_record tracepoint and rasdaemon opens it and reads
> error info from there. And then writes it out to its db. So that works
> now.

For recoverable errors, MCEs are recorded by rasdaemon through the
trace_mce_record tracepoint, but not for fatal errors. See my experiment above.
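
For reference, here is a minimal userspace sketch of that path, a naive
stand-in for what rasdaemon does (I assume tracefs is mounted at
/sys/kernel/tracing): enable the mce_record tracepoint and read the trace
pipe. For a fatal error nothing ever shows up here, because the machine
panics before the tracepoint fires:

   /* Naive tracepoint reader, roughly the path rasdaemon relies on. */
   #include <stdio.h>

   int main(void)
   {
           FILE *f;
           char line[4096];

           /* Enable the mce/mce_record event (root required). */
           f = fopen("/sys/kernel/tracing/events/mce/mce_record/enable", "w");
           if (!f) {
                   perror("enable mce_record");
                   return 1;
           }
           fputs("1\n", f);
           fclose(f);

           /* Block on the trace pipe and dump whatever records arrive. */
           f = fopen("/sys/kernel/tracing/trace_pipe", "r");
           if (!f) {
                   perror("trace_pipe");
                   return 1;
           }
           while (fgets(line, sizeof(line), f))
                   fputs(line, stdout);
           fclose(f);
           return 0;
   }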

> 
> GHES is something different: it is a fw glue around error reporting so
> that you don't have to develop a reporting driver for every platform but
> you can use a single one - only the fw glue needs to be added.
> 
> The problem with GHES is that it is notoriously buggy and currently
> it loads on a single platform only on x86.

As far as I know, GHES is widely used on the ARM platform, where it is the
primary method for delivering error records from firmware to the OS.

> 
> ARM are doing something in that area - you're better off talking to
> James Morse about it. And he's on Cc.

Thanks.

> 
>> To serialize hardware error information available for post-mortem
>> debugging:
>> - add support to save APEI error record into flash via ERST before go panic,
>> - add support to retrieve MCE or APEI error record from the flash and emit
>> the related tracepoint after system boot successful again so that rasdaemon
>> can collect them
> 
> Now that is yet another thing: you want to save error records into
> firmware. First of all, you don't really need it if you do kdump as
> explained above.
> 
> Then, that thing has its own troubles: it is buggy like every firmware
> is and it can brick the machine.
> 
> I'm not saying it is not useful - there are some use cases for it which
> are being worked on but if all you wanna do is dump MCEs to rasdaemon,
> that works even now.
> 
> But then you have an ARM patch there and I'm confused because MCEs are
> x86 thing - ARM has different stuff.
> 
> So I think you need to elaborate more here.


Yes, I may need to split this patchset into two parts.

> 
> Thx.
> 


Thanks for the valuable comments.

Best Regards,
Shuai
