linux-kernel - Re: [PATCH v2 0/2] Update mce

Message-ID: <5de5637f-6572-4817-aab5-af60fc1c81bf@amd.com>
Date: Thu, 25 Jan 2024 14:27:16 -0600
From: "Naik, Avadhut" <avadnaik@....com>
To: "Luck, Tony" <tony.luck@...el.com>, Borislav Petkov <bp@...en8.de>
Cc: "linux-trace-kernel@...r.kernel.org"
 <linux-trace-kernel@...r.kernel.org>,
 "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
 "rostedt@...dmis.org" <rostedt@...dmis.org>, "x86@...nel.org"
 <x86@...nel.org>, "linux-kernel@...r.kernel.org"
 <linux-kernel@...r.kernel.org>, "yazen.ghannam@....com"
 <yazen.ghannam@....com>, Avadhut Naik <avadhut.naik@....com>
Subject: Re: [PATCH v2 0/2] Update mce_record tracepoint

Hi,

On 1/25/2024 1:19 PM, Luck, Tony wrote:
>>> The first patch adds PPIN (Protected Processor Inventory Number) field to
>>> the tracepoint.
>>>
>>> The second patch adds the microcode field (Microcode Revision) to the
>>> tracepoint.
>>
>> This is a lot of static information to add to *every* MCE.
> 
> 8 bytes for PPIN, 4 more for microcode.
> 
> Number of recoverable machine checks per system .... I hope the monthly rate should
> be countable on my fingers. If a system is getting more than that, then people should
> be looking at fixing the underlying problem.
> 
> Corrected errors are much more common, though Linux takes action to limit the
> rate when storms occur. So maybe hundreds or small numbers of thousands of
> error trace records? Increase in trace buffer consumption still measured in Kbytes
> not Mbytes. Server systems that do machine check reporting now start at tens of
> GBytes memory.
> 
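As a rough check of the arithmetic above, here is a minimal sketch that multiplies the quoted field sizes (8 bytes for PPIN, 4 bytes for microcode) by a few record counts. The record counts are illustrative assumptions, not measurements, and the helper name extra_trace_bytes is hypothetical. Any padding or alignment in the real trace record layout is ignored, so treat the result as an estimate only.

```python
# Back-of-the-envelope estimate of the extra trace buffer space consumed by
# the two proposed fields. Field sizes come from the thread; the record counts
# below are assumptions chosen only for illustration.
PPIN_BYTES = 8        # PPIN field size quoted in the thread
MICROCODE_BYTES = 4   # microcode revision field size quoted in the thread


def extra_trace_bytes(num_records: int) -> int:
    """Return the additional bytes used by num_records MCE trace records."""
    return num_records * (PPIN_BYTES + MICROCODE_BYTES)


if __name__ == "__main__":
    # Hypothetical counts: "hundreds or small numbers of thousands" of records.
    for records in (100, 1_000, 5_000):
        extra = extra_trace_bytes(records)
        print(f"{records:>5} records -> {extra} bytes ({extra / 1024:.1f} KiB)")
```

Even at several thousand records the added cost stays in the tens of kilobytes, which is consistent with the "Kbytes not Mbytes" estimate.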
>> And where does it end? Stick full dmesg in the tracepoint too?
> 
> Seems like overkill.
> 
>> What is the real-life use case here?
> 
> Systems using rasdaemon to track errors will be able to track both of these
> (I assume that Naik has plans to update rasdaemon to capture and save these
> new fields).
> 
Yes, I do intend to submit a pull request to rasdaemon to parse and log these
new fields.

> PPIN is useful when talking to the CPU vendor about patterns of similar errors
> seen across a cluster.
> 
> MICROCODE - gives a fast path to root cause problems that have already
> been fixed in a microcode update.
> 
> -Tony

-- 
Thanks,
Avadhut Naik