Message-ID: <5de5637f-6572-4817-aab5-af60fc1c81bf@amd.com>
Date: Thu, 25 Jan 2024 14:27:16 -0600
From: "Naik, Avadhut" <avadnaik@....com>
To: "Luck, Tony" <tony.luck@...el.com>, Borislav Petkov <bp@...en8.de>
Cc: "linux-trace-kernel@...r.kernel.org"
<linux-trace-kernel@...r.kernel.org>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"rostedt@...dmis.org" <rostedt@...dmis.org>, "x86@...nel.org"
<x86@...nel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "yazen.ghannam@....com"
<yazen.ghannam@....com>, Avadhut Naik <avadhut.naik@....com>
Subject: Re: [PATCH v2 0/2] Update mce_record tracepoint
Hi,
On 1/25/2024 1:19 PM, Luck, Tony wrote:
>>> The first patch adds PPIN (Protected Processor Inventory Number) field to
>>> the tracepoint.
>>>
>>> The second patch adds the microcode field (Microcode Revision) to the
>>> tracepoint.
>>
>> This is a lot of static information to add to *every* MCE.
>
> 8 bytes for PPIN, 4 more for microcode.
>
> Number of recoverable machine checks per system .... I hope the monthly rate should
> be countable on my fingers. If a system is getting more than that, then people should
> be looking at fixing the underlying problem.
>
> Corrected errors are much more common. Though Linux takes action to limit the
> rate when storms occur. So maybe hundreds or small numbers of thousands of
> error trace records? Increase in trace buffer consumption still measured in Kbytes
> not Mbytes. Server systems that do machine check reporting now start at tens of
> GBytes memory.
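
To put rough numbers on the estimate above, here is a back-of-the-envelope
sketch. The field sizes (8 bytes for PPIN, 4 for microcode) are taken from
this thread; the record counts are illustrative assumptions standing in for
"hundreds or small numbers of thousands", and the helper name extra_bytes is
made up for this example. Ring buffer padding and per-event headers are
ignored.

```python
# Back-of-the-envelope estimate of the extra trace buffer space used by the
# two new mce_record fields. Field sizes come from the thread; the record
# counts below are illustrative assumptions, not measurements.

PPIN_BYTES = 8        # PPIN field size quoted in the thread
MICROCODE_BYTES = 4   # microcode revision field size quoted in the thread


def extra_bytes(records: int) -> int:
    """Extra payload bytes for `records` trace entries, ignoring padding."""
    return records * (PPIN_BYTES + MICROCODE_BYTES)


if __name__ == "__main__":
    for records in (100, 1_000, 5_000):  # hypothetical record counts
        total = extra_bytes(records)
        print(f"{records:>5} records: {total:>6} bytes ({total / 1024:.1f} KiB)")
```

Under these assumptions, even several thousand records add tens of Kbytes,
which is consistent with the "Kbytes not Mbytes" estimate.
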
>
>> And where does it end? Stick full dmesg in the tracepoint too?
>
> Seems like overkill.
>
>> What is the real-life use case here?
>
> Systems using rasdaemon to track errors will be able to track both of these
> (I assume that Naik has plans to update rasdaemon to capture and save these
> new fields).
>
Yes, I do intend to submit a pull request to rasdaemon to parse and log these
new fields.
> PPIN is useful when talking to the CPU vendor about patterns of similar errors
> seen across a cluster.
>
> MICROCODE - gives a fast path to root cause problems that have already
> been fixed in a microcode update.
>
> -Tony
--
Thanks,
Avadhut Naik