linux-kernel - Re: [External] [PATCH] x86/mce: count the number of occurrences of each MCE severity

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-Id: <20240624044839.87035-1-qirui.001@bytedance.com>
Date: Mon, 24 Jun 2024 12:48:39 +0800
From: Rui Qi <qirui.001@...edance.com>
To: tony.luck@...el.com
Cc: bp@...en8.de,
	dave.hansen@...ux.intel.com,
	hpa@...or.com,
	linux-edac@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	mingo@...hat.com,
	qirui.001@...edance.com,
	tglx@...utronix.de,
	x86@...nel.org
Subject: Re: [External] [PATCH] x86/mce: count the number of occurrences of each MCE severity

From: Rui Qi <qirui.001@...edance.com>

Hi Tony,

> You seem to have problems with the e-mail infrastructure. I got a few extra copies
> of this in HTML format. This one is in plain text, but the From: header says "$(name)"
> 

Sorry， some problem with my thunderbird mail agent. I will use git send-mail instead from now on.
> 
>>> So you either covered a case in the severities table, or you didn't. Does it
>>> help to know that you covered a case multiple times?
>>>
>>
>> In the fault injection test in the laboratory, we inject errors multiple
>> times and need a counter to tell us how many times each case has
>> occurred and compare it with the expected number to determine the test
>> results
> 
> In my testing on Intel/x86 I don't always see a 1:1 mapping between my
> test, and the severities rule. This is because of a h/w race between the
> memory controller reporting the error when it sees an uncorrectable ECC
> issue, and the core trying to consume the poisoned data. If the memory
> controller signal wins the race, Linux takes the page offline and there isn't
> a poison consumption error, just a page fault.
> 
>> In the production environment, the counter can reflect the actual number
>> of times each MCE error type occurs, which can help us detect the MCE
>> error distribution of large-scale Data center infrastructure
> 
> That could be useful.
>
Thank you for your expertise!

>>>> Due to the limitation of char type, the maximum supported statistics are
>>>> currently 255 times
>>>>
>>
>> How about changing char to u64, which is enough for real-world
>> situations and won't waste a lot of memory?
> 
> u64 seems like serious overkill. A change from "unsigned char" to "unsigned int"
> would keep track of 4 billion errors. That seems like plenty :-)
>> -Tony
Yes, unsinged int is far enough.

BTW, if you dont mind, I will send a V2 patch based on our discussion.