Message-Id: <20240624044839.87035-1-qirui.001@bytedance.com>
Date: Mon, 24 Jun 2024 12:48:39 +0800
From: Rui Qi <qirui.001@...edance.com>
To: tony.luck@...el.com
Cc: bp@...en8.de,
dave.hansen@...ux.intel.com,
hpa@...or.com,
linux-edac@...r.kernel.org,
linux-kernel@...r.kernel.org,
mingo@...hat.com,
qirui.001@...edance.com,
tglx@...utronix.de,
x86@...nel.org
Subject: Re: [External] [PATCH] x86/mce: count the number of occurrences of each MCE severity
Hi Tony,
> You seem to have problems with the e-mail infrastructure. I got a few extra copies
> of this in HTML format. This one is in plain text, but the From: header says "$(name)"
>
Sorry, there was a problem with my Thunderbird mail client. I will use git send-email from now on.
>
>>> So you either covered a case in the severities table, or you didn't. Does it
>>> help to know that you covered a case multiple times?
>>>
>>
>> In the fault injection test in the laboratory, we inject errors multiple
>> times and need a counter to tell us how many times each case has
>> occurred and compare it with the expected number to determine the test
>> results
>
> In my testing on Intel/x86 I don't always see a 1:1 mapping between my
> test, and the severities rule. This is because of a h/w race between the
> memory controller reporting the error when it sees an uncorrectable ECC
> issue, and the core trying to consume the poisoned data. If the memory
> controller signal wins the race, Linux takes the page offline and there isn't
> a poison consumption error, just a page fault.
>
>> In the production environment, the counter can reflect the actual number
>> of times each MCE error type occurs, which can help us detect the MCE
>> error distribution of large-scale Data center infrastructure
>
> That could be useful.
>
Thank you for your expertise!
>>>> Due to the limitation of char type, the maximum supported statistics are
>>>> currently 255 times
>>>>
>>
>> How about changing char to u64, which is enough for real-world
>> situations and won't waste a lot of memory?
>
> u64 seems like serious overkill. A change from "unsigned char" to "unsigned int"
> would keep track of 4 billion errors. That seems like plenty :-)
>
> -Tony
Yes, unsigned int is more than enough.
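To illustrate the limit discussed above, here is a minimal sketch in plain userspace C, not the kernel code itself. The struct and function names (rule_narrow, rule_wide, hit_narrow, hit_wide) are hypothetical and only stand in for a per-rule hit counter. It assumes an 8-bit char, as on x86.

```c
/* Hypothetical stand-ins for a per-rule hit counter (not the kernel's struct). */
struct rule_narrow {
	unsigned char count;	/* wraps to 0 after 255 hits */
};

struct rule_wide {
	unsigned int count;	/* wraps only after about 4 billion hits */
};

/* Record n hits on the narrow counter and return the stored value. */
unsigned int hit_narrow(struct rule_narrow *r, unsigned int n)
{
	while (n--)
		r->count++;	/* unsigned arithmetic: wraps modulo 256 */
	return r->count;
}

/* Record n hits on the wide counter and return the stored value. */
unsigned int hit_wide(struct rule_wide *r, unsigned int n)
{
	while (n--)
		r->count++;
	return r->count;
}
```

With 300 injected errors on zeroed counters, hit_narrow() would report 44 (300 modulo 256) while hit_wide() reports 300, so only the wider counter can be compared against the expected number of injections.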
BTW, if you don't mind, I will send a v2 patch based on our discussion.