lists.openwall.net
Open Source and information security mailing list archives

Message-ID: <20250301184724.GGZ8NWPI2Ys_BX-w2F@fat_crate.local>
Date: Sat, 1 Mar 2025 19:47:24 +0100
From: Borislav Petkov <bp@...en8.de>
To: Shuai Xue <xueshuai@...ux.alibaba.com>
Cc: tony.luck@...el.com, nao.horiguchi@...il.com, tglx@...utronix.de,
	mingo@...hat.com, dave.hansen@...ux.intel.com, x86@...nel.org,
	hpa@...or.com, linmiaohe@...wei.com, akpm@...ux-foundation.org,
	peterz@...radead.org, jpoimboe@...nel.org,
	linux-edac@...r.kernel.org, linux-kernel@...r.kernel.org,
	linux-mm@...ck.org, baolin.wang@...ux.alibaba.com,
	tianruidong@...ux.alibaba.com
Subject: Re: [PATCH v2 2/5] x86/mce: dump error msg from severities

On Sat, Mar 01, 2025 at 10:03:13PM +0800, Shuai Xue wrote:
> (By the way, CentOS/Red Hat build kernels without CONFIG_RAS_CEC set, because
> it breaks EDAC decoding. We do not use CEC in production at all for the same
> reason.)

It doesn't "break" error decoding - it collects every correctable DRAM error
and puts it in a "leaky" bucket of sorts. And when a certain error address
generates too many errors, it memory_failure()s the page and poisons it.
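
To illustrate the idea only: below is a toy sketch of such a leaky bucket. It
is not the kernel's CEC code. The class name, the threshold of 3, and the
halving decay are all assumptions made up for this example; the real collector
has its own data structure and tunables.

```python
from collections import defaultdict

PAGE_SHIFT = 12  # assumes 4 KiB pages


class LeakyBucket:
    """Toy per-page correctable-error counter with periodic decay."""

    def __init__(self, threshold=3):
        self.threshold = threshold      # hypothetical limit, not the kernel default
        self.counts = defaultdict(int)  # page frame number -> error count
        self.offlined = set()           # pages we pretend to have poisoned

    def log_error(self, phys_addr):
        """Record one correctable error; return True if the page got offlined."""
        pfn = phys_addr >> PAGE_SHIFT
        if pfn in self.offlined:
            return False
        self.counts[pfn] += 1
        if self.counts[pfn] >= self.threshold:
            # The kernel would call memory_failure() here; we only note it.
            self.offlined.add(pfn)
            del self.counts[pfn]
            return True
        return False

    def decay(self):
        """The "leak": halve every count so that old errors fade away."""
        for pfn in list(self.counts):
            self.counts[pfn] //= 2
            if self.counts[pfn] == 0:
                del self.counts[pfn]
```

The point of the decay step is that a page which throws an occasional
correctable error never reaches the threshold, while a page which keeps
erroring does.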

You do not use it in production because you want to see every error, collect
it, massage it and perhaps decide when DIMMs go bad and you can replace
them... or whatever you do.

All the others who enable it, and we, can sleep properly, without getting
unnecessarily upset about a correctable error.

> Yes, we collect all kernel messages from the host, parse the logs and predict
> panics with AI tools. The more details we collect, the better the performance
> of the AI model.

LOL.

We go to the great effort of adding an MCE tracepoint which gives a
*structured* error record, show an example of how to use it in rasdaemon, and
you go and do the crazy hard and, at the same time, silly thing and parse
dmesg?!??!

This is priceless. Oh boy.

> Agreed, tracepoint is a more elegant way. However, it does not include error
> context, just some hardware registers.

The error context is in the behavior of the hw. If the error is fatal, you
won't see it - the machine will panic or do something else to prevent error
propagation. It definitely won't run any software anymore.

If you see the error getting logged, it means it is not fatal enough to kill
the machine.
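
As a rough sketch of how much those "hardware registers" already say: the
snippet below decodes the architectural flag bits of an IA32_MCi_STATUS value
(bit positions per the Intel SDM) into a coarse class. The function names and
the wording of the classes are made up for this example. It assumes a CPU with
software error recovery, so that the S and AR bits are meaningful, and it is
far simpler than the kernel's own severity grading, which also looks at
execution context.

```python
# Flag bit positions in IA32_MCi_STATUS (Intel SDM, machine-check architecture).
MCI_STATUS_BITS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected by hardware
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC is valid
    "ADDRV": 58,  # MCi_ADDR is valid
    "PCC": 57,    # processor context corrupt
    "S": 56,      # signaled via machine-check exception
    "AR": 55,     # software recovery action required
}


def decode_status(status):
    """Return a dict mapping each flag name to True/False."""
    return {name: bool((status >> bit) & 1) for name, bit in MCI_STATUS_BITS.items()}


def describe(status):
    """Coarse classification; a simplification, not the kernel's grading."""
    f = decode_status(status)
    if not f["VAL"]:
        return "no valid error"
    if not f["UC"]:
        return "corrected"
    if f["PCC"]:
        return "uncorrected, processor context corrupt"
    if f["S"] and f["AR"]:
        return "uncorrected, action required"
    if f["S"]:
        return "uncorrected, action optional"
    return "uncorrected, no action"
```

A record with VAL, UC, S and AR set and PCC clear comes out as "uncorrected,
action required", which is the recoverable case: the machine is still running
and software is expected to act on the affected address.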

> > Besides, this message is completely useless as it has no concrete info about
> > the error and what is being done about it.
> 
> I don't think so,

I think so and you're not reading my mail.

>     "mce: Uncorrected hardware memory error in user-access at 3b116c400"

Ask yourself: what can you do when you see a message like that?

Exactly *nothing* because there's not nearly enough information to recover
from it or log it or whatever. That error message is *totally useless* and
you're upsetting your users unnecessarily and even if they report it to you,
you can't help them.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
