linux-kernel - Re: [PATCH] ghes: Track number of recovered hardware errors


Message-ID: <20250716083026.1737fdb4@foz.lan>
Date: Wed, 16 Jul 2025 08:30:26 +0200
From: Mauro Carvalho Chehab <mchehab+huawei@...nel.org>
To: Shuai Xue <xueshuai@...ux.alibaba.com>
Cc: Borislav Petkov <bp@...en8.de>, Breno Leitao <leitao@...ian.org>,
 Alexander Graf <graf@...zon.com>, Konrad Rzeszutek Wilk
 <konrad.wilk@...cle.com>, Peter Gonda <pgonda@...gle.com>, "Luck, Tony"
 <tony.luck@...el.com>, "Rafael J. Wysocki" <rafael@...nel.org>, Len Brown
 <lenb@...nel.org>, James Morse <james.morse@....com>, "Moore, Robert"
 <robert.moore@...el.com>, "linux-acpi@...r.kernel.org"
 <linux-acpi@...r.kernel.org>, "linux-kernel@...r.kernel.org"
 <linux-kernel@...r.kernel.org>, "acpica-devel@...ts.linux.dev"
 <acpica-devel@...ts.linux.dev>, "kernel-team@...a.com"
 <kernel-team@...a.com>
Subject: Re: [PATCH] ghes: Track number of recovered hardware errors

On Wed, 16 Jul 2025 10:05:27 +0800
Shuai Xue <xueshuai@...ux.alibaba.com> wrote:

> On 2025/7/15 23:09, Borislav Petkov wrote:
> > On Tue, Jul 15, 2025 at 09:46:03PM +0800, Shuai Xue wrote:  
> >> For the purpose of counting, how about using the cmdline of rasdaemon?  
> > 
> > That would mean you have to run rasdaemon on those machines before they
> > explode and then carve out the rasdaemon db from the coredump (this is
> > post-mortem analysis).  
> 
> Rasdaemon is a userspace tool that collects all hardware error 
> events reported by the Linux kernel from several sources (EDAC, MCE, 
> PCI, ...) into one common framework. It has been a standard tool
> at Alibaba. As far as I know, Twitter also uses rasdaemon in production.

There are several others using rasdaemon, afaict. It was originally
implemented due to demand from supercomputer customers with thousands
of nodes in the US, and it has been shipped by major distros for quite a while.

> 
> > 
> > I would love for rasdaemon to log over the network and then other tools can
> > query those centralized logs but that has its own challenges...
> >   
> 
> I also prefer collecting rasdaemon data in a centralized data center, as 
> this is more beneficial for using big data analytics to analyze and 
> predict errors. At the same time, the centralized side also uses 
> rasdaemon logs as one of the references for machine operations and 
> maintenance.
> 
> As for rasdaemon itself, it is just a single-node event collector and 
> database, although it does also print logs. In practice, we use SLS [1] 
> to collect rasdaemon text logs from individual nodes and parse them on 
> the central side.

Well, rasdaemon already uses SQL commands to store events in its SQLite database.

It shouldn't be hard to add a patch series to optionally use a centralized
database directly. My only concern is that delivering logs to an external
database from a machine that has hardware errors can be problematic and
may end up losing events.

Also, supporting different databases can be problematic due to the
libraries they require. Last time I wrote code to write to an Oracle
DB (a lifetime ago), the number of libraries required was huge. Also,
changing the order of the "-l" options caused ld to fail to find the
right objects. It was messy. Supporting MySQL and PostgreSQL, on the
other hand, is not that hard.

Perhaps a good compromise would be to add logic there to open a local
socket or a TCP socket to a logger daemon, sending the events asynchronously
after storing them locally in SQLite. Then, write the logger daemon as a
Python script using SQLAlchemy. This way, we get support for several
different databases for free.
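
To make the idea more concrete, here is a rough sketch of what such a
logger daemon could look like. Everything in it is an assumption made for
illustration: rasdaemon has no such sender today, the wire format (one JSON
object per line), the "ras_events" table and its columns, and the names
make_server() and make_sqlalchemy_sink() are all hypothetical.

```python
import json
import socketserver


class EventHandler(socketserver.StreamRequestHandler):
    """Reads newline-delimited JSON events from one client connection."""

    def handle(self):
        for line in self.rfile:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
            except ValueError:
                # Malformed line: skip it rather than drop the connection.
                continue
            self.server.sink(event)


def make_server(address, sink):
    """Build a threaded TCP server that passes each decoded event to sink()."""
    server = socketserver.ThreadingTCPServer(address, EventHandler)
    server.sink = sink
    return server


def make_sqlalchemy_sink(url):
    """Return a sink that inserts events into a hypothetical ras_events table.

    SQLAlchemy is imported here so the receiver itself only needs the
    standard library; the database backend is selected by the URL.
    """
    from sqlalchemy import (Column, Integer, MetaData, Table, Text,
                            create_engine)

    engine = create_engine(url)
    metadata = MetaData()
    events = Table(
        "ras_events", metadata,
        Column("id", Integer, primary_key=True),
        Column("host", Text),
        Column("source", Text),
        Column("payload", Text),  # full event kept as JSON text
    )
    metadata.create_all(engine)

    def sink(event):
        # One transaction per event; batching would be a later refinement.
        with engine.begin() as conn:
            conn.execute(events.insert().values(
                host=event.get("host"),
                source=event.get("source"),
                payload=json.dumps(event),
            ))

    return sink
```

It could be started with something like
make_server(("127.0.0.1", 5140), make_sqlalchemy_sink("sqlite:///ras_central.db")).serve_forever(),
where the port and the database URL are again just placeholders. Switching
to MySQL or PostgreSQL would then only mean changing the URL and installing
the matching driver on the central side, not on the nodes.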

Thanks,
Mauro