Message-ID: <20250716083026.1737fdb4@foz.lan>
Date: Wed, 16 Jul 2025 08:30:26 +0200
From: Mauro Carvalho Chehab <mchehab+huawei@...nel.org>
To: Shuai Xue <xueshuai@...ux.alibaba.com>
Cc: Borislav Petkov <bp@...en8.de>, Breno Leitao <leitao@...ian.org>,
Alexander Graf <graf@...zon.com>, Konrad Rzeszutek Wilk
<konrad.wilk@...cle.com>, Peter Gonda <pgonda@...gle.com>, "Luck, Tony"
<tony.luck@...el.com>, "Rafael J. Wysocki" <rafael@...nel.org>, Len Brown
<lenb@...nel.org>, James Morse <james.morse@....com>, "Moore, Robert"
<robert.moore@...el.com>, "linux-acpi@...r.kernel.org"
<linux-acpi@...r.kernel.org>, "linux-kernel@...r.kernel.org"
<linux-kernel@...r.kernel.org>, "acpica-devel@...ts.linux.dev"
<acpica-devel@...ts.linux.dev>, "kernel-team@...a.com"
<kernel-team@...a.com>
Subject: Re: [PATCH] ghes: Track number of recovered hardware errors
On Wed, 16 Jul 2025 10:05:27 +0800
Shuai Xue <xueshuai@...ux.alibaba.com> wrote:
> On 2025/7/15 23:09, Borislav Petkov wrote:
> > On Tue, Jul 15, 2025 at 09:46:03PM +0800, Shuai Xue wrote:
> >> For the purpose of counting, how about using the cmdline of rasdaemon?
> >
> > That would mean you have to run rasdaemon on those machines before they
> > explode and then carve out the rasdaemon db from the coredump (this is
> > post-mortem analysis).
>
> Rasdaemon is a userspace tool that will collect all hardware error
> events reported by the Linux Kernel from several sources (EDAC, MCE,
> PCI, ...) into one common framework. And it has been a standard tool
> at Alibaba. As far as I know, Twitter also uses rasdaemon in production.
There are several others using rasdaemon, AFAIK. It was originally
implemented due to a demand from supercomputer customers with thousands
of nodes in the US, and has been shipped by major distros for quite a while.
>
> >
> > I would love for rasdaemon to log over the network and then other tools can
> > query those centralized logs but that has its own challenges...
> >
>
> I also prefer collecting rasdaemon data in a centralized data center, as
> this is more beneficial for using big data analytics to analyze and
> predict errors. At the same time, the centralized side also uses
> rasdaemon logs as one of the references for machine operations and
> maintenance.
>
> As for rasdaemon itself, it is just a single-node event collector and
> database, although it does also print logs. In practice, we use SLS [1]
> to collect rasdaemon text logs from individual nodes and parse them on
> the central side.
Well, rasdaemon already uses SQL commands to store events in its SQLite
database. It shouldn't be hard to add a patch series to optionally use a
centralized database directly. My only concern is that delivering logs to
an external database from a machine that has hardware errors can be
problematic and may end up losing events.
Also, supporting different databases can be problematic due to the
libraries they require. The last time I wrote code to write to an Oracle
DB (a lifetime ago), the number of required libraries was huge. Also,
changing the order of the "-l" options caused ld to fail to find the
right objects. It was messy. That said, supporting MySQL and PostgreSQL
is not that hard.
Perhaps a good compromise would be to add logic there to open a local
socket or a TCP socket to a logger daemon, sending the events asynchronously
after storing them locally in SQLite. The logger daemon could then be a
Python script using SQLAlchemy. This way, we get support for several
different databases for free.
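
A minimal sketch of the "store locally first, then forward asynchronously"
idea, using only the Python standard library. rasdaemon itself is written
in C, so this only illustrates the ordering and the failure handling. The
EventForwarder class, the "events" table and the JSON-lines wire format are
all hypothetical and are not part of rasdaemon. The receiving side (where a
SQLAlchemy-based script would sit) is not shown.

```python
import json
import queue
import socket
import sqlite3
import threading


class EventForwarder:
    """Hypothetical helper: persist each event locally, then forward it."""

    def __init__(self, db_path, sock, max_pending=1024):
        # Local SQLite store: the authoritative copy of every event.
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events "
            "(id INTEGER PRIMARY KEY, payload TEXT NOT NULL)"
        )
        self.sock = sock
        # Bounded queue so a stalled logger cannot exhaust memory.
        self.pending = queue.Queue(maxsize=max_pending)
        self.dropped = 0  # events that could not be forwarded
        self.worker = threading.Thread(target=self._send_loop, daemon=True)
        self.worker.start()

    def record(self, event):
        payload = json.dumps(event, sort_keys=True)
        # Commit locally before any network activity is attempted.
        with self.db:
            self.db.execute(
                "INSERT INTO events (payload) VALUES (?)", (payload,)
            )
        try:
            self.pending.put_nowait(payload)
        except queue.Full:
            # Forwarding is best effort; the local row is still there.
            self.dropped += 1

    def _send_loop(self):
        # Runs in the background thread; only touches the socket.
        while True:
            payload = self.pending.get()
            if payload is None:  # sentinel queued by close()
                break
            try:
                self.sock.sendall(payload.encode("utf-8") + b"\n")
            except OSError:
                self.dropped += 1

    def close(self):
        # Drain whatever is queued, then stop the worker.
        self.pending.put(None)
        self.worker.join()
        self.db.close()
```

Because the insert is committed before the event is queued, a dead or slow
logger only increases the "dropped" counter; the local database still has
every event and could be replayed later.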
Thanks,
Mauro