linux-kernel - Re: [RFC PATCH 0/7] RAS/CEC: Extend CEC for errors count check on short time period

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20201002180125.GD17436@zn.tnic>
Date:   Fri, 2 Oct 2020 20:02:18 +0200
From:   Borislav Petkov <bp@...en8.de>
To:     Shiju Jose <shiju.jose@...wei.com>
Cc:     James Morse <james.morse@....com>,
        "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
        "linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "tony.luck@...el.com" <tony.luck@...el.com>,
        "rjw@...ysocki.net" <rjw@...ysocki.net>,
        "lenb@...nel.org" <lenb@...nel.org>, Linuxarm <linuxarm@...wei.com>
Subject: Re: [RFC PATCH 0/7] RAS/CEC: Extend CEC for errors count check on
 short time period

On Fri, Oct 02, 2020 at 06:33:17PM +0100, James Morse wrote:
> > I think adding the CPU error collection to the kernel
> > has the following advantages,
> >     1. The CPU error collection and isolation would not be active if the
> >          rasdaemon stopped running or not running on a machine.

Wasn't there this thing called systemd which promised that it would
restart daemons when they fail? And even if it is not there, you can
always do your own cronjob which checks rasdaemon presence and restarts
it if it has died and sends a mail to the admin to check why it had
died.

Everything else I've trimmed but James has put it a lot more eloquently
than me and I cannot agree more with what he says. Doing this in
userspace is better in every aspect you can think of.

The current CEC thing runs in the kernel because it has a completely
different purpose - to limit corrected error reports which turn into
very expensive support calls for errors which were corrected but people
simply don't get that they were corrected. Instead, they throw hands in
the air and go "OMG, my hardware is failing".

Where those are, as James says:

> These are corrected errors. Nothing has gone wrong.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette