linux-kernel - Re: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use printk

Message-ID: <20210506091630.168c7887@coco.lan>
Date:   Thu, 6 May 2021 09:16:30 +0200
From:   Mauro Carvalho Chehab <mchehab@...nel.org>
To:     Tyler Hicks <tyhicks@...ux.microsoft.com>
Cc:     Borislav Petkov <bp@...en8.de>, wangglei <wangglei@...il.com>,
        "Lei Wang (DPLAT)" <Wang.Lei@...rosoft.com>,
        "tony.luck@...el.com" <tony.luck@...el.com>,
        "james.morse@....com" <james.morse@....com>,
        "rric@...nel.org" <rric@...nel.org>,
        "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Hang Li <hangl@...rosoft.com>,
        Brandon Waller <bwaller@...rosoft.com>
Subject: Re: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use
 printk_ratelimited.

On Wed, 5 May 2021 18:01:52 -0500,
Tyler Hicks <tyhicks@...ux.microsoft.com> wrote:

> On 2021-05-06 00:55:00, Borislav Petkov wrote:
> > On Wed, May 05, 2021 at 05:43:57PM -0500, Tyler Hicks wrote:  
> > > This is x86-specific   
> > 
> > That's because it is used by x86 currently. It shouldn't be hard to use
> > it on another arch though as the machinery is pretty generic.
> >   
> > > and not applicable in our situation.  
> > 
> > What is your situation? ARM?  
> 
> Yes, though I'm not sure if those additional features are
> important/useful enough for us to generalize that driver. The main
> motivation here was just to prevent storage/network from being flooded
> by obviously-bad nodes that haven't been offlined yet. :) 

Well, if a machine starts to produce 500+ errors per second,
then it should be offlined as soon as possible, as otherwise bad results
will be produced ;-)
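
To make the "500+ errors per second" criterion concrete, here is a minimal
sketch of how a monitoring script could estimate the corrected-error rate from
two samples of the EDAC counters. It assumes the counters are exposed as
/sys/devices/system/edac/mc/mc*/ce_count; the function names, the 10 second
sampling interval, and the use of 500/s as a cut-off are illustrative
assumptions, not something taken from the patch or from any existing tool.

```python
import glob
import time


def read_ce_total(pattern="/sys/devices/system/edac/mc/mc*/ce_count"):
    """Sum the corrected-error counters of all memory controllers.

    The sysfs pattern is an assumption about where EDAC exposes the
    counters; the function returns 0 if nothing matches.
    """
    total = 0
    for path in glob.glob(pattern):
        with open(path) as f:
            total += int(f.read().strip())
    return total


def ce_rate(prev_count, curr_count, interval_s):
    """Corrected errors per second between two counter samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # A counter reset makes the difference negative; treat it as zero.
    return max(curr_count - prev_count, 0) / interval_s


def should_offline(rate, threshold=500.0):
    """True if the rate reaches the (hypothetical) offlining threshold."""
    return rate >= threshold


def sample_rate(interval_s=10.0):
    """Take two samples interval_s apart and return the CE rate."""
    before = read_ce_total()
    time.sleep(interval_s)
    after = read_ce_total()
    return ce_rate(before, after, interval_s)
```

Something like should_offline(sample_rate()) could then feed whatever
mechanism takes a node out of service. The real threshold is a policy decision
for whoever operates the machines.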

See, the CE error reporting mechanism is meant to be used together
with an error correction code algorithm, like the ones used in ECC
memories. Such algorithms are designed to detect a single errored bit
with a chance usually on the order of 10^-4 to 10^-7 (the precision
depends on how many bits are used and which algorithm is used), but
if there are two wrong bits in the same word, the chance of detecting
them is a lot lower.

So, keeping the server enabled up to the point where it consumes
enough storage/network resources to bother someone sounds like a
terrible idea, as sooner or later it will miss an error or produce
an uncorrected event ;-)

Besides that, if you're running rasdaemon to capture the hardware errors,
the storage will also be flooded by those events, even if you
disable them from going to syslog via
/sys/module/edac_core/parameters/edac_mc_log_ce.

Now, the question is: are those 500+ errors per second a real hardware
problem, or is it due to some broken error report mechanism?

In the latter case, the driver or the hardware that is producing
the invalid errors should be fixed.

> 
> Lei and others on cc will need to evaluate porting cec.c and what it
> will gain them. Thanks again.

Regards,
Mauro