Message-ID: <20210506091630.168c7887@coco.lan>
Date: Thu, 6 May 2021 09:16:30 +0200
From: Mauro Carvalho Chehab <mchehab@...nel.org>
To: Tyler Hicks <tyhicks@...ux.microsoft.com>
Cc: Borislav Petkov <bp@...en8.de>, wangglei <wangglei@...il.com>,
"Lei Wang (DPLAT)" <Wang.Lei@...rosoft.com>,
"tony.luck@...el.com" <tony.luck@...el.com>,
"james.morse@....com" <james.morse@....com>,
"rric@...nel.org" <rric@...nel.org>,
"linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Hang Li <hangl@...rosoft.com>,
Brandon Waller <bwaller@...rosoft.com>
Subject: Re: [EXTERNAL] Re: [PATCH] EDAC: update edac printk wrappers to use
printk_ratelimited.
Em Wed, 5 May 2021 18:01:52 -0500
Tyler Hicks <tyhicks@...ux.microsoft.com> escreveu:
> On 2021-05-06 00:55:00, Borislav Petkov wrote:
> > On Wed, May 05, 2021 at 05:43:57PM -0500, Tyler Hicks wrote:
> > > This is x86-specific
> >
> > That's because it is used by x86 currently. It shouldn't be hard to use
> > it on another arch though as the machinery is pretty generic.
> >
> > > and not applicable in our situation.
> >
> > What is your situation? ARM?
>
> Yes, though I'm not sure if those additional features are
> important/useful enough for us to generalize that driver. The main
> motivation here was just to prevent storage/network from being flooded
> by obviously-bad nodes that haven't been offlined yet. :)
Well, if a machine starts to produce 500+ errors per second,
then it should be offlined as soon as possible, as otherwise bad results
will be produced ;-)
See, the CE (corrected error) reporting mechanism is meant to be used
together with some error correction code algorithm like the ones used on
ECC memories. Such algorithms are designed to detect a single errored bit
with a chance usually on the order of 10^-4 to 10^-7 (the precision
depends on how many bits are used and what algorithm is used), but
if there are two wrong bits in the same word, the chance to detect
them is a lot lower.
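To illustrate the single-bit versus double-bit point, here is a minimal
sketch using a textbook Hamming(7,4) code. This is an assumption made for
illustration only: real ECC memories typically use wider codes with an
extra parity bit, and all function names below are hypothetical. It shows
that one flipped bit is corrected, while two flipped bits in the same word
make the decoder "correct" a third, innocent bit and return wrong data.

```python
from itertools import combinations

def encode(data):
    """Encode 4 data bits into a Hamming(7,4) codeword (index 0 unused)."""
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]  # parity over positions 1,3,5,7
    code[2] = code[3] ^ code[6] ^ code[7]  # parity over positions 2,3,6,7
    code[4] = code[5] ^ code[6] ^ code[7]  # parity over positions 4,5,6,7
    return code

def syndrome(code):
    """XOR of the positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if code[pos]:
            s ^= pos
    return s

def flip(code, *positions):
    """Return a copy of the codeword with the given positions inverted."""
    damaged = list(code)
    for pos in positions:
        damaged[pos] ^= 1
    return damaged

def decode(code):
    """Apply single-bit correction; return (data bits, syndrome seen)."""
    fixed = list(code)
    s = syndrome(fixed)
    if s:
        fixed[s] ^= 1  # the decoder assumes exactly one bit is wrong
    return [fixed[3], fixed[5], fixed[6], fixed[7]], s

def count_miscorrections(data):
    """Count double-bit error patterns that decode to the wrong data."""
    code = encode(data)
    return sum(
        decode(flip(code, a, b))[0] != data
        for a, b in combinations(range(1, 8), 2)
    )
```

With this toy code, every double-bit pattern is reported as if it were a
correctable single-bit error, which is why a node that keeps producing
corrected errors should not be trusted for long.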
So, keeping the server enabled up to the point that it would consume
enough resources on the storage/network to bother someone sounds like a
terrible idea, as sooner or later it will miss an error or produce
an uncorrected event ;-)
Besides that, if you're running rasdaemon to capture the hardware errors,
the storage will also be flooded by something like that, even if you
disable them from going to syslog via
/sys/module/edac_core/parameters/edac_mc_log_ce.
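As a minimal sketch of toggling that parameter, assuming edac_core is
loaded, the script runs as root, and the parameter accepts 0/1 as it does
on the kernels I'm aware of (the helper names are hypothetical):

```python
from pathlib import Path

# Path of the parameter named above, with its leading "/".
EDAC_LOG_CE = Path("/sys/module/edac_core/parameters/edac_mc_log_ce")

def get_ce_logging(path=EDAC_LOG_CE):
    """Return True if corrected errors are currently sent to the kernel log."""
    return Path(path).read_text().strip() != "0"

def set_ce_logging(enabled, path=EDAC_LOG_CE):
    """Write 1 (log corrected errors) or 0 (do not log them) to the parameter."""
    Path(path).write_text("1" if enabled else "0")
```

Calling set_ce_logging(False) only silences the syslog side; as said
above, rasdaemon would still record the events.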
Now, the question is: are those 500+ errors per second a real hardware
problem, or are they due to some broken error reporting mechanism?
In the latter case, the driver or the hardware that is producing
invalid errors should be fixed.
>
> Lei and others on cc will need to evaluate porting cec.c and what it
> will gain them. Thanks again.
Regards,
Mauro