Message-ID: <CAChUvXNxcpHDOn1Rkxwp8VbuN5-k08JfrgwdbN95uvMWVh1Rwg@mail.gmail.com>
Date: Wed, 5 Dec 2018 15:59:46 -0600
From: Tracy Smith <tlsmith3777@...il.com>
To: york.sun@....com
Cc: bp@...en8.de, linux-edac@...r.kernel.org,
util-linux@...r.kernel.org, lkml <linux-kernel@...r.kernel.org>
Subject: Patrol scrub questions
>Single-bit errors are corrected by memory controller without involving software.
Sorry for being verbose, but I need to explain the reason for the
questions below since I need to determine if a memory scrub is
required on layerscape and why. There are multiple layers to the
problem of ECC.
First layer, there is the immediate 'correction' of a flipped bit.
This does not 'fix' the source of the error but corrects the flipped
bit for use by the processor.
Most bit flips will be due to either a transitory noise problem on the
bus, which will not be associated with any given memory cell, OR it
will be due to a cosmic-ray induced bit flip in the memory cell which
will stay 'flipped' until the location has been written to again.
The safe action is to write the ECC corrected data back to the same
'error' location in memory. Does the layerscape memory controller do
this without software intervention?
Question 1) Does the layerscape memory controller automatically
perform a write of the corrected data back to the 'error' location to
make a correction? If not, is a patrol scrub required to do this?
Second layer, there is the risk of a double bit flip in memory.
Statistically this is very rare, but the odds significantly increase
that a double bit flip will occur in a single word when a single bit
flip goes uncorrected, giving more time for another cosmic ray induced
bit flip to occur in that word.
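To put a rough number on this, here is a minimal sketch that assumes
bit flips arrive independently in each word at a constant rate (a
Poisson model) and that every scrub pass repairs all single-bit
errors. The rate used below is a made-up illustrative value, not a
measured figure for any DRAM or for layerscape, and the function names
are hypothetical.

```python
import math

def p_multi_bit(rate_per_word_hour, hours):
    """P(two or more flips land in one word during an unscrubbed window)."""
    mu = rate_per_word_hour * hours          # expected flips in the window
    # 1 - P(0 flips) - P(1 flip) for a Poisson count with mean mu
    return -math.expm1(-mu) - mu * math.exp(-mu)

def p_over_period(rate_per_word_hour, scrub_hours, total_hours):
    """P(at least one window in the period accumulates a multi-bit error)."""
    windows = total_hours / scrub_hours
    p = p_multi_bit(rate_per_word_hour, scrub_hours)
    # 1 - (1 - p) ** windows, written to stay accurate for tiny p
    return -math.expm1(windows * math.log1p(-p))

if __name__ == "__main__":
    RATE = 1e-6      # ASSUMPTION: flips per word per hour, illustrative only
    YEAR = 8760.0    # hours
    never = p_over_period(RATE, YEAR, YEAR)   # word is never scrubbed
    daily = p_over_period(RATE, 24.0, YEAR)   # word is scrubbed once a day
    print(f"no scrub: {never:.3e}  daily scrub: {daily:.3e}")
```

Under this model the per-window risk grows roughly with the square of
the window length, so scrubbing N times as often cuts the risk over a
fixed period by roughly a factor of N.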
The layerscape memory controller can only detect a bit-flip when a
given location is read, correct? This is different from normal DRAM
refresh routines.
If a location is not normally read, it can go 'unserviced'
indefinitely, allowing multiple bit flips to accumulate.
By periodically (once a day should be more than sufficient)
reading each location in the DRAM and writing that same (automatically
ECC corrected if correction was needed) value back into the DRAM, we
drastically reduce the potential for an uncorrectable multiple bit
error to accumulate in any given word in memory.
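The read/correct/write-back pass described above can be modelled in
software as follows. This is only a toy sketch: it uses a Hamming(8,4)
SECDED code over 4-bit values, which is an assumption chosen for
brevity and not the code the layerscape controller actually uses, and
the names encode, decode and scrub are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit SECDED codeword (list of bits)."""
    c = [0] * 8                      # c[1..7]: Hamming(7,4), c[0]: overall parity
    c[3], c[5], c[6], c[7] = [(nibble >> i) & 1 for i in range(4)]
    c[1] = c[3] ^ c[5] ^ c[7]
    c[2] = c[3] ^ c[6] ^ c[7]
    c[4] = c[5] ^ c[6] ^ c[7]
    c[0] = sum(c[1:]) & 1
    return c

def decode(word):
    """Return (status, data, repaired_word) for one stored codeword."""
    syndrome = 0
    for pos in range(1, 8):
        if word[pos]:
            syndrome ^= pos          # XOR of set positions is 0 for a valid word
    odd_parity = sum(word) & 1       # 1 means an odd number of bits flipped
    fixed = list(word)
    if odd_parity:
        fixed[syndrome] ^= 1         # single-bit error (syndrome 0 = parity bit)
        status = "corrected"
    elif syndrome:
        status = "uncorrectable"     # even number of flips: detect only
    else:
        status = "ok"
    data = fixed[3] | (fixed[5] << 1) | (fixed[6] << 2) | (fixed[7] << 3)
    return status, data, fixed

def scrub(memory):
    """One patrol pass: read every word, write corrected words back."""
    corrected = uncorrectable = 0
    for addr, word in enumerate(memory):
        status, _, fixed = decode(word)
        if status == "corrected":
            memory[addr] = fixed     # the write-back is what repairs the cell
            corrected += 1
        elif status == "uncorrectable":
            uncorrectable += 1
    return corrected, uncorrectable

if __name__ == "__main__":
    memory = [encode(n) for n in range(16)]
    memory[5][6] ^= 1                              # one upset in word 5
    print(decode(memory[5])[:2], memory[5] == encode(5))
    print(scrub(memory), memory[5] == encode(5))
```

In this model a plain read returns the right data but leaves the
flipped bit in storage; only the write-back in scrub() clears it, so a
second flip in the same word before the next pass would be reported as
uncorrectable.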
Question 2) Again this would require the EDAC layerscape driver to do
a patrol scrub, correct? If not, how is this handled by the memory
controller to avoid the need for a patrol scrub?
Third layer, there is how the memory controller handles uncorrectable
errors (UEs). My understanding is that the layerscape memory
controller can detect whether an error is a single-bit (correctable)
error or a multi-bit error that is not correctable. Is this the case?
An uncorrectable error, whether it lands in data or in code, will have
consequences ranging from negligible to critical. The hardware cannot
tell whether a given error is critical, so it must assume that it is.
Question 3) Because the memory controller or layerscape platform must
assume a UE is critical, will a single UE on layerscape cause a
watchdog timer (WDT) to be triggered and a reset to occur?
Question 4) If so, will a panic ever be called if there is a hardware
uncorrectable memory failure?