Message-ID: <Z/f04bEJAUvMCzXC@redbud>
Date: Thu, 10 Apr 2025 11:42:09 -0500
From: "Tyler Hicks (Microsoft)" <code@...icks.com>
To: Marc Zyngier <maz@...nel.org>
Cc: Krzysztof Kozlowski <krzk@...nel.org>,
Vijay Balakrishna <vijayb@...ux.microsoft.com>,
Borislav Petkov <bp@...en8.de>, Tony Luck <tony.luck@...el.com>,
James Morse <james.morse@....com>,
Mauro Carvalho Chehab <mchehab@...nel.org>,
Robert Richter <rric@...nel.org>, linux-edac@...r.kernel.org,
linux-kernel@...r.kernel.org, Sascha Hauer <s.hauer@...gutronix.de>
Subject: Re: [PATCH 2/2] dt-bindings: arm: cpus: Add edac-enabled property
On 2025-04-10 17:23:26, Marc Zyngier wrote:
> On Thu, 10 Apr 2025 15:30:17 +0100,
> "Tyler Hicks (Microsoft)" <code@...icks.com> wrote:
> >
> > On 2025-04-10 08:10:18, Marc Zyngier wrote:
> > > On Thu, 10 Apr 2025 07:00:55 +0100,
> > > Krzysztof Kozlowski <krzk@...nel.org> wrote:
> > > >
> > > > On 10/04/2025 01:36, Vijay Balakrishna wrote:
> > > > > From: Sascha Hauer <s.hauer@...gutronix.de>
> > > > >
> > > > > Some ARM Cortex CPUs like the A53, A57 and A72 have Error Detection And
> > > > > Correction (EDAC) support on their L1 and L2 caches. This is implemented
> > > > > in implementation defined registers, so usage of this functionality is
> > > > > not safe in virtualized environments or when EL3 already uses these
> > > > > > registers. This patch adds an edac-enabled flag which can be explicitly
> > > > > set when EDAC can be used.
> > > >
> > > > Can't hypervisor tell you that?
> > >
> > > No, it can't. This is not an architecture feature, and KVM will gladly
> > > inject an UNDEF exception if the guest tries to use this.
> > >
> > > Which is yet another reason why this whole exercise is futile.
> >
> > Hi Marc - could you clarify why this is futile for baremetal or were you just
> > referring to virtualized environments?
>
> This is futile in general. This sort of stuff only makes sense if you
> can take useful action upon detecting an error, such as cache
> scrubbing. Here, this is just telling you "bang, you're dead", without
> any other recourse. You are not even sure you'll be able to actually
> *run* this code. You cannot identify what the blast radius is.

We want to use it for monitoring purposes to let us know when a system needs to
be replaced. Knowing the number of Correctable Errors that a specific system is
encountering will help prioritize the replacement of that faulty system.

Also, if we can find some breadcrumbs of an Uncorrectable Error (UE) occurring
just before an important process crashes or before the kernel crashes, then we
can avoid expensive manual debugging and simply replace the system. Automation
can be implemented to dig through the kernel core dump contents to look for a
UE log message from this driver and a kernel engineer will never have to look
at the dump.
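
As a rough illustration of that kind of automation, here is a minimal sketch.
It assumes the kernel log text has already been extracted from the core dump
by some other tool, and it assumes a hypothetical message format: the driver's
actual log text is not shown in this thread, so the pattern and the sample
lines below are made up and would need to be adjusted to the real output.

```python
import re

# Hypothetical message format: the driver's real log text is not shown in this
# thread, so this pattern is an assumption made only for illustration.
UE_PATTERN = re.compile(r"\bEDAC\b.*\b(?:UE|Uncorrectable)\b")


def find_ue_lines(log_text):
    """Return the log lines that look like Uncorrectable Error reports."""
    return [line for line in log_text.splitlines() if UE_PATTERN.search(line)]


if __name__ == "__main__":
    # Made-up sample input, not real driver output.
    sample = "\n".join([
        "[  12.000000] EDAC DEVICE0: UE: hypothetical L2 cache error on CPU 1",
        "[  13.000000] EDAC DEVICE0: CE: hypothetical L1 cache error on CPU 0",
        "[  14.000000] Kernel panic - not syncing: hypothetical crash",
    ])
    for line in find_ue_lines(sample):
        print(line)
```

A non-empty result would be the signal to flag the system for replacement
rather than hand the dump to an engineer.
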
> We have some other EDAC implementation for arm64 CPUs (XGene,
> ThunderX), and they are all perfectly useless (I have them in my
> collection of horrors). I know you are familiar enough with the RAS
> architecture to appreciate the difference with a contemporary
> implementation that would actually do the right thing.
Yes, those are nice luxuries to have in the newer implementations, but there are
still a lot of older systems in use, and making do with the capabilities that
the older hardware provides is still useful.
Tyler
>
> Thanks,
>
> M.
>
> --
> Without deviation from the norm, progress is not possible.