Message-ID: <CAJZ5v0jRrow5nXF3mXCVKerzaURKqDJBMp_PDfQDLF2OVpEeGA@mail.gmail.com>
Date: Thu, 17 Jun 2021 14:07:01 +0200
From: "Rafael J. Wysocki" <rafael@...nel.org>
To: Xiaofei Tan <tanxiaofei@...wei.com>
Cc: "Rafael J. Wysocki" <rafael@...nel.org>,
James Morse <james.morse@....com>,
"Rafael J. Wysocki" <rjw@...ysocki.net>,
Len Brown <lenb@...nel.org>, Tony Luck <tony.luck@...el.com>,
Borislav Petkov <bp@...en8.de>,
Andrew Morton <akpm@...ux-foundation.org>,
Joerg Roedel <jroedel@...e.de>,
Peter Zijlstra <peterz@...radead.org>,
ACPI Devel Maling List <linux-acpi@...r.kernel.org>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
linuxarm@...neuler.org
Subject: Re: [PATCH v7] ACPI / APEI: fix the regression of synchronous
external aborts occur in user-mode
On Tue, Jun 15, 2021 at 5:47 AM Xiaofei Tan <tanxiaofei@...wei.com> wrote:
>
> Hi Rafael,
>
> On 2021/6/14 23:46, Rafael J. Wysocki wrote:
> > On Fri, Jun 11, 2021 at 2:40 PM Xiaofei Tan <tanxiaofei@...wei.com> wrote:
> >>
> >> Before commit 8fcc4ae6faf8 ("arm64: acpi: Make apei_claim_sea()
> >> synchronise with APEI's irq work"), do_sea() would unconditionally
> >> signal the affected task from the arch code. Since that change,
> >> the GHES driver sends the signals.
> >>
> >> This exposes a problem as errors the GHES driver doesn't understand
> >> or doesn't handle effectively are silently ignored. This causes
> >> the errors to be taken again and to circulate endlessly, so the
> >> user-space task gets stuck in this loop.
> >>
> >> Existing firmware on Kunpeng9xx systems reports cache errors with the
> >> 'ARM Processor Error' CPER records.
> >>
> >> Do memory failure handling for ARM Processor Error Section just like
> >> for Memory Error Section.
> >
> > Still, I'm not convinced that this is the right way to address the problem.
> >
> > In particular, is it guaranteed that "ARM Processor Error" will always
> > mean "memory failure" on all platforms?
> >
>
> There are two sources of ARM Processor cache errors (the second case
> does not apply to platforms that don't support the poison mechanism).
>
> 1. The error occurs in the cache itself. If it is transient, we have a
> chance to recover by doing memory failure handling. If it is persistent,
> it has to be handled elsewhere, for example by doing cache way isolation
> in firmware or by triggering CPU core isolation from user space. I think
> most platforms can't support such features, so the simplest and most
> effective way is to report it as a fatal error and do the isolation
> during the firmware start-up phase.
>
> 2. The error is transferred from another RAS node. If it comes from DDR,
> I think there is no doubt, and this is the most common case we have met
> before. If it comes from elsewhere in the SoC, such as internal SRAM
> (far less likely than DDR), the error is still in the hardware, but the
> RAS node that detected the SRAM error will also report it.
>
> To sum up, this is effective in most situations and does no harm in
> the others.
OK, so applied as 5.14 material under edited subject.
Thanks!