linux-kernel - Re: [PATCH v7] ACPI / APEI: fix the regression of synchronous external aborts occur in user-mode


Message-ID: <CAJZ5v0jRrow5nXF3mXCVKerzaURKqDJBMp_PDfQDLF2OVpEeGA@mail.gmail.com>
Date:   Thu, 17 Jun 2021 14:07:01 +0200
From:   "Rafael J. Wysocki" <rafael@...nel.org>
To:     Xiaofei Tan <tanxiaofei@...wei.com>
Cc:     "Rafael J. Wysocki" <rafael@...nel.org>,
        James Morse <james.morse@....com>,
        "Rafael J. Wysocki" <rjw@...ysocki.net>,
        Len Brown <lenb@...nel.org>, Tony Luck <tony.luck@...el.com>,
        Borislav Petkov <bp@...en8.de>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Joerg Roedel <jroedel@...e.de>,
        Peter Zijlstra <peterz@...radead.org>,
        ACPI Devel Maling List <linux-acpi@...r.kernel.org>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        linuxarm@...neuler.org
Subject: Re: [PATCH v7] ACPI / APEI: fix the regression of synchronous
 external aborts occur in user-mode

On Tue, Jun 15, 2021 at 5:47 AM Xiaofei Tan <tanxiaofei@...wei.com> wrote:
>
> Hi Rafael,
>
> On 2021/6/14 23:46, Rafael J. Wysocki wrote:
> > On Fri, Jun 11, 2021 at 2:40 PM Xiaofei Tan <tanxiaofei@...wei.com> wrote:
> >>
> >> Before commit 8fcc4ae6faf8 ("arm64: acpi: Make apei_claim_sea()
> >> synchronise with APEI's irq work"), do_sea() would unconditionally
> >> signal the affected task from the arch code. Since that change,
> >> the GHES driver sends the signals.
> >>
> >> This exposes a problem, as errors the GHES driver doesn't understand
> >> or doesn't handle effectively are silently ignored. This causes
> >> the errors to be taken again and to circulate endlessly. The user-space
> >> task gets stuck in this loop.
> >>
> >> Existing firmware on Kunpeng9xx systems reports cache errors with the
> >> 'ARM Processor Error' CPER records.
> >>
> >> Do memory failure handling for ARM Processor Error Section just like
> >> for Memory Error Section.
> >
> > Still, I'm not convinced that this is the right way to address the problem.
> >
> > In particular, is it guaranteed that "ARM Processor Error" will always
> > mean "memory failure" on all platforms?
> >
>
> There are two sources of ARM Processor cache errors (the second case does not apply to platforms that don't support a poison mechanism).
> 1. The error occurs in the cache itself. If it is transient, we have a chance to recover by doing memory failure handling.
> If it is persistent, it has to be handled elsewhere, such as by doing cache way isolation in firmware
> or by triggering CPU core isolation in user space. I think most platforms can't support such features,
> so the simplest and most effective way is to report it as a fatal error and do the isolation during the firmware start-up phase.
>
> 2. The error is transferred from another RAS node. If it is from DDR, I think there is no doubt, and this is
> the most common case we have met before. If it is from another place in the SoC, such as internal SRAM (the probability is very small compared to DDR),
> the error is still in the hardware. But the RAS node that detected the SRAM error will also report the error.
>
> To sum up, this is effective for most situations and does no harm in the others.

OK, so applied as 5.14 material under edited subject.

Thanks!
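
For readers following the thread, the idea of doing memory failure handling for the ARM Processor Error Section can be sketched as a small user-space model. This is not the patch itself: the struct layout, the enum values, the 4 KiB page size, and the names classify() and handle_record() are all assumptions made for illustration. The rule that only cache errors carrying a valid physical address are queued is likewise an assumption, not something the thread states.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical page size for this sketch: 4 KiB pages. */
#define SKETCH_PAGE_SHIFT 12

/* Hypothetical, simplified view of one error entry in a processor error record. */
enum err_type { ERR_CACHE, ERR_TLB, ERR_BUS, ERR_MICROARCH };

struct err_info {
    enum err_type type;
    bool has_phys_addr;   /* true if the firmware supplied a physical address */
    uint64_t phys_addr;
};

enum action { ACTION_NONE, ACTION_MEMORY_FAILURE };

/*
 * Assumed policy: a cache error with a known physical address is treated
 * like a memory error; anything else is left for other handling.
 */
static enum action classify(const struct err_info *e)
{
    if (e->type == ERR_CACHE && e->has_phys_addr)
        return ACTION_MEMORY_FAILURE;
    return ACTION_NONE;
}

/*
 * Walk all entries of one record and collect the page frame numbers that
 * would be handed to memory failure handling. Returns how many were queued.
 * A return value of 0 means nothing was handled, so the caller must not
 * silently ignore the error (otherwise the task would take it again).
 */
static size_t handle_record(const struct err_info *entries, size_t n,
                            uint64_t *pfns, size_t cap)
{
    size_t queued = 0;

    for (size_t i = 0; i < n && queued < cap; i++) {
        if (classify(&entries[i]) == ACTION_MEMORY_FAILURE)
            pfns[queued++] = entries[i].phys_addr >> SKETCH_PAGE_SHIFT;
    }
    return queued;
}
```

In this model, a record where handle_record() returns 0 corresponds to the "silently ignored" situation described in the commit message, and is the case where the caller would need a fallback such as signalling the affected task.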