linux-kernel - Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <91e71fe9-b002-0f1f-3237-62cea49e083a@arm.com>
Date:   Thu, 1 Oct 2020 18:16:03 +0100
From:   James Morse <james.morse@....com>
To:     Borislav Petkov <bp@...en8.de>, Shiju Jose <shiju.jose@...wei.com>
Cc:     "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
        "linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "tony.luck@...el.com" <tony.luck@...el.com>,
        "rjw@...ysocki.net" <rjw@...ysocki.net>,
        "lenb@...nel.org" <lenb@...nel.org>, Linuxarm <linuxarm@...wei.com>
Subject: Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate
 an erroneous CPU core

Hi guys,

On 17/09/2020 09:40, Borislav Petkov wrote:
> On Thu, Sep 10, 2020 at 03:29:56PM +0000, Shiju Jose wrote:

> You can't know what exactly you wanna do if you don't have a use case
> you're trying to address.
> 
>> According to the ARM Processor CPER definition the error types
>> reported are Cache Error, TLB Error, Bus Error and micro-architectural
>> Error.
> 
> Bus error sounds like not even originating in the CPU but the CPU only
> reporting it. Imagine if that really were the case, and you go disable
> the CPU but the error source is still there. You've just disabled the
> reporting of the error only and now you don't even know anymore that
> you're getting errors.
> 
>> Few thoughts on this,
>> 1. Not sure will a CPU core would work/perform as normal after disabling
>> a functional unit?
> 
> You can disable parts of caches, etc, so that you can have a somewhat
> functioning CPU until the replacement maintenance can take place.

This is implementation-specific stuff that only firmware can do...


>> 2. Support in the HW to disable a function unit alone may not available.
> 
> Yes.
> 
>> 3. If it is require to store and retrieve the error count based on
>> functional unit, then CEC will become more complex?
> 
> Depends on how it is designed. That's why we're first talking about what
> needs to be done exactly before going off and doing something.
> 
>> This requirement is the part of the early fault prediction by taking
>> action when large number of corrected errors reported on a CPU core
>> before it causing serious faults.
> 
> And do you know of actual real-life examples where this is really the
> case? Do you have any users who report a large error count on ARM CPUs,
> originating from the caches and that something like that would really
> help?
> 
> Because from my x86 CPUs limited experience, the cache arrays are mostly
> fine and errors reported there are not something that happens very
> frequently so we don't even need to collect and count those.
> 
> So is this something which you need to have in order to check a box
> somewhere that there is some functionality or is there an actual
> real-life use case behind it which a customer has requested?

If the corrected-count is available somewhere, can't this policy be made in user-space?


Thanks,

James