[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <91e71fe9-b002-0f1f-3237-62cea49e083a@arm.com>
Date: Thu, 1 Oct 2020 18:16:03 +0100
From: James Morse <james.morse@....com>
To: Borislav Petkov <bp@...en8.de>, Shiju Jose <shiju.jose@...wei.com>
Cc: "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
"linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"tony.luck@...el.com" <tony.luck@...el.com>,
"rjw@...ysocki.net" <rjw@...ysocki.net>,
"lenb@...nel.org" <lenb@...nel.org>, Linuxarm <linuxarm@...wei.com>
Subject: Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate
an erroneous CPU core
Hi guys,
On 17/09/2020 09:40, Borislav Petkov wrote:
> On Thu, Sep 10, 2020 at 03:29:56PM +0000, Shiju Jose wrote:
> You can't know what exactly you wanna do if you don't have a use case
> you're trying to address.
>
>> According to the ARM Processor CPER definition the error types
>> reported are Cache Error, TLB Error, Bus Error and micro-architectural
>> Error.
>
> Bus error sounds like not even originating in the CPU but the CPU only
> reporting it. Imagine if that really were the case, and you go disable
> the CPU but the error source is still there. You've just disabled the
> reporting of the error only and now you don't even know anymore that
> you're getting errors.
>
>> Few thoughts on this,
>> 1. Not sure will a CPU core would work/perform as normal after disabling
>> a functional unit?
>
> You can disable parts of caches, etc, so that you can have a somewhat
> functioning CPU until the replacement maintenance can take place.
This is implementation-specific stuff that only firmware can do...
>> 2. Support in the HW to disable a function unit alone may not available.
>
> Yes.
>
>> 3. If it is require to store and retrieve the error count based on
>> functional unit, then CEC will become more complex?
>
> Depends on how it is designed. That's why we're first talking about what
> needs to be done exactly before going off and doing something.
>
>> This requirement is the part of the early fault prediction by taking
>> action when large number of corrected errors reported on a CPU core
>> before it causing serious faults.
>
> And do you know of actual real-life examples where this is really the
> case? Do you have any users who report a large error count on ARM CPUs,
> originating from the caches and that something like that would really
> help?
>
> Because from my x86 CPUs limited experience, the cache arrays are mostly
> fine and errors reported there are not something that happens very
> frequently so we don't even need to collect and count those.
>
> So is this something which you need to have in order to check a box
> somewhere that there is some functionality or is there an actual
> real-life use case behind it which a customer has requested?
If the corrected-count is available somewhere, can't this policy be made in user-space?
Thanks,
James
Powered by blists - more mailing lists