linux-kernel - Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:   Wed, 9 Sep 2020 14:02:03 +0200
From:   Borislav Petkov <bp@...en8.de>
To:     Shiju Jose <shiju.jose@...wei.com>
Cc:     "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
        "linux-acpi@...r.kernel.org" <linux-acpi@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "tony.luck@...el.com" <tony.luck@...el.com>,
        "rjw@...ysocki.net" <rjw@...ysocki.net>,
        "james.morse@....com" <james.morse@....com>,
        "lenb@...nel.org" <lenb@...nel.org>, Linuxarm <linuxarm@...wei.com>
Subject: Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate
 an erroneous CPU core

On Tue, Sep 01, 2020 at 04:20:54PM +0000, Shiju Jose wrote:
> CPU CEC derived the infrastructure of the CEC only and the logic
> used in the CEC for CE count storage, CE count calculation and page
> isolation is very unique for the memory pages, which seems cannot be
> reusable for the CPU CEs.

Oh, because it saves the reported error's PFN and you want to save

[CPU num | error count]

?

Well, you can easily change that by extending the existing CEC to have a
different storage format for CPU errors, i.e., use a different ce_array
which gets passed to the functions anyway.

> Also the values set for the parameters such as threshold, time period
> for the memory errors and CPU errors would be different.

And your implementation with sliding windows is so totally different
that it warrants the duplication of the code? I don't think so.

You can use the current CEC to do exactly what you wanna do, with the
decaying and so on.

Because all you wanna do is count the errors a CPU triggered.

However, a CPU can trigger a *lot* of different types of errors.
You're putting them all in the same basket by doing:

                else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM))
			/* add to CEC */

and only for correctable.

What type of errors get reported in CPER_SEC_PROC_ARM?

If they're all lumped together and if some functional unit generates a
lot of errors, instead of disabling that unit only, you'll go and remove
the whole CPU?

Doesn't make a whole lot of sense to me.

How about you define what exactly you're trying to solve, maybe give an
example of a real issue someone is encountering and you're trying to
address? Because there was never a necessity so far to disable CPUs on
x86 due to correctable errors. Why is that needed on ARM?

> Thus extending cec.c to support CPU CEs would include adding CPU CEC
> specific code for storing error count, isolation etc which I thought
> would result the code less tidy and less readable unless find more
> reusable logic.

Depends on how you design it.

But with what I'm seeing so far, I'm still sceptical this is needed at
all.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette