linux-kernel - Re: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM

Message-ID: <20180824120102.GB29751@nazgul.tnic>
Date:   Fri, 24 Aug 2018 14:01:02 +0200
From:   Borislav Petkov <bp@...en8.de>
To:     James Morse <james.morse@....com>
Cc:     Tyler Baicar <baicar.tyler@...il.com>,
        Tyler Baicar <tbaicar@...eaurora.org>, wufan@...eaurora.org,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        harba@....qualcomm.com, mchehab@...nel.org,
        arm-mail-list <linux-arm-kernel@...ts.infradead.org>,
        linux-edac@...r.kernel.org
Subject: Re: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM

On Fri, Aug 24, 2018 at 10:48:24AM +0100, James Morse wrote:
> Why avoid the layer stuff? Isn't counting DIMM/memory-devices what
> EDAC_MC_LAYER_SLOT is for?

Yap.

> so edac_raw_mc_handle_error() has no clue where the error happened. (I haven't
> read what it does with this information yet).

See edac_inc_ce_error(), for example - it uses the layers which are not
negative (-1) to increment the error counts of the respective layer. It
all depends on the granularity of the hardware part you're reporting
the error for: a DIMM rank, a whole DIMM, or a channel which can span
multiple DIMM ranks. And so on...

Look at some of the drivers and how they're doing that layering. It all
depends on whether you can get the precise info from the hw.
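
To make the layer counting concrete, here is a rough userspace sketch of
the idea, loosely modelled on edac_inc_ce_error(). It is not the kernel
code: struct mc_model, model_inc_ce() and the MAX_* limits are made-up
names for illustration, and treating "top layer unknown" as the noinfo
case is a simplification of how the core decides that.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LAYERS 3
#define MAX_SLOTS  64	/* arbitrary cap for this sketch */

/* Hypothetical, stripped-down stand-in for struct mem_ctl_info. */
struct mc_model {
	int n_layers;
	int layer_size[MAX_LAYERS];	/* elements per layer */
	unsigned int ce_total;
	unsigned int ce_noinfo;		/* errors with no usable location */
	unsigned int ce_per_layer[MAX_LAYERS][MAX_SLOTS];
};

/*
 * pos[i] is the position within layer i, or -1 if the hardware
 * (or firmware) could not tell us. Counting stops at the first
 * unknown layer.
 */
static void model_inc_ce(struct mc_model *mc, const int pos[MAX_LAYERS],
			 unsigned int count)
{
	int i, index = 0;

	mc->ce_total += count;

	/* Simplification: nothing known at the top layer -> lump it. */
	if (mc->n_layers == 0 || pos[0] < 0) {
		mc->ce_noinfo += count;
		return;
	}

	for (i = 0; i < mc->n_layers; i++) {
		if (pos[i] < 0)
			break;

		/* Flatten (layer0, layer1, ...) into one index per layer. */
		index += pos[i];
		assert(index < MAX_SLOTS);
		mc->ce_per_layer[i][index] += count;

		if (i < mc->n_layers - 1)
			index *= mc->layer_size[i + 1];
	}
}
```

With two layers of sizes 2 and 4, a position of {1, 2} bumps the
layer-0 counter at index 1 and the layer-1 counter at index 1 * 4 + 2,
while a position of all -1 only bumps the noinfo counter.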

> ghes_edac_report_mem_error() does check CPER_MEM_VALID_MODULE_HANDLE, and if it's
> set, it uses the handle to find the bank/device strings and prints them out.

Yap, and the error counts are lumped together into

  /sys/devices/system/edac/mc/mc*/ce_noinfo_count

> Naively I thought we could generate some index during ghes_edac_count_dimms(),
> and use this as e->${whichever}_layer. I hoped there would be something we could
> already use as the index, but I can't spot it, so this will be more than the
> one-liner I was hoping for!

If you can get that info from the hardware and injecting an error into
a DIMM gives you the correct DIMM number so that we can increment the
proper counter, then you're golden. I don't think that works reliably on
x86, though, therefore the lumping together.
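
For the index idea quoted above, a minimal userspace sketch could look
like the following. It assumes the 16-bit module handle in the error
record is a stable, unique key per DIMM, which is exactly the part that
would need checking on real firmware. struct dimm_map, dimm_map_add()
and dimm_map_lookup() are hypothetical names, not existing ghes_edac
code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_DIMMS 32	/* arbitrary cap for this sketch */

/* Hypothetical table filled while the DIMMs are being counted. */
struct dimm_map {
	uint16_t handle[MAX_DIMMS];
	size_t count;
};

/* Record a handle; returns the index assigned to it, or -1 if full. */
static int dimm_map_add(struct dimm_map *m, uint16_t handle)
{
	assert(m != NULL);
	if (m->count >= MAX_DIMMS)
		return -1;
	m->handle[m->count] = handle;
	return (int)m->count++;
}

/*
 * Translate a handle from an error record into a layer position.
 * Returns -1 when the handle is unknown, which matches the "no
 * location info" convention for layer positions.
 */
static int dimm_map_lookup(const struct dimm_map *m, uint16_t handle)
{
	size_t i;

	assert(m != NULL);
	for (i = 0; i < m->count; i++)
		if (m->handle[i] == handle)
			return (int)i;
	return -1;
}
```

The lookup result could then be used as the position for whichever
layer represents the DIMM, with -1 falling back to the lumped count.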

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--