Message-ID: <0a94db2a-2569-ac46-1a79-a05f46a4ea6f@arm.com>
Date:   Tue, 28 Aug 2018 18:09:24 +0100
From:   James Morse <james.morse@....com>
To:     Borislav Petkov <bp@...en8.de>
Cc:     Tyler Baicar <baicar.tyler@...il.com>,
        Tyler Baicar <tbaicar@...eaurora.org>, wufan@...eaurora.org,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        harba@....qualcomm.com, mchehab@...nel.org,
        arm-mail-list <linux-arm-kernel@...ts.infradead.org>,
        linux-edac@...r.kernel.org
Subject: Re: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM

Hi Boris,

On 24/08/18 13:01, Borislav Petkov wrote:
> On Fri, Aug 24, 2018 at 10:48:24AM +0100, James Morse wrote:
>> so edac_raw_mc_handle_error() has no clue where the error happened. (I haven't
>> read what it does with this information yet).
> 
> See edac_inc_ce_error(), for example - it uses the layers which are not
> negative (-1) to increment the error counts of the respective layer. It
> all depends on what granularity of the hardware part you're reporting
> the error for: is it a DIMM rank, a whole DIMM or for a channel which
> can span multiple DIMM ranks. And so on...
> 
> Look at some of the drivers and how they're doing that layering. It all
> depends on whether you can get the precise info from the hw.
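
To make the counting rule above concrete, here is a simplified, self-contained model of how per-layer counters can be incremented until the first negative position is reached. It is only a sketch loosely modelled on the description of edac_inc_ce_error(); the struct, the fixed array sizes and every sketch_* name are assumptions for illustration, not the kernel's definitions.

```c
#include <stdbool.h>

#define SKETCH_MAX_LAYERS 3   /* assumed bound for this sketch */
#define SKETCH_MAX_SLOTS  64  /* assumed bound for this sketch */

/* Hypothetical stand-in for the memory controller bookkeeping. */
struct sketch_mci {
	int n_layers;
	unsigned int layer_size[SKETCH_MAX_LAYERS];
	unsigned int ce_total;   /* all corrected errors */
	unsigned int ce_noinfo;  /* errors with no location information */
	unsigned int ce_per_layer[SKETCH_MAX_LAYERS][SKETCH_MAX_SLOTS];
};

/*
 * Count 'count' corrected errors. Layers are walked from the top; the walk
 * stops at the first negative position, so only the layers whose location
 * is known have a counter incremented.
 */
static void sketch_inc_ce(struct sketch_mci *mci, bool per_layer,
			  const int pos[SKETCH_MAX_LAYERS], unsigned int count)
{
	unsigned int index = 0;

	mci->ce_total += count;

	if (!per_layer) {
		mci->ce_noinfo += count;
		return;
	}

	for (int i = 0; i < mci->n_layers; i++) {
		if (pos[i] < 0)
			break;

		index += (unsigned int)pos[i];
		if (index >= SKETCH_MAX_SLOTS)
			break;  /* outside the sketch's fixed table */

		mci->ce_per_layer[i][index] += count;

		/* Flatten the position for the next, finer-grained layer. */
		if (i < mci->n_layers - 1)
			index *= mci->layer_size[i + 1];
	}
}
```

With two layers of sizes 2 and 4, a position of {1, 2, -1} bumps slot 1 of the first layer and slot 6 of the second, while {1, -1, -1} bumps only the first layer's slot.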

Hmmm, in this example we need the information from firmware, as that is the
only place ghes-edac gets its information.

We already count the module/device/dimms in the smbios table, and memory is
described as 'EDAC_MC_LAYER_ALL_MEM' with num_dimms. I think all we're missing
in ghes_edac_report_mem_error() is which dimm the error was on. We have the
handle, we just need a number between 1 and num_dimms.
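
One way to turn the handle into such a number is to remember each handle in the order the dimms are counted, then search that table when an error arrives. This is a minimal sketch under assumptions: the table type, its size limit and the dimm_map_* helpers are hypothetical, the handles are assumed to be 16-bit values as in the smbios memory device entries, and the index is zero-based.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_DIMMS 64  /* hypothetical upper bound for the sketch */

/* Hypothetical table: one smbios handle per dimm, in discovery order. */
struct dimm_handle_map {
	uint16_t handle[SKETCH_MAX_DIMMS];
	size_t num_dimms;
};

/*
 * Record a handle while walking the smbios table.
 * Returns the zero-based index assigned to it, or -1 if the table is full.
 */
static int dimm_map_add(struct dimm_handle_map *map, uint16_t handle)
{
	if (map == NULL || map->num_dimms >= SKETCH_MAX_DIMMS)
		return -1;

	map->handle[map->num_dimms] = handle;
	return (int)map->num_dimms++;
}

/*
 * Translate the handle found in an error record into a dimm index.
 * Returns -1 when the handle was never seen in the smbios table.
 */
static int dimm_map_lookup(const struct dimm_handle_map *map, uint16_t handle)
{
	if (map == NULL)
		return -1;

	for (size_t i = 0; i < map->num_dimms; i++) {
		if (map->handle[i] == handle)
			return (int)i;
	}
	return -1;
}
```

A linear search is assumed to be good enough here, since the table is small and errors should be rare.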

If it turns out firmware doesn't populate the handles in its cper records, then
we can keep e->enable_per_layer_report false when calling
edac_raw_mc_handle_error().
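
The fallback could look roughly like the sketch below. The struct is a hypothetical stand-in whose field names mirror the e->..._layer and e->enable_per_layer_report fields mentioned in this thread; using the top layer for the dimm index, and the handle_valid flag itself, are assumptions.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the error descriptor passed to EDAC. */
struct sketch_error_desc {
	int top_layer;
	int mid_layer;
	int low_layer;
	bool enable_per_layer_report;
};

/*
 * Fill in the layer position for one error.
 * handle_valid: firmware marked the handle in the cper record as valid.
 * dimm_index:   result of translating that handle, negative if unknown.
 */
static void sketch_set_layers(struct sketch_error_desc *e,
			      bool handle_valid, int dimm_index)
{
	/* Only one layer is used in this sketch; the others stay unknown. */
	e->mid_layer = -1;
	e->low_layer = -1;

	if (handle_valid && dimm_index >= 0) {
		e->top_layer = dimm_index;
		e->enable_per_layer_report = true;
	} else {
		/* No usable handle: keep lumping the error together. */
		e->top_layer = -1;
		e->enable_per_layer_report = false;
	}
}
```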

(I suggest we ignore 'card', and just do this for the device/dimms).


>> Naively I thought we could generate some index during ghes_edac_count_dimms(),
>> and use this as e->${whichever}_layer. I hoped there would be something we could
>> already use as the index, but I can't spot it, so this will be more than the
>> one-liner I was hoping for!
> 
> If you can get that info from the hardware and injecting an error into
> a DIMM gives you the correct DIMM number so that we can increment the
> proper counter, then you're golden. I don't think that works reliably on
> x86, though, therefore the lumping together.

... 'correct DIMM number' ...

Does x86 have another source of memory-topology information it needs to
correlate smbios with?

For arm there is nothing else describing the memory-topology, so as long as we
can correlate the smbios table and ghes:cper records through the handles, we can
get this working for all systems.


Thanks,

James
