lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d4ee6d3a-b3ac-2986-40db-8423de7e960a@arm.com>
Date:   Tue, 28 Aug 2018 18:09:47 +0100
From:   James Morse <james.morse@....com>
To:     wufan <wufan@...eaurora.org>
Cc:     'Tyler Baicar' <baicar.tyler@...il.com>,
        'Tyler Baicar' <tbaicar@...eaurora.org>,
        'Linux Kernel Mailing List' <linux-kernel@...r.kernel.org>,
        harba@....qualcomm.com, 'Borislav Petkov' <bp@...en8.de>,
        mchehab@...nel.org,
        'arm-mail-list' <linux-arm-kernel@...ts.infradead.org>,
        linux-edac@...r.kernel.org
Subject: Re: [RFC PATCH] EDAC, ghes: Enable per-layer error reporting for ARM

Hi Fan,

On 24/08/18 15:30, wufan wrote:
>> Why get avoid the layer stuff? Isn't counting DIMM/memory-devices what
>> EDAC_MC_LAYER_SLOT is for?
> 
> Borislav has explained it in his response. Here let me elaborate a little more.
> To use the layer information you need an accurate way to pinpoint each component
> in the layer and the parent components in the layers above. For example, to use
> EDAC_MC_LAYER_SLOT you also need information for the parent layer say
> EDAC_MC_LAYER_CHANNEL, or another layer on top say EDAC_MC_LAYER_BRANCH.

I haven't spotted anything that forces a particular meaning/topology on these
types. (there are four of them, but only three 'levels')

i3000_edac.c has EDAC_MC_LAYER_CHIP_SELECT then EDAC_MC_LAYER_CHANNEL,
but i5400_edac.c has EDAC_MC_LAYER_BRANCH then EDAC_MC_LAYER_CHANNEL.
pnd2_edac.c agrees that channel comes before slot, but thunderx_edac.c just uses
slot directly....

ghes-edac already describes memory as 'EDAC_MC_LAYER_ALL_MEM', with num_dimms, I
think we just need to get 'the' dimm number. Using the cper-module-handle means
we don't have to worry about firmware's count of dimms being different, as we
both agree its smbios-type-17 we're talking about.


> There
> are no clear ways to get the information from SMBIOS table. In the case of "memory
> channel" we looked at type 37 which has the exact spelling but it was introduced
> to support RamBus and Synclink. Not sure we can readily use it for modern
> architecture concept of "channel/slot". 

I think we should avoid this 'channel' thing as it means different things to
different people!

We can use ghes:card smbios:physical-memory-array as the UEFI spec tells us they
are the same, and we don't actually need to know what they mean.


>  I think it is good enough if we can pin each error to the corresponding DIMM.
> At the end of the day DIMMs are what customer can replace in the memory system
> and that's all that they care about. For the manufacturers of the board/chips
> they have the knowledge to map the specific DIMMs to the upper layer components,
> so they can easily collect error counter data for upper layers. 

I agree.


>> CPER's "Memory Error Record 2" thinks that "NODE, CARD and MODULE
>> should provide the information necessary to identify the failing FRU". As
>> EDAC has three 'levels', these are what they should correspond to for ghes-
>> edac.
>>
>> I assume NODE means rack/chassis in some distributed system. Lets ignore it
>> as it doesn't seem to map to anything in the SMBIOS table.
> 
> How about type 4 "Processor Information"?

As the spec doesn't tell us what the field means, we can't really do anything
other than print the value out.


>> 'Card' doesn't mean much to me, but it maps to SMBIOS:17 "Memory Array
>> Structure", which the Memory Device structure also points to.
>> Card then must mean "a collection of memory devices (DIMMs) that operate
>> together to form an address space".
>>
>> This might be what I think of as a memory-controller, or it might be
>> something more complicated. Regardless, the CPER records think its relevant.
> 
> Originally I thought "Card" were memory channel. But looking at the definition
> of "Card Handle" in CPER: "... this field contains the SMBIOS handle for the
> Type 16 Memory Array Structure that represents the memory card". So Card is
> memory controller or something similar to that.
> Right now ghes-edac assumes
> one mc. We probably need to map mc(s) to the type 16 instances in SMBIOS table. 

I think we should ignore 'mc's, and just report the dimm numbers.

ghes-edac only cares about the number of dimms today, and this would work on
systems that only describe the dimms in the smbios table.


Thanks,

James

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ