linux-kernel - Re: [PATCH] EDAC, dmc520:: add DMC520 EDAC driver

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <7d404c61-a916-5d5e-35cc-0f6f016cd8a9@arm.com>
Date:   Wed, 23 Jan 2019 18:36:23 +0000
From:   James Morse <james.morse@....com>
To:     Rui Zhao <ruizhao@...rosoft.com>
Cc:     Sasha Levin <sashal@...nel.org>, "bp@...en8.de" <bp@...en8.de>,
        "mchehab@...nel.org" <mchehab@...nel.org>,
        "linux-edac@...r.kernel.org" <linux-edac@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Linux Kernel <linux-kernel@...rosoft.com>,
        "will.deacon@....com" <will.deacon@....com>,
        "okaya@...nel.org" <okaya@...nel.org>
Subject: Re: [PATCH] EDAC, dmc520:: add DMC520 EDAC driver

Hi Rui,

On 23/01/2019 00:42, Rui Zhao wrote:
> On Monday, January 21, 2019 9:09 AM, James Morse wrote:
>> It would be good if we can make this generic, so it works on all platforms with
>> a DMC520, possibly along with other components. (e.g. the a15 L2 driver posted
>> recently).
>>
>> This will mostly be getting the DT right, as we can refactor the code when a
>> second user comes along, but can't change the DT format.

> Agreed. It'd be good if we can move all platform specific settings to DT such
> that a second user can update the code and won't break the DT for original
> device. I'll think about the DT format to make it more generic.

When the time comes, could you post a dt-binding as the first patch? These add
the documentation under Documentation/device-tree/bindings, and need to be CC'd
to the DT folks.


>> The TRM describes 'a set of interrupts', which ones does the binding anticipate
>> are wired up for this? There are separate status bits for corrected and
>> uncorrected, and one pair for dram versus ram. (not sure what this corresponds
>> with).

>> Do we care about the link-error interrupt?

> Will add ram interrupts handling, and make it configurable in DT.

I don't think the code needs to change, we just need to make it clear these are
the dram interrupts, and they are different numbers.
This means someone can add the ram interrupts later, and not have to
disambiguate them to keep your platform going.

(its this stuff that the binding document describes)


> Could you please
> share a bit more info on link-error interrupt? TRM doesn't have detailed info about
> it.

I only have the TRM too! The section titled 'RAS' on page 2-23 talks about 'link
error protection', so it might be relevant so someone. We don't care about this
now, but if someone does in the future, we will need to have a way of adding it
to the DT.

> Would like to know what's the impact if this error happens, and how to fit it
> with current reporting in EDAC core.

At a guess the interrupt triggers when link_err_count increases. (link_err has
an overflow bit, so the interrupt must be related to a counter).

If we could associate a link with a layer in edac, we could report errors
against that point. But I've no idea how 'links' correspond with 'ranks and banks'!


>>> +     layers[1].type = EDAC_MC_LAYER_CHANNEL;
>>> +     layers[1].size = DMC520_EDAC_CHANS;
>>> +     layers[1].is_virt_csrow = false;

>> If you can read the rank count, why hard code the bank?
> 
>> (which is what I assume channels corresponds to ... although this has confused
>> me before [0]).

> I misunderstood what channel meant. For bank, we can read config from the register.

Its an assumption. Channel seems to be an overloaded term with different meanings.
Fortunately edac drivers get to pick what their layers mean. It looks like all
this controls is the names that come out of sysfs.

rank/bank are part of the address decoding, as we can read both sizes I think it
makes sense to use both as it should further localise the error.


>>> +     mci->scrub_cap = SCRUB_HW_SRC;
>>> +     mci->scrub_mode = SCRUB_NONE;

>> Is this saying the device supports error scrubbing, but its disabled?
>> Do we know that?

>> Can the user try and turn it on? (I can't find anything that reads this!)

> We don’t want to change what’s already configured.

I don't think we could if we wanted to. This thing has 'READY' and 'CONFIG'
modes, the scrubbing can only be written to in 'CONFIG' mode. I'm willing to bet
we need it to be in 'READY' to be executing the kernel from dram.


> Will change scrub_mode to HW.

Do we know its enabled? This is something firmware has to set up, someone else's
platform may do it differently.

I think should read one of the scrub control registers to find out if its turned on.

But, I can't find what uses this value ...


>>> +     /* Enable interrupts */
>>> +     dmc520_write_reg(edac,
>>> +                      DRAM_ECC_INT_CE_MASK | DRAM_ECC_INT_UE_MASK,
>>> +                      REG_OFFSET_INTERRUPT_CONTROL);

>> What if they were enabled before? (e.g. enabled by firmware, the bootloader or
>> kdump). If they're already enabled, can we race with the interrupt handler on
>> another CPU and get double reporting of the error counter?

> We don't expect interrupts to be enabled by firmware or bootloader if this driver
> is enabled.

What about a previous instance of this driver? Linux supports kexec, kdump and
hibernate, all of which cause us to inherit a slightly used platform.


> If firmware enables it, they're suppose to handle the interrupt.

Ah, so you still have resident firmware!
How come your firmware trusts linux not to turn off the memory controller?!
These things are usually protected by trust zone so the OS can't pull the memory
from under firmware's feet.


Thanks,

James