lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Wed, 19 Jun 2019 18:22:37 +0100
From:   James Morse <james.morse@....com>
To:     "Hawa, Hanna" <hhhawa@...zon.com>
Cc:     robh+dt@...nel.org, mark.rutland@....com, bp@...en8.de,
        mchehab@...nel.org, davem@...emloft.net,
        gregkh@...uxfoundation.org, nicolas.ferre@...rochip.com,
        paulmck@...ux.ibm.com, dwmw@...zon.co.uk, benh@...zon.com,
        ronenk@...zon.com, talel@...zon.com, jonnyc@...zon.com,
        hanochu@...zon.com, linux-edac@...r.kernel.org,
        devicetree@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC

Hi Hawa,

On 17/06/2019 14:00, Hawa, Hanna wrote:
>> I don't think it can, on a second reading, it looks to be even more complicated than I
>> thought! That bit is described as disabling forwarding of uncorrected data, but it looks
>> like the uncorrected data never actually reaches the other end. (I'm unsure what 'flush'
>> means in this context.)
>> I was looking for reasons you could 'know' that any reported error was corrected. This was
>> just a bad suggestion!

> Is there interrupt for un-correctable error?

The answer here is somewhere between 'not really' and 'maybe'.
There is a signal you may have wired-up as an interrupt, but its not usable from linux.

A.8.2 "Asychronous error signals" of the A57 TRM [0] has:
| nINTERRIRQ output Error indicator for an L2 RAM double-bit ECC error.
("7.6 Asynchronous errors" has more on this).

Errors cause L2ECTLR[30] to get set, and this value output as a signal, you may have wired
it up as an interrupt.

If you did, beware its level sensitive, and can only be cleared by writing to L2ECTLR_EL1.
You shouldn't allow linux to access this register as it could mess with the L2
configuration, which could also affect your EL3 and any secure-world software.

The arrival of this interrupt doesn't tell you which L2 tripped the error, and you can
only clear it if you write to L2ECTLR_EL1 on a CPU attached to the right L2. So this isn't
actually a shared (peripheral) interrupt.

This stuff is expected to be used by firmware, which can know the affinity constraints of
signals coming in as interrupts.


> Does 'asynchronous errors' in L2 used to report UE?

>From "7.2.4 Error correction code" single-bit errors are always corrected.
A.8.2 quoted above gives the behaviour for double-bit errors.


> In case no interrupt, can we use die-notifier subsystem to check if any error had occur
> while system shutdown?

notify_die() would imply a synchronous exception that killed a thread. SError are a whole
lot worse. Before v8.2 these are all treated as 'uncontained': unknown memory corruption.
Which in your L2 case is exactly what happened. The arch code will panic().

If your driver can print something useful to help debug the panic(), then a panic_notifier
sounds appropriate. But you can't rely on these notifiers being called, as kdump has some
hooks that affect if/when they run.

(KVM will 'contain' SError that come from a guest to the guest, as we know a distinct set
of memory was in use. You may see fatal error counters increasing without the system
panic()ing)

contained/uncontained is part of the terminology from the v8.2 RAS spec [1].


Thanks,

James


[0]
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0488c/DDI0488C_cortex_a57_mpcore_r1p0_trm.pdf
[1]
https://static.docs.arm.com/ddi0587/ca/ARM_DDI_0587C_a_RAS.pdf?_ga=2.148234679.1686960568.1560964184-897392434.1556719556

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ