linux-kernel - Re: [PATCH] arm64 memory accesses may cause undefined fault on Fujitsu-A64FX

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <d8a8107c-cd31-2517-79df-2ebfbbedb7da@arm.com>
Date:   Tue, 22 Jan 2019 14:42:18 +0000
From:   James Morse <james.morse@....com>
To:     "Zhang, Lei" <zhang.lei@...fujitsu.com>
Cc:     'Mark Rutland' <mark.rutland@....com>,
        "'catalin.marinas@....com'" <catalin.marinas@....com>,
        "'will.deacon@....com'" <will.deacon@....com>,
        "'linux-arm-kernel@...ts.infradead.org'" 
        <linux-arm-kernel@...ts.infradead.org>,
        "'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] arm64 memory accesses may cause undefined fault on
 Fujitsu-A64FX

Hello,

On 22/01/2019 02:05, Zhang, Lei wrote:
> Mark Rutland wrote:
>> * How often does this fault occur?
> In my test, this fault occurs once every several times
> in the OS boot sequence, and after the completion of OS boot,
> this fault have never occurred.
> In my opinion, this fault rarely occurs
> after the completion of OS boot.

Can you share anything about why this is? You mention a hardware-condition
that is reset at exception entry....

>> I'm a bit surprised by the single retry. Is there any guarantee that a
>> thread will eventually stop delivering this fault code?

> I guarantee that a thread will stop delivering this 
> fault code by the this patch.
> The hardware condition which cause this fault is 
> reset at exception entry, therefore execution of at 
> least one instruction is guaranteed by this single retry.

... so its possible to take this fault during kernel_entry when we've taken an irq?

This will overwrite the ELR and SPSR, (and possibly the FAR and ESR), meaning we've
 lost that information and can't return to the point in the kernel that took the
irq.

If we try, we might end up spinning through the irq handler, as the ELR might
now point
to el1_irq's kernel_entry.

We can spot we took an exception from the entry text ... but all we can do then
is panic().
I'm not sure its worth working around this if its just a matter of time before this
happens. (you mention its less likely after boot, it would be good to know why...)

Thanks,

James