linux-kernel - Re: [PATCH] arm64 memory accesses may cause undefined fault on Fujitsu-A64FX


Message-ID: <20190122152339.GD52887@lakrids.cambridge.arm.com>
Date:   Tue, 22 Jan 2019 15:23:39 +0000
From:   Mark Rutland <mark.rutland@....com>
To:     "Zhang, Lei" <zhang.lei@...fujitsu.com>
Cc:     "'catalin.marinas@....com'" <catalin.marinas@....com>,
        "'will.deacon@....com'" <will.deacon@....com>,
        "'linux-arm-kernel@...ts.infradead.org'" 
        <linux-arm-kernel@...ts.infradead.org>,
        "'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] arm64 memory accesses may cause undefined fault on
 Fujitsu-A64FX

On Tue, Jan 22, 2019 at 02:05:26AM +0000, Zhang, Lei wrote:
> Hi, Mark
> 
> Thanks for your comments, and sorry for the late reply.
> 
> > -----Original Message-----
> > * Under what conditions can the fault occur? e.g. is this in place of
> >   some other fault, or completely spurious?

> This fault can occur completely spuriously under a specific hardware
> condition and instruction order.

Ok.

Can you be more specific regarding the conditions under which this
occurs? e.g. can this only occur with certain instruction sequences?

> > * Does this only occur for data abort? i.e. not instruction aborts?

> Yes. This fault only occurs for data abort.
> 
> > * How often does this fault occur?

> In my test, this fault occurs once every several times in the OS boot
> sequence, and after the completion of OS boot, this fault has never
> occurred.
> In my opinion, this fault rarely occurs after the completion of OS
> boot.

I'm very concerned that this could occur during boot (even if rarely),
as that implies this is being taken EL1->EL1 or EL2->EL2.

Which exception levels can the fault be taken from?

e.g. is it possible for this fault to be taken from EL2 to EL2, or from
EL3 to EL3?

> > * Does this only apply to Stage-1, or can the same faults be taken at
> >   Stage-2?
> This fault can be taken only at Stage-1.
> 
> > I'm a bit surprised by the single retry. Is there any guarantee that a
> > thread will eventually stop delivering this fault code?

> I guarantee that a thread will stop delivering this fault code with
> this patch.
> The hardware condition which causes this fault is reset at exception
> entry, therefore execution of at least one instruction is guaranteed
> by this single retry.

Ok, so we can guarantee forward progress, but in the worst case that's
down to single-step performance levels.

> > Note that all CPUs and threads share the do_bad_ignore_first variable,
> > so this is going to behave non-deterministically and kill threads in
> > some cases.

I see now that I'd misread the code, and we'll always retry the fault
(on A64FX), so this is not true.

> > This code is also preemptible, so checking the MIDR here doesn't make
> > much sense. Either this is always uniform (and we can check once in the
> > errata framework), or it's variable (e.g. on a big.LITTLE system)
> > and we need to avoid preemption up until this point.

... though this may be a problem if A64FX is integrated into a
non-uniform system (and we could unwittingly kill threads).
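
For illustration only (this is not code from the patch or from the
kernel): a minimal userspace sketch of the per-CPU MIDR match that such
a check relies on, and of the "any CPU matches" decision that a one-time
errata check would make. The field positions follow the architectural
MIDR_EL1 layout; the implementer and part number constants are assumed
to match what Linux uses for A64FX, and the function names are
hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* MIDR_EL1 fields: implementer in bits [31:24], part number in [15:4]. */
#define MIDR_IMPLEMENTER(midr)  (((midr) >> 24) & 0xffu)
#define MIDR_PARTNUM(midr)      (((midr) >> 4) & 0xfffu)

/* Assumed values for Fujitsu A64FX; check against cputype.h. */
#define IMP_FUJITSU  0x46u
#define PART_A64FX   0x001u

/* Hypothetical helper: does this MIDR value identify an A64FX core? */
bool midr_is_a64fx(uint32_t midr)
{
	return MIDR_IMPLEMENTER(midr) == IMP_FUJITSU &&
	       MIDR_PARTNUM(midr) == PART_A64FX;
}

/*
 * Hypothetical helper: decide once, over all CPUs, whether the
 * workaround is needed. On a non-uniform system this returns true if
 * any single CPU matches, so the decision does not depend on which CPU
 * a preemptible thread happens to be running on.
 */
bool any_cpu_is_a64fx(const uint32_t *midrs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (midr_is_a64fx(midrs[i]))
			return true;
	}
	return false;
}
```

Deciding once over the whole set of CPUs avoids the case where a thread
reads the MIDR on one CPU and takes the fault on another.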

> > Rather than dynamically checking the MIDR, this should use the errata
> > framework, and if any A64FX CPU is discovered, set an erratum cap like
> > ARM64_WORKAROUND_CONFIG_FUJITSU_ERRATUM_010001, so we can do something
> > like:

> I will try to provide a new patch reflecting your comments today.
> Unfortunately this bug may occur before init_cpu_hwcaps_indirect_list
> is called.

As above, I'm very concerned that this could be taken from kernel
context. There are a number of cases where we cannot handle such faults:

* During boot, when we hand-over between agents (e.g. UEFI->kernel).

* Before VBAR_EL1 is initialized.

* During exception entry/return sequences (including when the KPTI
  trampoline vectors are installed).

* While the KVM vectors are installed (for VHE).

Are there any constraints on when the fault can be raised? Under which
conditions does this happen?

Thanks,
Mark.