Message-ID: <20190122152339.GD52887@lakrids.cambridge.arm.com>
Date: Tue, 22 Jan 2019 15:23:39 +0000
From: Mark Rutland <mark.rutland@....com>
To: "Zhang, Lei" <zhang.lei@...fujitsu.com>
Cc: "'catalin.marinas@....com'" <catalin.marinas@....com>,
"'will.deacon@....com'" <will.deacon@....com>,
"'linux-arm-kernel@...ts.infradead.org'"
<linux-arm-kernel@...ts.infradead.org>,
"'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH] arm64 memory accesses may cause undefined fault on Fujitsu-A64FX

On Tue, Jan 22, 2019 at 02:05:26AM +0000, Zhang, Lei wrote:
> Hi, Mark
>
> Thanks for your comments, and sorry for the late reply.
>
> > -----Original Message-----
> > * Under what conditions can the fault occur? e.g. is this in place of
> > some other fault, or completely spurious?
> This fault can occur completely spuriously under a specific hardware
> condition and instruction order.
Ok.
Can you be more specific regarding the conditions under which this
occurs? e.g. can this only occur with certain instruction sequences?
> > * Does this only occur for data abort? i.e. not instruction aborts?
> Yes. This fault only occurs for data abort.
>
> > * How often does this fault occur?
> In my tests, this fault occurs roughly once in every several OS boot
> sequences, and it has never occurred after OS boot completed.
> In my opinion, this fault rarely occurs after the completion of OS
> boot.
I'm very concerned that this could occur during boot (even if rarely),
as that implies this is being taken EL1->EL1 or EL2->EL2.
Which exception levels can the fault be taken from?
e.g. is it possible for this fault to be taken from EL2 to EL2, or from
EL3 to EL3?
> > * Does this only apply to Stage-1, or can the same faults be taken at
> > Stage-2?
> This fault can be taken only at Stage-1.
>
> > I'm a bit surprised by the single retry. Is there any guarantee that a
> > thread will eventually stop delivering this fault code?
> I guarantee that a thread will stop delivering this fault code with
> this patch.
> The hardware condition which causes this fault is reset at exception
> entry, therefore execution of at least one instruction is guaranteed
> by this single retry.
Ok, so we can guarantee forward progress, but in the worst case that's
down to single-step performance levels.
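
To make that worst case concrete, here is a rough user-space model, not
kernel code and not a description of the real hardware. It assumes only
what was stated above: the condition is reset at exception entry, so one
retry always completes. All names (model_cpu, model_step, ...) are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: a single latch that makes the next access fault. */
struct model_cpu {
	bool spurious_latch;	/* set by "hardware", cleared at exception entry */
	unsigned long faults;	/* spurious faults taken */
	unsigned long retired;	/* instructions that completed */
};

/* Exception entry clears the latch; the handler then returns to retry. */
static void model_take_fault(struct model_cpu *cpu)
{
	cpu->spurious_latch = false;
	cpu->faults++;
}

/* Execute one instruction and return how many attempts it needed. */
static unsigned int model_step(struct model_cpu *cpu, bool hw_sets_latch)
{
	unsigned int attempts = 1;

	assert(cpu != NULL);
	cpu->spurious_latch = hw_sets_latch;
	if (cpu->spurious_latch) {
		model_take_fault(cpu);
		attempts++;	/* the retry; the latch is clear by now */
	}
	cpu->retired++;
	return attempts;
}

/* Worst case assumed here: every instruction faults once before completing. */
static unsigned long model_run_worst_case(struct model_cpu *cpu, unsigned long n)
{
	unsigned long attempts = 0;
	unsigned long i;

	for (i = 0; i < n; i++)
		attempts += model_step(cpu, true);
	return attempts;
}
```

In this model every instruction costs one exception plus one retry, which
is the single-step-like behaviour I mean. Whether the real condition can
recur that often is exactly what I'm asking about.
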
> > Note that all CPUs and threads share the do_bad_ignore_first variable,
> > so this is going to behave non-deterministically and kill threads in
> > some cases.
I see now that I'd misread the code, and we'll always retry the fault
(on A64FX), so this is not true.
> > This code is also preemptible, so checking the MIDR here doesn't make
> > much sense. Either this is always uniform (and we can check once in the
> > errata framework), or it's variable (e.g. on a big.LITTLE system)
> > and we need to avoid preemption up until this point.
... though this may be a problem if A64FX is integrated into a
non-uniform system (and we could unwittingly kill threads).
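
As a rough illustration of why a system-wide flag avoids the preemption
problem, here is a user-space sketch. It is not the arm64 errata
framework API. The model_* names are hypothetical, and the implementer
and part number values for A64FX are my assumption and should be checked
against the real definitions. The MIDR_EL1 field positions (implementer
in bits [31:24], part number in bits [15:4]) are architectural.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* MIDR_EL1 layout: implementer in [31:24], part number in [15:4]. */
#define MODEL_MIDR_IMPLEMENTER(midr)	(((midr) >> 24) & 0xffu)
#define MODEL_MIDR_PARTNUM(midr)	(((midr) >> 4) & 0xfffu)

/* Assumed ID values for A64FX; verify before relying on them. */
#define MODEL_IMP_FUJITSU	0x46u
#define MODEL_PART_A64FX	0x001u

static bool model_midr_is_a64fx(uint32_t midr)
{
	return MODEL_MIDR_IMPLEMENTER(midr) == MODEL_IMP_FUJITSU &&
	       MODEL_MIDR_PARTNUM(midr) == MODEL_PART_A64FX;
}

/* Hypothetical system-wide flag standing in for an erratum capability. */
static bool model_cap_a64fx_erratum;

/* Run once per CPU as it comes online; the flag is never cleared. */
static void model_cpu_online(uint32_t midr)
{
	if (model_midr_is_a64fx(midr))
		model_cap_a64fx_erratum = true;
}

/* Fault-path check: the answer does not depend on the current CPU. */
static bool model_should_retry_fault(void)
{
	return model_cap_a64fx_erratum;
}
```

Because the fault path reads a flag rather than the current CPU's MIDR,
migrating between the fault and the check cannot change the answer. The
cost is that every CPU retries that fault code once any A64FX has been
seen, and nothing is covered before detection has run.
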
> > Rather than dynamically checking the MIDR, this should use the errata
> > framework, and if any A64FX CPU is discovered, set an erratum cap like
> > ARM64_WORKAROUND_CONFIG_FUJITSU_ERRATUM_010001, so we can do something
> > like:
> I will try to provide a new patch reflecting your comments today.
> Unfortunately, this bug may occur before init_cpu_hwcaps_indirect_list
> is called.
As above, I'm very concerned that this could be taken from kernel
context. There are a number of cases where we cannot handle such faults:
* During boot, when we hand-over between agents (e.g. UEFI->kernel).
* Before VBAR_EL1 is initialized.
* During exception entry/return sequences (including when the KPTI
trampoline vectors are installed).
* While the KVM vectors are installed (for VHE).
Are there any constraints on when the fault can be raised? Under which
conditions does this happen?
Thanks,
Mark.