linux-kernel - RE: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum 010001

Message-ID: <8898674D84E3B24BA3A2D289B872026A6A30C360@G01JPEXMBKW03>
Date:   Tue, 5 Feb 2019 12:49:28 +0000
From:   "Zhang, Lei" <zhang.lei@...fujitsu.com>
To:     'Catalin Marinas' <catalin.marinas@....com>
CC:     "'linux-kernel@...r.kernel.org'" <linux-kernel@...r.kernel.org>,
        "'Mark Rutland'" <mark.rutland@....com>,
        "'linux-arm-kernel@...ts.infradead.org'" 
        <linux-arm-kernel@...ts.infradead.org>,
        "'will.deacon@....com'" <will.deacon@....com>,
        "'james.morse@....com'" <james.morse@....com>
Subject: RE: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX erratum
 010001

Hi Catalin,

> -----Original Message-----
> From: Catalin Marinas [mailto:catalin.marinas@....com]
> Sent: Wednesday, January 30, 2019 3:11 AM
> To: Zhang, Lei 
> Cc: 'linux-kernel@...r.kernel.org'; 'Mark Rutland';
> 'linux-arm-kernel@...ts.infradead.org'; 'will.deacon@....com';
> 'james.morse@....com'
> Subject: Re: [PATCH v3 0/1] arm64: Add workaround for Fujitsu A64FX
> erratum 010001
> 
> Could you please copy the whole description from the cover letter to the
> actual patch and only send one email (full description as in here
> together with the patch)? If we commit this to the kernel, it would be
> useful to have the information in the log for reference later on.

Thank you for your suggestion. I will send a single email containing the
whole description together with the patch.

> So this looks like new information on the hardware behaviour since the
> v2 of the patch. Can this fault occur for any type of instruction
> accessing the memory or only for SVE instructions?

With this erratum, any load/store instruction (both Armv8 and SVE),
other than a non-fault access, may cause a spurious fault.

> How likely is it to trigger this erratum? In other words, aren't we
> better off with a spurious fault that we ignore rather than toggling the
> TCR_ELx.NFD1 bit?

Although the erratum occurs extremely rarely, this patch is required
to handle the issue pointed out by James and Mark in:
  https://lkml.org/lkml/2019/1/22/533
  https://lkml.org/lkml/2019/1/22/642

As James and Mark pointed out, if the erratum occurs at EL1/EL2 before
the system registers ELR and SPSR have been saved, these registers will
be overwritten and their contents will be lost.

So we keep TCR_ELx.NFD1=0 while running at EL1/EL2.
Please see the supplemental explanation at the end of this mail.

> The problem is that this bit may be cached in the TLB (I haven't checked
> the ARM ARM but that's usually the case with the TCR_ELx bits). If
> that's the case, you can't guarantee a change unless you also perform
> a TLBI VMALL. Arguably, if Fujitsu's microarchitecture doesn't cache the
> NFD bits in the TLB, we could apply the workaround but I'd rather have
> the spurious trap if it's not too often.

On the A64FX microarchitecture, a TLBI VMALL is not necessary to
guarantee that a change to TCR_ELx.{NFD0,NFD1} takes effect.

> Could speculative loads also trigger this? Another option would be to
> toggle it during kernel_neon_begin/end (with the caveat of TLBI as
> mentioned above).

No, a speculative load does not trigger this erratum. 

Here are supplemental explanations:

Since this erratum occurs only when TCR_ELx.NFD1=1,
we keep TCR_ELx.NFD1=0 while running at EL1/EL2.
As a result, the erratum can occur only at EL0, where the
spurious trap can be handled by the fault handler.

To keep TCR_ELx.NFD1=0 at EL1/EL2, there are two critical
sections that must be handled for the implementation to be complete.
One is the transition from EL0 to EL1/EL2, and the other
is the transition from EL1/EL2 to EL0.

For the former case, I set TCR_ELx.NFD1=0 in tramp_map_kernel.
No load/store instruction is executed at EL1/EL2 before
TCR_ELx.NFD1 is set to 0, so the undefined fault cannot occur.

For the latter case, I set TCR_ELx.NFD1=1 in tramp_unmap_kernel.
No load/store instruction is executed at EL1/EL2 after
TCR_ELx.NFD1 is set to 1, so the undefined fault cannot occur.
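
As a rough illustration only (this is not the patch itself, which does
the work in the assembly trampoline macros), the bit handling for the
two transitions could be sketched in C as below. The bit positions of
NFD0 (53) and NFD1 (54) follow the architectural TCR_EL1 layout; the
helper names are hypothetical, and the sketch does not show the system
register read/write or the ordering constraints described above.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural bit positions in TCR_EL1. */
#define TCR_NFD0 (UINT64_C(1) << 53)
#define TCR_NFD1 (UINT64_C(1) << 54)

static_assert(TCR_NFD1 == UINT64_C(0x0040000000000000),
              "NFD1 is expected to be bit 54");

/*
 * Hypothetical helper: TCR value to use after entering EL1/EL2
 * (the tramp_map_kernel side). Clears NFD1 and leaves every
 * other bit unchanged.
 */
static inline uint64_t tcr_on_kernel_entry(uint64_t tcr)
{
    return tcr & ~TCR_NFD1;
}

/*
 * Hypothetical helper: TCR value to use just before returning to EL0
 * (the tramp_unmap_kernel side). Sets NFD1 and leaves every
 * other bit unchanged.
 */
static inline uint64_t tcr_on_kernel_exit(uint64_t tcr)
{
    return tcr | TCR_NFD1;
}
```

Only NFD1 is changed in either direction, so NFD0 and the remaining
TCR fields keep whatever values the kernel has already programmed.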

To handle the spurious fault at EL0,
I replace the fault handler for Data abort DFSC=0b111111 with
a new fault handler that ignores the spurious fault caused by the erratum.
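
A minimal sketch of the kind of check such a handler might start from,
assuming the standard ESR_ELx layout (EC in bits [31:26], DFSC in bits
[5:0]) and EC 0x24 for a data abort taken from a lower exception level.
The function name is hypothetical, and a real handler would likely need
further checks (for example, that it is running on an affected CPU).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ESR_EC_SHIFT     26
#define ESR_EC_MASK      UINT64_C(0x3F)
#define ESR_EC_DABT_LOW  UINT64_C(0x24)  /* data abort from a lower EL */
#define ESR_DFSC_MASK    UINT64_C(0x3F)
#define DFSC_ERRATUM     UINT64_C(0x3F)  /* DFSC=0b111111 */

static_assert(DFSC_ERRATUM == 63, "DFSC 0b111111 is the last fault code");

/*
 * Hypothetical helper: returns true when the syndrome describes a
 * data abort taken from EL0 with DFSC=0b111111, i.e. the case the
 * replacement handler is meant to ignore.
 */
static inline bool is_spurious_el0_dabt(uint64_t esr)
{
    uint64_t ec = (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
    uint64_t dfsc = esr & ESR_DFSC_MASK;

    return ec == ESR_EC_DABT_LOW && dfsc == DFSC_ERRATUM;
}
```

A data abort with the same DFSC taken at EL1/EL2 is deliberately not
matched here, since with NFD1=0 in the kernel the erratum is not
expected to occur there.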

Thanks,
Zhang Lei