linux-kernel - Re: framebuffer corruption due to overlapping stp instructions on arm64

Message-ID: <99fff4fe-afa9-f12f-a518-472a9dd1c530@arm.com>
Date:   Mon, 6 Aug 2018 13:42:20 +0100
From:   Robin Murphy <robin.murphy@....com>
To:     Mikulas Patocka <mpatocka@...hat.com>,
        Ard Biesheuvel <ard.biesheuvel@...aro.org>
Cc:     Thomas Petazzoni <thomas.petazzoni@...e-electrons.com>,
        Joao Pinto <Joao.Pinto@...opsys.com>,
        linux-pci <linux-pci@...r.kernel.org>,
        Jingoo Han <jingoohan1@...il.com>,
        Will Deacon <will.deacon@....com>,
        Russell King <linux@...linux.org.uk>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Matt Sealey <neko@...uhatsu.net>,
        Catalin Marinas <catalin.marinas@....com>,
        linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>
Subject: Re: framebuffer corruption due to overlapping stp instructions on
 arm64

On 06/08/18 11:25, Mikulas Patocka wrote:
[...]
>> None of this explains why some transactions fail to make it across
>> entirely. The overlapping writes in question write the same data to
>> the memory locations that are covered by both, and so the ordering in
>> which the transactions are received should not affect the outcome.
> 
> You're right that the corruption couldn't be explained just by reordering
> writes. My hypothesis is that the PCIe controller tries to disambiguate
> the overlapping writes, but the disambiguation logic was not tested and it
> is buggy. If there's a barrier between the overlapping writes, the PCIe
> controller won't see any overlapping writes, so it won't trigger the
> faulty disambiguation logic and it works.
> 
> Could the ARM engineers look if there's some chicken bit in Cortex-A72
> that could insert barriers between non-cached writes automatically?

I don't think there is, and even if there were, I imagine it would have a 
pretty hideous effect on non-coherent DMA buffers and the various other 
places in which we have Normal-NC mappings of actual system RAM.
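
For context, the argument quoted above (that overlapping writes carrying the same data cannot be corrupted by reordering alone) can be illustrated with a small model. This is only a sketch: `store16` and `copy_16_to_32` are hypothetical names standing in for a pair of 16-byte stp stores, not code from the test program or from any memcpy implementation.

```python
def store16(buf: bytearray, offset: int, data: bytes) -> None:
    """Stand-in for one 16-byte stp: writes exactly 16 bytes at offset."""
    assert len(data) == 16
    buf[offset:offset + 16] = data


def copy_16_to_32(src: bytes, n: int, tail_first: bool) -> bytearray:
    """Copy n bytes (16 <= n <= 32) using two 16-byte stores.

    One store covers the first 16 bytes and one covers the last 16 bytes,
    so for n < 32 the two stores overlap in the middle.
    """
    assert 16 <= n <= 32 and len(src) >= n
    dst = bytearray(32)
    head = (0, src[0:16])
    tail = (n - 16, src[n - 16:n])
    order = [tail, head] if tail_first else [head, tail]
    for offset, data in order:
        store16(dst, offset, data)
    return dst


src = bytes(range(1, 33))
a = copy_16_to_32(src, 24, tail_first=False)
b = copy_16_to_32(src, 24, tail_first=True)
# Both orders give the same buffer, and the first 24 bytes match the source.
print(a == b, bytes(a[:24]) == src[:24])
```

Since both stores write identical bytes to the overlapping region, either order leaves the same result, which is why reordering on its own does not account for the corruption.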

> I observe these kinds of corruptions:
> - failing to write a few bytes

That could potentially be explained by the reordering/atomicity issues 
Matt mentioned, i.e. the load is observing part of the store, before the 
store has fully completed.

> - writing a few bytes that were written 16 bytes before
> - writing a few bytes that were written 16 bytes after

Those sound more like the interconnect or root complex ignoring the byte 
strobes on an unaligned burst, of which I think the simplistic view 
would be "it's broken".
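
To make the byte-strobe idea concrete, here is a toy model of an unaligned write split into aligned 16-byte beats, each with a per-byte enable mask. Everything in it is an assumption for illustration: the 16-byte beat width, the names `split_into_beats` and `apply_beats`, and in particular the choice that disabled lanes carry stale data from the previous beat. Nothing in this thread confirms what the Armada 8k interconnect or root complex actually does.

```python
BEAT = 16  # assumed beat width in bytes for this toy model


def split_into_beats(addr: int, data: bytes):
    """Split a write into aligned beats of (base, lanes, strobes).

    strobes[i] is True only for lanes that belong to the write.
    ASSUMPTION: lanes without a strobe keep the previous beat's value.
    """
    beats = []
    end = addr + len(data)
    base = addr - addr % BEAT
    prev_lanes = bytes(BEAT)
    while base < end:
        lanes = bytearray(prev_lanes)  # stale data on unused lanes
        strobes = [False] * BEAT
        for lane in range(BEAT):
            a = base + lane
            if addr <= a < end:
                lanes[lane] = data[a - addr]
                strobes[lane] = True
        beats.append((base, bytes(lanes), strobes))
        prev_lanes = bytes(lanes)
        base += BEAT
    return beats


def apply_beats(mem: bytearray, beats, honour_strobes: bool) -> None:
    """Apply beats to memory; a broken target ignores the strobes."""
    for base, lanes, strobes in beats:
        for lane in range(BEAT):
            if strobes[lane] or not honour_strobes:
                mem[base + lane] = lanes[lane]


data = bytes(range(1, 17))
beats = split_into_beats(8, data)  # 16 bytes at an unaligned address

good = bytearray([0xEE] * 48)
apply_beats(good, beats, honour_strobes=True)

bad = bytearray([0xEE] * 48)
apply_beats(bad, beats, honour_strobes=False)

print(good[8:24] == data)      # the intended bytes land correctly
print(bad[24:32] == data[:8])  # bytes from 16 bytes earlier leak past the end
```

In this model, a target that ignores the strobes writes the unused lanes as well, so bytes that belong 16 bytes earlier show up after the end of the write, which resembles the second kind of corruption reported above.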

FWIW I stuck my old Nvidia 7600GT card in my Arm Juno r2 board (2x 
Cortex-A72), built your test program natively with GCC 8.1.1 at -O2, and 
it's still happily flickering pixels in the corner of the console after 
nearly an hour (in parallel with some iperf3 just to ensure plenty of 
PCIe traffic). I would strongly suspect this issue is particular to 
Armada 8k, so it's probably one for the Marvell folks to take a closer 
look at - I believe some previous interconnect issues on those SoCs were 
actually fixable in firmware.

Robin.