Message-ID: <99fff4fe-afa9-f12f-a518-472a9dd1c530@arm.com>
Date: Mon, 6 Aug 2018 13:42:20 +0100
From: Robin Murphy <robin.murphy@....com>
To: Mikulas Patocka <mpatocka@...hat.com>,
Ard Biesheuvel <ard.biesheuvel@...aro.org>
Cc: Thomas Petazzoni <thomas.petazzoni@...e-electrons.com>,
Joao Pinto <Joao.Pinto@...opsys.com>,
linux-pci <linux-pci@...r.kernel.org>,
Jingoo Han <jingoohan1@...il.com>,
Will Deacon <will.deacon@....com>,
Russell King <linux@...linux.org.uk>,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Matt Sealey <neko@...uhatsu.net>,
Catalin Marinas <catalin.marinas@....com>,
linux-arm-kernel <linux-arm-kernel@...ts.infradead.org>
Subject: Re: framebuffer corruption due to overlapping stp instructions on
arm64
On 06/08/18 11:25, Mikulas Patocka wrote:
[...]
>> None of this explains why some transactions fail to make it across
>> entirely. The overlapping writes in question write the same data to
>> the memory locations that are covered by both, and so the ordering in
>> which the transactions are received should not affect the outcome.
>
> You're right that the corruption couldn't be explained just by reordering
> writes. My hypothesis is that the PCIe controller tries to disambiguate
> the overlapping writes, but the disambiguation logic was not tested and it
> is buggy. If there's a barrier between the overlapping writes, the PCIe
> controller won't see any overlapping writes, so it won't trigger the
> faulty disambiguation logic and it works.
>
> Could the ARM engineers look if there's some chicken bit in Cortex-A72
> that could insert barriers between non-cached writes automatically?
I don't think there is, and even if there were I imagine it would have a
pretty hideous effect on non-coherent DMA buffers and the various other
places in which we have Normal-NC mappings of actual system RAM.
> I observe these kinds of corruptions:
> - failing to write a few bytes
That could potentially be explained by the reordering/atomicity issues
Matt mentioned, i.e. the load is observing part of the store, before the
store has fully completed.
> - writing a few bytes that were written 16 bytes before
> - writing a few bytes that were written 16 bytes after
Those sound more like the interconnect or root complex ignoring the byte
strobes on an unaligned burst, of which I think the simplistic view
would be "it's broken".
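To illustrate the point quoted above, that overlapping stores of identical data cannot by themselves produce stale or shifted bytes, here is a minimal host-side sketch in C. It models each 16-byte stp as a memcpy() of one chunk and checks that issuing the head and tail stores in either order gives the same result. The helper names (store16, copy_overlapping, overlap_order_is_irrelevant) are made up for illustration, and the head/tail pattern is only an assumption about how the copy routine under test is structured; it does not touch real hardware or reproduce the bug.

```c
#include <stdint.h>
#include <string.h>

#define CHUNK 16  /* bytes written by one stp of two 64-bit registers */

/* Model a single 16-byte store at byte offset 'off' (hypothetical helper). */
static void store16(uint8_t *dst, const uint8_t *src, size_t off)
{
    memcpy(dst + off, src + off, CHUNK);
}

/*
 * Copy 'len' bytes (16 <= len <= 32) using two 16-byte stores: one at the
 * start and one ending exactly at 'len'. For len < 32 the two stores
 * overlap, but both carry the same source bytes in the overlapping region.
 * 'tail_first' selects the order in which the two stores are issued.
 */
static void copy_overlapping(uint8_t *dst, const uint8_t *src,
                             size_t len, int tail_first)
{
    size_t head = 0;
    size_t tail = len - CHUNK;

    if (tail_first) {
        store16(dst, src, tail);
        store16(dst, src, head);
    } else {
        store16(dst, src, head);
        store16(dst, src, tail);
    }
}

/*
 * Return 1 if, for every length from 16 to 32, both store orders leave the
 * destination identical and equal to the source over 'len' bytes; else 0.
 */
static int overlap_order_is_irrelevant(void)
{
    uint8_t src[2 * CHUNK], a[2 * CHUNK], b[2 * CHUNK];

    for (size_t i = 0; i < sizeof src; i++)
        src[i] = (uint8_t)(i * 7 + 1);

    for (size_t len = CHUNK; len <= 2 * CHUNK; len++) {
        memset(a, 0xAA, sizeof a);   /* filler marks untouched bytes */
        memset(b, 0xAA, sizeof b);
        copy_overlapping(a, src, len, 0);
        copy_overlapping(b, src, len, 1);
        if (memcmp(a, b, sizeof a) != 0 || memcmp(a, src, len) != 0)
            return 0;
    }
    return 1;
}
```

Calling overlap_order_is_irrelevant() from a test harness returns 1, so in this model reordering alone never leaves wrong data behind. Bytes that appear 16 bytes early or late need some other mechanism, such as the strobe handling suggested above.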
FWIW I stuck my old Nvidia 7600GT card in my Arm Juno r2 board (2x
Cortex-A72), built your test program natively with GCC 8.1.1 at -O2, and
it's still happily flickering pixels in the corner of the console after
nearly an hour (in parallel with some iperf3 just to ensure plenty of
PCIe traffic). I would strongly suspect this issue is particular to
Armada 8k, so it's probably one for the Marvell folks to take a closer
look at - I believe some previous interconnect issues on those SoCs were
actually fixable in firmware.
Robin.