[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20191211110049.54a2d6f3@donnerap.cambridge.arm.com>
Date: Wed, 11 Dec 2019 11:00:49 +0000
From: Andre Przywara <andre.przywara@....com>
To: Bjorn Helgaas <helgaas@...nel.org>
Cc: Lorenzo Pieralisi <lorenzo.pieralisi@....com>,
Andrew Murray <andrew.murray@....com>,
"Rafael J . Wysocki" <rjw@...ysocki.net>,
Len Brown <lenb@...nel.org>, Will Deacon <will@...nel.org>,
Catalin Marinas <catalin.marinas@....com>,
linux-arm-kernel@...ts.infradead.org, linux-kernel@...r.kernel.org,
linux-pci@...r.kernel.org, linux-acpi@...r.kernel.org
Subject: Re: [PATCH] pcie: Add quirk for the Arm Neoverse N1SDP platform
On Tue, 10 Dec 2019 08:41:15 -0600
Bjorn Helgaas <helgaas@...nel.org> wrote:
Hi Bjorn,
> On Mon, Dec 09, 2019 at 04:06:38PM +0000, Andre Przywara wrote:
> > From: Deepak Pandey <Deepak.Pandey@....com>
> >
> > The Arm N1SDP SoC suffers from some PCIe integration issues, most
> > prominently config space accesses to not existing BDFs being answered
> > with a bus abort, resulting in an SError.
>
> Can we tease this apart a little more? Linux doesn't program all the
> bits that control error signaling, so even on hardware that works
> perfectly, much of this behavior is determined by what firmware did.
> I wonder if Linux could be more careful about this.
>
> "Bus abort" is not a term used in PCIe.
Yes, sorry, that was my sloppy term, also aiming more at the CPU side of the bus, between the cores and the RC.
> IIUC, a config read to a
> device that doesn't exist should terminate with an Unsupported Request
> completion, e.g., see the implementation note in PCIe r5.0 sec 2.3.1.
Yes, that's what Lorenzo mentioned as well.
> The UR should be an uncorrectable non-fatal error (Table 6-5), and
> Figures 6-2 and 6-3 show how it should be handled and when it should
> be signaled as a system error. In case you don't have a copy of the
> spec, I extracted those two figures and put them at [1].
Thanks for that.
So in the last few months we tossed several ideas around how to work-around this without kernel intervention, all of them turned out to be not working. There are indeed registers in the RC that influence error reporting to the CPU side, but even if we could suppress (or catch) the SError, we can't recover and fixup the read transaction to the CPU. Even Lorenzo gave up on this ;-) As far as I understood this, there are gates missing which are supposed to translate this specific UR into a valid "all-1s" response.
> Can you collect "lspci -vvxxx" output to see if we can correlate it
> with those figures and the behavior you see?
Attached.
>
> [1] https://drive.google.com/file/d/1ihhdQvr0a7ZEJG-3gPddw1Tq7cTFAsah/view?usp=sharing
>
> > To mitigate this, the firmware scans the bus before boot (catching the
> > SErrors) and creates a table with valid BDFs, which acts as a filter for
> > Linux' config space accesses.
> >
> > Add code consulting the table as an ACPI PCIe quirk, also register the
> > corresponding device tree based description of the host controller.
> > Also fix the other two minor issues on the way, namely not being fully
> > ECAM compliant and config space accesses being restricted to 32-bit
> > accesses only.
>
> As I'm sure you've noticed, controllers that support only 32-bit
> config writes are not spec compliant and devices may not work
> correctly. The comment in pci_generic_config_write32() explains why.
Yes, I understand. Actually it seems to work fine in my experiments without that part of the patch, but the hardware guys insisted that it's needed. I will double check.
> You may not trip over this problem frequently, but I wouldn't call it
> a "minor" issue because when you *do* trip over it, you have no
> indication that a register was corrupted.
>
> Even ECAM compliance is not really minor -- if this controller were
> fully compliant with the spec, you would need ZERO Linux changes to
> support it. Every quirk like this means additional maintenance
> burden, and it's not just a one-time thing. It means old kernels that
> *should* "just work" on your system will not work unless somebody
> backports the quirk.
I am well aware of that, and we had quite some discussions internally, with quite some opposition.
The point is that this board has virtually everything behind PCI (which is good!), so not having working PCI renders this virtually useless. And installing Linux brings back warm and fuzzy memories of 1990's boot floppies - just without the actual disks ;-)
So anything that improves this situation in any way is helpful.
On the technical side this is a quirk, in the ACPI quirk framework(!), so it shouldn't affect anyone without the magic ACPI ID strings. People could even compile it out if needed. Plus the actual code is contained in a single, new file.
So, yes, I see that it's not pretty and we should not have broken hardware in the first place, but this is probably as good as this will get.
Cheers,
Andre
> > This allows the Arm Neoverse N1SDP board to boot Linux without crashing
> > and to access *any* devices (there are no platform devices except UART).
View attachment "lspci_n1sdp.txt" of type "text/plain" (69934 bytes)
Powered by blists - more mailing lists