Message-ID: <ZpaJaM1G721FdLFn@hovoldconsulting.com>
Date: Tue, 16 Jul 2024 16:53:28 +0200
From: Johan Hovold <johan@...nel.org>
To: Marc Zyngier <maz@...nel.org>
Cc: Thomas Gleixner <tglx@...utronix.de>,
LKML <linux-kernel@...r.kernel.org>,
linux-arm-kernel@...ts.infradead.org, linux-pci@...r.kernel.org,
anna-maria@...utronix.de, shawnguo@...nel.org,
s.hauer@...gutronix.de, festevam@...il.com, bhelgaas@...gle.com,
rdunlap@...radead.org, vidyas@...dia.com,
ilpo.jarvinen@...ux.intel.com, apatel@...tanamicro.com,
kevin.tian@...el.com, nipun.gupta@....com, den@...inux.co.jp,
andrew@...n.ch, gregory.clement@...tlin.com,
sebastian.hesselbarth@...il.com, gregkh@...uxfoundation.org,
rafael@...nel.org, alex.williamson@...hat.com, will@...nel.org,
lorenzo.pieralisi@....com, jgg@...lanox.com,
ammarfaizi2@...weeb.org, robin.murphy@....com,
lpieralisi@...nel.org, nm@...com, kristo@...nel.org,
vkoul@...nel.org, okaya@...nel.org, agross@...nel.org,
andersson@...nel.org, mark.rutland@....com,
shameerali.kolothum.thodi@...wei.com, yuzenghui@...wei.com,
shivamurthy.shastri@...utronix.de
Subject: Re: [patch V4 00/21] genirq, irqchip: Convert ARM MSI handling to
per device MSI domains
On Tue, Jul 16, 2024 at 11:30:05AM +0100, Marc Zyngier wrote:
> On Mon, 15 Jul 2024 15:10:01 +0100,
> Johan Hovold <johan@...nel.org> wrote:
> > On Mon, Jul 15, 2024 at 01:58:13PM +0100, Marc Zyngier wrote:
> > > On Mon, 15 Jul 2024 12:18:47 +0100,
> > > Johan Hovold <johan@...nel.org> wrote:
> > > > On Sun, Jun 23, 2024 at 05:18:31PM +0200, Thomas Gleixner wrote:
> > > > > This is version 4 of the series to convert ARM MSI handling over to
> > > > > per device MSI domains.
> >
> > > > This series only showed up in linux-next last Friday and broke interrupt
> > > > handling on Qualcomm platforms like sc8280xp (e.g. Lenovo ThinkPad X13s)
> > > > and x1e80100 that use the GIC ITS for PCIe MSIs.
> > > >
> > > > I've applied the series (21 commits from linux-next) on top of 6.10 and
> > > > can confirm that the breakage is caused by commits:
> > > >
> > > > 3d1c927c08fc ("irqchip/gic-v3-its: Switch platform MSI to MSI parent")
> > > > 233db05bc37f ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")
> > > >
> > > > Applying the series up until the change before 3d1c927c08fc unbreaks the
> > > > wifi on one machine:
> > > >
> > > > ath11k_pci 0006:01:00.0: failed to enable msi: -22
> > > > ath11k_pci 0006:01:00.0: probe with driver ath11k_pci failed with error -22

Correction: this doesn't fix the wifi, but I'm not seeing these errors
with the commit before cc23d1dfc959 as the ath11k driver doesn't get
this far (or doesn't probe at all).

> > > > and backing up until the commit before 233db05bc37f makes the NVMe come
> > > > up again during boot on another.
> > > >
> > > > I have not tried to debug this further.
> > >
> > > I need a few things from you though, because you're not giving much to
> > > help you (and I'm travelling, which doesn't help).
> >
> > Yeah, this was just an early heads up.
> >
> > > Can you at least investigate what in ath11k_pci_alloc_msi() causes the
> > > wifi driver to be upset? Does it normally use a single MSI vector or
> > > MSI-X? How about your nVME device?
> >
> > It uses multiple vectors, but now it falls back to trying to allocate a
> > single one and even that fails with -ENOSPC:
> >
> > ath11k_pci 0006:01:00.0: ath11k_pci_alloc_msi - requesting one vector failed: -28
> >
> > Similar for the NVMe, it uses multiple vectors normally, but now only
> > the AER interrupts appears to be allocated for each controller and there
> > is a GICv3 interrupt for the NVMe:
> >
> > 208: 0 0 0 0 0 0 0 0 ITS-PCI-MSI-0006:00:00.0 0 Edge PCIe PME, aerdrv
> > 212: 0 0 0 0 0 0 0 0 ITS-PCI-MSI-0004:00:00.0 0 Edge PCIe PME, aerdrv
> > 214: 161 0 0 0 0 0 0 0 GICv3 562 Level nvme0q0, nvme0q1
> > 215: 0 0 0 0 0 0 0 0 ITS-PCI-MSI-0002:00:00.0 0 Edge PCIe PME, aerdrv
> >
>
> That's an indication of the driver having failed its MSI allocation
> and gone back to INTx signalling.
>
> > Next boot, after disabling PCIe controller async probing, it's an MSI-X?!:
> >
> > 201: 0 0 0 0 0 0 0 0 ITS-PCI-MSI-0006:00:00.0 0 Edge PCIe PME, aerdrv
> > 203: 0 0 0 0 0 0 0 0 ITS-PCI-MSI-0004:00:00.0 0 Edge PCIe PME, aerdrv
> > 205: 0 0 0 0 0 0 0 0 ITS-PCI-MSI-0002:00:00.0 0 Edge PCIe PME, aerdrv
> > 206: 0 0 0 0 0 0 0 0 ITS-PCI-MSIX-0002:01:00.0 0 Edge nvme0q0
> >
>
> So is this issue actually tied to the async probing? Does it always
> work if you disable it?

There seem to be multiple issues here.

With the full series applied and normal async (i.e. parallel) probing of
the PCIe controllers I sometimes see allocation failing with -ENOSPC
(e.g. the above ath11k errors). This seems to indicate broken locking
somewhere.

With synchronous probing, allocation always seems to succeed but the
ath11k (and modem) drivers time out as no interrupts are received.

The NVMe driver sometimes falls back to INTx signalling and can access
the drive, but often ends up with an MSI-X (?!) allocation and then fails
to probe:

[ 132.084740] nvme nvme0: I/O tag 17 (1011) QID 0 timeout, completion polled

Johan