Message-ID: <ZpaJaM1G721FdLFn@hovoldconsulting.com>
Date: Tue, 16 Jul 2024 16:53:28 +0200
From: Johan Hovold <johan@...nel.org>
To: Marc Zyngier <maz@...nel.org>
Cc: Thomas Gleixner <tglx@...utronix.de>,
	LKML <linux-kernel@...r.kernel.org>,
	linux-arm-kernel@...ts.infradead.org, linux-pci@...r.kernel.org,
	anna-maria@...utronix.de, shawnguo@...nel.org,
	s.hauer@...gutronix.de, festevam@...il.com, bhelgaas@...gle.com,
	rdunlap@...radead.org, vidyas@...dia.com,
	ilpo.jarvinen@...ux.intel.com, apatel@...tanamicro.com,
	kevin.tian@...el.com, nipun.gupta@....com, den@...inux.co.jp,
	andrew@...n.ch, gregory.clement@...tlin.com,
	sebastian.hesselbarth@...il.com, gregkh@...uxfoundation.org,
	rafael@...nel.org, alex.williamson@...hat.com, will@...nel.org,
	lorenzo.pieralisi@....com, jgg@...lanox.com,
	ammarfaizi2@...weeb.org, robin.murphy@....com,
	lpieralisi@...nel.org, nm@...com, kristo@...nel.org,
	vkoul@...nel.org, okaya@...nel.org, agross@...nel.org,
	andersson@...nel.org, mark.rutland@....com,
	shameerali.kolothum.thodi@...wei.com, yuzenghui@...wei.com,
	shivamurthy.shastri@...utronix.de
Subject: Re: [patch V4 00/21] genirq, irqchip: Convert ARM MSI handling to
 per device MSI domains

On Tue, Jul 16, 2024 at 11:30:05AM +0100, Marc Zyngier wrote:
> On Mon, 15 Jul 2024 15:10:01 +0100,
> Johan Hovold <johan@...nel.org> wrote:
> > On Mon, Jul 15, 2024 at 01:58:13PM +0100, Marc Zyngier wrote:
> > > On Mon, 15 Jul 2024 12:18:47 +0100,
> > > Johan Hovold <johan@...nel.org> wrote:
> > > > On Sun, Jun 23, 2024 at 05:18:31PM +0200, Thomas Gleixner wrote:
> > > > > This is version 4 of the series to convert ARM MSI handling over to
> > > > > per device MSI domains.
> > 
> > > > This series only showed up in linux-next last Friday and broke interrupt
> > > > handling on Qualcomm platforms like sc8280xp (e.g. Lenovo ThinkPad X13s)
> > > > and x1e80100 that use the GIC ITS for PCIe MSIs.
> > > > 
> > > > I've applied the series (21 commits from linux-next) on top of 6.10 and
> > > > can confirm that the breakage is caused by commits:
> > > > 
> > > > 	3d1c927c08fc ("irqchip/gic-v3-its: Switch platform MSI to MSI parent")
> > > > 	233db05bc37f ("irqchip/gic-v3-its: Provide MSI parent for PCI/MSI[-X]")
> > > > 
> > > > Applying the series up until the change before 3d1c927c08fc unbreaks the
> > > > wifi on one machine:
> > > > 
> > > > 	ath11k_pci 0006:01:00.0: failed to enable msi: -22
> > > > 	ath11k_pci 0006:01:00.0: probe with driver ath11k_pci failed with error -22

Correction, this doesn't fix the wifi, but I'm not seeing these errors
with the commit before cc23d1dfc959 as the ath11k driver doesn't get
this far (or doesn't probe at all).

> > > > and backing up until the commit before 233db05bc37f makes the NVMe come
> > > > up again during boot on another.
> > > > 
> > > > I have not tried to debug this further.
> > > 
> > > I need a few things from you though, because you're not giving much to
> > > help you (and I'm travelling, which doesn't help).
> > 
> > Yeah, this was just an early heads up.
> > 
> > > Can you at least investigate what in ath11k_pci_alloc_msi() causes the
> > > wifi driver to be upset? Does it normally use a single MSI vector or
> > > MSI-X? How about your nVME device?
> > 
> > It uses multiple vectors, but now it falls back to trying to allocate a
> > single one and even that fails with -ENOSPC:
> > 
> > 	ath11k_pci 0006:01:00.0: ath11k_pci_alloc_msi - requesting one vector failed: -28
> > 
> > Similar for the NVMe, it uses multiple vectors normally, but now only
> > the AER interrupts appear to be allocated for each controller and there
> > is a GICv3 interrupt for the NVMe:
> > 
> > 208:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0006:00:00.0   0 Edge      PCIe PME, aerdrv
> > 212:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0004:00:00.0   0 Edge      PCIe PME, aerdrv
> > 214:        161          0          0          0          0          0          0          0     GICv3 562 Level     nvme0q0, nvme0q1
> > 215:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0002:00:00.0   0 Edge      PCIe PME, aerdrv
> >
> 
> That's an indication of the driver having failed its MSI allocation
> and gone back to INTx signalling.
> 
> > Next boot, after disabling PCIe controller async probing, it's an MSI-X?!:
> > 
> > 201:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0006:00:00.0   0 Edge      PCIe PME, aerdrv
> > 203:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0004:00:00.0   0 Edge      PCIe PME, aerdrv
> > 205:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0002:00:00.0   0 Edge      PCIe PME, aerdrv
> > 206:          0          0          0          0          0          0          0          0  ITS-PCI-MSIX-0002:01:00.0   0 Edge      nvme0q0
> >
> 
> So is this issue actually tied to the async probing? Does it always
> work if you disable it?

There seem to be multiple issues here.

With the full series applied and normal async (i.e. parallel) probing of
the PCIe controllers I sometimes see allocation failing with -ENOSPC
(e.g. the above ath11k errors). This seems to indicate broken locking
somewhere.
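
As an aside, the numeric codes in the quoted logs (-22, -28) can be mapped
to errno names with a short Python sketch. It assumes the Linux errno
numbering, and the helper name errno_name is purely illustrative:

```python
import errno
import os


def errno_name(code):
    """Map a negative kernel-style return code (e.g. -28) to its errno name.

    Hypothetical helper for illustration; it relies on the errno table of
    the platform Python runs on, assumed here to match Linux.
    """
    return errno.errorcode[abs(code)]


if __name__ == "__main__":
    # Codes taken from the ath11k_pci log lines quoted in this thread.
    for code in (-22, -28):
        print(code, errno_name(code), os.strerror(abs(code)))
```

With that numbering, -22 is EINVAL (the "failed to enable msi" error) and
-28 is ENOSPC (the single-vector allocation failure).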

With synchronous probing, allocation always seems to succeed but the
ath11k (and modem) drivers time out as no interrupts are received.

The NVMe driver sometimes falls back to INTx signalling and can access
the drive, but often ends up with an MSI-X (?!) allocation and then fails
to probe:

	[  132.084740] nvme nvme0: I/O tag 17 (1011) QID 0 timeout, completion polled
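
To compare boots more easily, the /proc/interrupts excerpts quoted above
can be summarised per interrupt with a rough Python sketch. The function
names and the classification rule are assumptions made for illustration:
a chip name starting with ITS-PCI-MSIX is treated as MSI-X, ITS-PCI-MSI as
MSI, and anything else (e.g. GICv3) as a wired, non-MSI interrupt, which
is consistent with the INTx fallback reading above but is not a general
rule for every platform.

```python
# Sample lines copied from the excerpts quoted in this thread.
SAMPLE = """\
208:          0          0          0          0          0          0          0          0  ITS-PCI-MSI-0006:00:00.0   0 Edge      PCIe PME, aerdrv
214:        161          0          0          0          0          0          0          0     GICv3 562 Level     nvme0q0, nvme0q1
206:          0          0          0          0          0          0          0          0  ITS-PCI-MSIX-0002:01:00.0   0 Edge      nvme0q0
"""


def classify(chip):
    """Guess the signalling type from the irqchip name (assumed naming)."""
    if chip.startswith("ITS-PCI-MSIX"):
        return "MSI-X"
    if chip.startswith("ITS-PCI-MSI"):
        return "MSI"
    return "wired"


def parse_interrupts(text):
    """Return (irq, chip, kind, actions) tuples for numbered IRQ lines."""
    rows = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or not tokens[0].endswith(":"):
            continue  # header line or blank line
        irq = tokens[0].rstrip(":")
        if not irq.isdigit():
            continue  # skip summary rows such as IPI or error counters
        i = 1
        # Skip the per-CPU counters, which are plain integers.
        while i < len(tokens) and tokens[i].isdigit():
            i += 1
        if len(tokens) < i + 3:
            continue  # no chip / hwirq / trigger fields present
        chip = tokens[i]
        # tokens[i + 1] is the hwirq, tokens[i + 2] the trigger type.
        actions = " ".join(tokens[i + 3:]).split(", ")
        rows.append((irq, chip, classify(chip), actions))
    return rows


if __name__ == "__main__":
    for irq, chip, kind, actions in parse_interrupts(SAMPLE):
        print(f"{irq:>4}  {kind:<6} {chip:<28} {', '.join(actions)}")
```

Running the same function on the output of `cat /proc/interrupts` from
each boot would make it easier to see when nvme0 lands on the GICv3 line
rather than on an ITS-PCI-MSIX one.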

Johan
