Message-ID: <86le1z3nak.wl-maz@kernel.org>
Date: Wed, 17 Jul 2024 19:07:15 +0100
From: Marc Zyngier <maz@...nel.org>
To: Johan Hovold <johan@...nel.org>
Cc: Thomas Gleixner <tglx@...utronix.de>,
LKML <linux-kernel@...r.kernel.org>,
linux-arm-kernel@...ts.infradead.org,
linux-pci@...r.kernel.org,
anna-maria@...utronix.de,
shawnguo@...nel.org,
s.hauer@...gutronix.de,
festevam@...il.com,
bhelgaas@...gle.com,
rdunlap@...radead.org,
vidyas@...dia.com,
ilpo.jarvinen@...ux.intel.com,
apatel@...tanamicro.com,
kevin.tian@...el.com,
nipun.gupta@....com,
den@...inux.co.jp,
andrew@...n.ch,
gregory.clement@...tlin.com,
sebastian.hesselbarth@...il.com,
gregkh@...uxfoundation.org,
rafael@...nel.org,
alex.williamson@...hat.com,
will@...nel.org,
lorenzo.pieralisi@....com,
jgg@...lanox.com,
ammarfaizi2@...weeb.org,
robin.murphy@....com,
lpieralisi@...nel.org,
nm@...com,
kristo@...nel.org,
vkoul@...nel.org,
okaya@...nel.org,
agross@...nel.org,
andersson@...nel.org,
mark.rutland@....com,
shameerali.kolothum.thodi@...wei.com,
yuzenghui@...wei.com
Subject: Re: [patch V4 00/21] genirq, irqchip: Convert ARM MSI handling to per device MSI domains
On Wed, 17 Jul 2024 14:38:59 +0100,
Johan Hovold <johan@...nel.org> wrote:
>
> On Wed, Jul 17, 2024 at 01:54:40PM +0100, Marc Zyngier wrote:
> > On Wed, 17 Jul 2024 08:23:39 +0100,
> > Johan Hovold <johan@...nel.org> wrote:
>
> > > I believe there is a kernel parameter for this (e.g.
> > > module.async_probe), but I just disable async probing for the Qualcomm
> > > PCIe driver I'm using:
> >
> > I had tried this module parameter, but it didn't change anything on my
> > end.
>
> > I'll have a look whether the TX1 PCIe driver uses this. It's
> > positively ancient, so I wouldn't bet that it has been touched
> > significantly in the past 5 years.
>
> Perhaps async probing just changes the symptoms; the NVMe and wifi
> don't work in either case.

Yeah, my impression is that this changes the order in which LPIs get
allocated, but the core symptom is the same.

>
> > > [ 8.692011] Reusing ITT for devID 0
> > > [ 8.693668] Reusing ITT for devID 0
> >
> > This is really odd. It indicates that you have several devices sharing
> > the same DeviceID, which I seriously doubt is the case in a
> > laptop. Do you have any non-transparent bridge here? lspci would help.
>
> Yeah, and these messages do not show up without the series (see log
> below). They are there in the previous synchronous log however.
>
> 0002:00:00.0 PCI bridge: Qualcomm Technologies, Inc SC8280XP PCI Express Root Port
> 0002:01:00.0 Non-Volatile memory controller: KIOXIA Corporation NVMe SSD Controller BG4 (DRAM-less)
> 0004:00:00.0 PCI bridge: Qualcomm Technologies, Inc SC8280XP PCI Express Root Port
> 0004:01:00.0 Unassigned class [ff00]: Qualcomm Technologies, Inc SDX55 [Snapdragon X55 5G]
> 0006:00:00.0 PCI bridge: Qualcomm Technologies, Inc SC8280XP PCI Express Root Port
> 0006:01:00.0 Network controller: Qualcomm Technologies, Inc QCNFA765 Wireless Network Adapter (rev 01)

Right, this is a very straightforward setup, Design-crap-ware-style.
Nothing that would alias any device.

>
> > I'm starting to suspect that the new code doesn't carry all the
> > required bits for the DevID, and that we end up trying to allocate
> > interrupts from the pool allocated to another device, which can never
> > be a good thing, and would explain why everything dies a painful
> > death.
> >
> > Can you run the same trace with the whole thing reverted? I think
> > we're on something here.
>
> See below, using normal asynchronous probing like the previous log.

And as expected, no aliasing showing up in this log. Somehow, we're
not able to distinguish between the different PCI domains anymore,
leading to all sorts of funnies.
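
To illustrate the kind of aliasing suspected here: a PCI requester ID
encodes only bus, device and function, so the three endpoints in the
lspci output above (0002:01:00.0, 0004:01:00.0, 0006:01:00.0) all yield
the same value once the domain is dropped, and the three root ports all
yield 0. The sketch below is not the ITS driver's code. The struct and
both helpers are names made up for this example, and
domain_qualified_id() is a hypothetical scheme; on real systems the
DeviceID comes from whatever mapping the firmware describes.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative PCI address: domain:bus:dev.fn */
struct pci_addr {
	uint32_t domain, bus, dev, fn;
};

/* Requester ID only: bus[15:8], device[7:3], function[2:0].
 * The PCI domain (segment) is not part of it. */
static uint32_t rid_only(struct pci_addr a)
{
	assert(a.bus < 256 && a.dev < 32 && a.fn < 8);
	return (a.bus << 8) | (a.dev << 3) | a.fn;
}

/* HYPOTHETICAL domain-qualified ID, for illustration only.
 * Real hardware derives the upper bits from a firmware-described
 * mapping, not from a fixed shift of the domain number. */
static uint32_t domain_qualified_id(struct pci_addr a)
{
	return (a.domain << 16) | rid_only(a);
}
```

With rid_only(), the NVMe at 0002:01:00.0 and the wifi at 0006:01:00.0
both map to 0x100, and every root port at xxxx:00:00.0 maps to 0. Any
scheme that keeps some per-domain information, such as the hypothetical
one above, keeps them apart.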

For the record, I've added some extra debug in the ITS driver and ran
the result on TX1, old and new kernels.

Before this series:

[ 10.139806] nvme nvme0: pci function 0006:58:00.0
[ 10.158599] nvme 0006:58:00.0: devid = 35800

With this series:

[ 10.143729] nvme nvme0: pci function 0006:58:00.0
[ 10.181775] nvme 0006:58:00.0: devid = 5800

Clearly, we've lost something in the battle. I'll keep digging.
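
A quick sanity check of the two values, assuming the debug print is
hexadecimal: the requester ID of 58:00.0 is exactly 0x5800, so the new
kernel appears to be using the bare RID, while the old value carries an
extra 0x30000 in the bits above it. Where those upper bits come from is
not stated in this mail; presumably it is the firmware-described
mapping, but that is an assumption. Both helper names below are made up
for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Requester ID as defined by PCI: bus[15:8], device[7:3], function[2:0]. */
static uint32_t pci_rid(uint32_t bus, uint32_t dev, uint32_t fn)
{
	assert(bus < 256 && dev < 32 && fn < 8);
	return (bus << 8) | (dev << 3) | fn;
}

/* Bits of a DeviceID that sit above the 16-bit requester ID. */
static uint32_t devid_upper_bits(uint32_t devid)
{
	return devid & ~UINT32_C(0xffff);
}
```

pci_rid(0x58, 0, 0) gives 0x5800, devid_upper_bits(0x35800) gives
0x30000 and devid_upper_bits(0x5800) gives 0, which is consistent with
the upper part of the DeviceID having gone missing.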
M.
--
Without deviation from the norm, progress is not possible.