Message-ID: <39e1536d956bfe061a4da7446c41a1b21eac0b37.camel@linux.ibm.com>
Date: Thu, 18 Dec 2025 13:02:09 +0100
From: Niklas Schnelle <schnelle@...ux.ibm.com>
To: Huacai Chen <chenhuacai@...nel.org>
Cc: Tianrui Zhao <zhaotianrui@...ngson.cn>,
 Bibo Mao <maobibo@...ngson.cn>,
 Bjorn Helgaas <bhelgaas@...gle.com>,
 Jan Kiszka <jan.kiszka@...mens.com>,
 linux-s390 <linux-s390@...r.kernel.org>,
 loongarch@...ts.linux.dev,
 Farhan Ali <alifm@...ux.ibm.com>,
 Matthew Rosato <mjrosato@...ux.ibm.com>,
 Gerald Schaefer <gerald.schaefer@...ux.ibm.com>,
 Heiko Carstens <hca@...ux.ibm.com>,
 Vasily Gorbik <gor@...ux.ibm.com>,
 Alexander Gordeev <agordeev@...ux.ibm.com>,
 Sven Schnelle <svens@...ux.ibm.com>,
 Christian Borntraeger <borntraeger@...ux.ibm.com>,
 Gerd Bayer <gbayer@...ux.ibm.com>,
 linux-kernel@...r.kernel.org,
 linux-pci@...r.kernel.org
Subject: Re: [PATCH v5 1/2] PCI: Fix isolated PCI function probing with ARI and SR-IOV
On Wed, 2025-12-17 at 14:55 +0800, Huacai Chen wrote:
> On Thu, Dec 4, 2025 at 5:45 AM Niklas Schnelle <schnelle@...ux.ibm.com> wrote:
> >
> > On Mon, 2025-12-01 at 22:45 +0800, Huacai Chen wrote:
> > >
> > --- snip ---
> > > You said that "it feels like this is just a hack to probe an odd
> > > topology". Yes, to some extent you are right.
> > >
> > > 1, One of our SoC (LS2K3000) has a special device which has func1 but
> > > without func0. To let the PCI core scan func1 we can only make
> > > hypervisor_isolated_pci_functions() return true.
> > > 2, In the above case, PCI_SCAN_ALL_PCIE_DEVS has no help.
> > > 3, Though we change hypervisor_isolated_pci_functions() to resolve the
> > > above problem, it also lets us pass isolated PCI functions to a guest
> > > OS instance.
> > >
> > > As a summary, for real machines commit a02fd05661d73a850 is a hack to
> > > probe an odd device, for virtual machines it allows passing isolated
> > > PCI functions.
> >
> > Ok, thanks for the answer. So let's see how we can debug this and get
> > to a solution that works for both of us. Looking around a bit I see
> > that your pci_loongson_map_bus() has some special handling for trying
> > not to access non-existent devices added by your commit 2410e3301fcc
> > ("PCI: loongson: Don't access non-existent devices"). I wonder if with
> > this patch applied we're running into this same issue but with a devfn
> > that was previously not tried and is not covered by your checks? And
> > maybe since your root complex doesn't return 0xff for these non-
> > existent devices we could end up trying to probe AHCI on such an empty
> > slot misinterpreting whatever it returns as matching device/vendor?
> Commit 2410e3301fcc seems to have no relationship with current problems.
I'm not so sure. The only thing this patch is supposed to change is
which devfns get enumerated, and thus which config space gets accessed
when looking for a device. That commit talks about accesses to
non-existent devices causing a system hang, so it does seem fitting in
principle.
> >
--- snip ---
> > Could you try redoing the test with the AHCI hang but add a print of
> > the affected bus/device/function that AHCI thinks it is probing? Then
> > if the above theory applies we should see it trying to probe on a
> > device that is missing in the correctly booted case and got past your
> > existing checks.
> By redoing this test we found there is only one AHCI detected, and the
> BDF is: bus=0, device=8, fun=0.
>
> With or without this patch, only one AHCI. But without this patch, the
> AHCI initialization doesn't hang.
>
This is all very odd. Just so there is no chance of misunderstanding:
did you check the BDF that the ahci driver is trying to probe directly?
I.e. with something like what I added as the top commit here:
https://git.kernel.org/pub/scm/linux/kernel/git/niks/linux.git/log/?h=loongarch_debug
I ask because with the AHCI controller at devfn 08.0 and
dev->multifunction likely not set, this patch would make a difference:
it would try to enumerate 08.1 and so on, while without this patch
these would be skipped because of the dev && !dev->multifunction
condition, even though isolated function probing should look at all
functions. So I was thinking that maybe this causes us to end up trying
to probe an AHCI controller where there is none.
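To make the difference concrete, here is a minimal userspace model of
the per-slot function scan as described above. It is only a sketch, not
the kernel code: struct slot_model, model_next_fn() and
model_scan_slot() are hypothetical names, and the assumption that the
patch bypasses the multifunction check whenever isolated function
probing is active is taken solely from the description in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one device slot; not a kernel structure. */
struct slot_model {
	uint8_t present_mask;	/* bit n set: function n exists */
	bool multifunction;	/* multifunction bit of the present functions */
};

/*
 * Return the next function number to try, or -1 to stop scanning.
 * skip_mf_check models the assumed patched behaviour with isolated
 * function probing active.
 */
static int model_next_fn(bool dev_found, bool multifunction, int fn,
			 bool skip_mf_check)
{
	if (fn >= 7)
		return -1;
	/* Models the "dev && !dev->multifunction" condition. */
	if (dev_found && !multifunction && !skip_mf_check)
		return -1;
	return fn + 1;
}

/* Return a bitmask of the functions whose config space gets read. */
static uint8_t model_scan_slot(const struct slot_model *slot,
			       bool isolated, bool patched)
{
	uint8_t accessed = 0;
	int fn = 0;

	assert(slot);
	do {
		bool found = slot->present_mask & (1u << fn);

		accessed |= (uint8_t)(1u << fn);
		/* Function 0 is required unless functions may be isolated. */
		if (!found && fn == 0 && !isolated)
			break;
		fn = model_next_fn(found, slot->multifunction, fn,
				   isolated && patched);
	} while (fn >= 0);

	return accessed;
}
```

In this model a slot with only a non-multifunction function 0, like
the AHCI controller at 08.0, has just function 0 read in the unpatched
case, while the patched case with isolated probing reads all eight
functions 08.0 through 08.7.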
Another thing I could imagine, especially with commit 2410e3301fcc
("PCI: loongson: Don't access non-existent devices") in mind, is that
accessing the device/vendor config space of some non-existent devices
leaves your PCIe controller in a bad state, and then the MMIO accesses
for the AHCI enable get lost or something similar. Maybe you could add
debug code in the relevant parts of
drivers/pci/controller/pci-loongson.c to check which devices get
accessed with this patch vs. without it? Would it help if I provided a
debug patch for that? Though I really don't know which part is relevant
for the system you're seeing the problem on.
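For comparing the two boots, one option is to collect the devfns seen
in the config accessor for each run and then list those that only show
up with the patch. The following is a rough userspace sketch of that
comparison, not a debug patch for pci-loongson.c: DEVFN(), devfn_str(),
struct devfn_set and its helpers are hypothetical names, and only the
usual devfn encoding (device number in the upper five bits, function
number in the lower three) is relied upon.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Usual devfn encoding: device in bits 7..3, function in bits 2..0. */
#define DEVFN(slot, fn)	((uint8_t)(((slot) << 3) | ((fn) & 7)))

/* Format devfn as "DD.F" as used in this thread, e.g. 08.1. */
static void devfn_str(uint8_t devfn, char buf[5])
{
	snprintf(buf, 5, "%02x.%x", (unsigned int)(devfn >> 3),
		 (unsigned int)(devfn & 7));
}

/* One bit per devfn on a single bus (256 possible values). */
struct devfn_set {
	uint32_t bits[8];
};

static void devfn_set_add(struct devfn_set *s, uint8_t devfn)
{
	assert(s);
	s->bits[devfn / 32] |= 1u << (devfn % 32);
}

static int devfn_set_has(const struct devfn_set *s, uint8_t devfn)
{
	return (s->bits[devfn / 32] >> (devfn % 32)) & 1u;
}

/* Store devfns present in 'with' but not in 'without'; return the count. */
static int devfn_only_in(const struct devfn_set *with,
			 const struct devfn_set *without, uint8_t out[256])
{
	int n = 0;

	for (int devfn = 0; devfn < 256; devfn++) {
		if (devfn_set_has(with, (uint8_t)devfn) &&
		    !devfn_set_has(without, (uint8_t)devfn))
			out[n++] = (uint8_t)devfn;
	}
	return n;
}
```

Feeding the devfns logged with and without the patch into two sets and
printing the result of devfn_only_in() via devfn_str() would give the
list of additionally accessed functions on that bus.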
Thanks,
Niklas