Message-ID: <39e1536d956bfe061a4da7446c41a1b21eac0b37.camel@linux.ibm.com>
Date: Thu, 18 Dec 2025 13:02:09 +0100
From: Niklas Schnelle <schnelle@...ux.ibm.com>
To: Huacai Chen <chenhuacai@...nel.org>
Cc: Tianrui Zhao <zhaotianrui@...ngson.cn>, Bibo Mao <maobibo@...ngson.cn>,
        Bjorn Helgaas <bhelgaas@...gle.com>,
        Jan Kiszka <jan.kiszka@...mens.com>,
        linux-s390 <linux-s390@...r.kernel.org>, loongarch@...ts.linux.dev,
        Farhan Ali <alifm@...ux.ibm.com>,
        Matthew Rosato <mjrosato@...ux.ibm.com>,
        Gerald Schaefer <gerald.schaefer@...ux.ibm.com>,
        Heiko Carstens <hca@...ux.ibm.com>, Vasily Gorbik <gor@...ux.ibm.com>,
        Alexander Gordeev <agordeev@...ux.ibm.com>,
        Sven Schnelle <svens@...ux.ibm.com>,
        Christian Borntraeger <borntraeger@...ux.ibm.com>,
        Gerd Bayer <gbayer@...ux.ibm.com>, linux-kernel@...r.kernel.org,
        linux-pci@...r.kernel.org
Subject: Re: [PATCH v5 1/2] PCI: Fix isolated PCI function probing with ARI and SR-IOV

On Wed, 2025-12-17 at 14:55 +0800, Huacai Chen wrote:
> On Thu, Dec 4, 2025 at 5:45 AM Niklas Schnelle <schnelle@...ux.ibm.com> wrote:
> > 
> > On Mon, 2025-12-01 at 22:45 +0800, Huacai Chen wrote:
> > > 
> > --- snip ---
> > > You said that "it feels like this is just a hack to probe an odd
> > > topology". Yes, to some extent you are right.
> > > 
> > > 1, One of our SoC (LS2K3000) has a special device which has func1 but
> > > without func0. To let the PCI core scan func1 we can only make
> > > hypervisor_isolated_pci_functions() return true.
> > > 2, In the above case, PCI_SCAN_ALL_PCIE_DEVS has no help.
> > > 3, Though we change hypervisor_isolated_pci_functions() to resolve the
> > > above problem, it also lets us pass isolated PCI functions to a guest
> > > OS instance.
> > > 
> > > As a summary, for real machines commit a02fd05661d73a850 is a hack to
> > > probe an odd device, for virtual machines it allows passing isolated
> > > PCI functions.
> > 
> > Ok, thanks for the answer. So let's see how we can debug this and get
> > to a solution that works for both of us. Looking around a bit I see
> > that your pci_loongson_map_bus() has some special handling for trying
> > not to access non-existent devices added by your commit 2410e3301fcc
> > ("PCI: loongson: Don't access non-existent devices"). I wonder if with
> > this patch applied we're running into this same issue but with a devfn
> > that was previously not tried and is not covered by your checks? And
> > maybe since your root complex doesn't return 0xff for these non-
> > existent devices we could end up trying to probe AHCI on such an empty
> > slot misinterpreting whatever it returns as matching device/vendor?
> Commit 2410e3301fcc seems to have no relationship with current problems.

I'm not so sure. The only thing this patch is supposed to change is
which devfns get enumerated, and thus which config space gets accessed
when looking for a device. That commit talks about accesses to
non-existent devices causing a system hang, so it does seem to fit in
principle.

> > 
--- snip ---
> > Could you try redoing the test with the AHCI hang but add a print of
> > the affected bus/device/function that AHCI thinks it is probing? Then
> > if the above theory applies we should see it trying to probe on a
> > device that is missing in the correctly booted case and got past your
> > existing checks.
> By redoing this test we found there is only one AHCI detected, and the
> BDF is: bus=0, device=8, fun=0.
> 
> With or without this patch, only one AHCI. But without this patch, the
> AHCI initialization doesn't hang.
> 


This is all very odd. Just so there is no chance of misunderstanding:
did you check the BDF that the ahci driver is trying to probe directly,
i.e. with something like what I added as the top commit here?
https://git.kernel.org/pub/scm/linux/kernel/git/niks/linux.git/log/?h=loongarch_debug
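
To make clear what I mean by "the BDF", here is a minimal user-space
sketch of how a devfn decodes into device and function numbers. It is
not the debug commit from the branch above. The bit layout matches the
kernel's PCI_SLOT(), PCI_FUNC() and PCI_DEVFN() macros, but the names
SLOT_OF, FUNC_OF, MAKE_DEVFN and format_bdf are made up for this
illustration. In a driver's probe function one would more likely just
print pci_name(pdev).

```c
#include <assert.h>
#include <stdio.h>

/* Same bit layout as the kernel's PCI_SLOT()/PCI_FUNC()/PCI_DEVFN(). */
#define SLOT_OF(devfn)		(((devfn) >> 3) & 0x1f)
#define FUNC_OF(devfn)		((devfn) & 0x07)
#define MAKE_DEVFN(slot, func)	((((slot) & 0x1f) << 3) | ((func) & 0x07))

/*
 * Hypothetical helper: write "bus:dev.fn" (hex) into buf and return the
 * number of characters written, excluding the terminating NUL.
 */
static int format_bdf(char *buf, size_t len, unsigned int bus,
		      unsigned int devfn)
{
	assert(buf != NULL && len >= 8);	/* "bb:dd.f" plus NUL */
	return snprintf(buf, len, "%02x:%02x.%x",
			bus & 0xff, SLOT_OF(devfn), FUNC_OF(devfn));
}
```

For the controller you reported (bus=0, device=8, fun=0) the devfn is
MAKE_DEVFN(8, 0), i.e. 0x40, and the next function the scan could touch
is 0x41.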

I ask because, with the AHCI controller at devfn 08.0 and
dev->multifunction likely not set, this patch would make a difference:
it would try to enumerate 08.1 and so on, while without this patch
these would be skipped because of the dev && !dev->multifunction
condition, even though isolated function probing should look at all
functions. I was thinking that maybe this causes us to end up trying to
probe an AHCI controller where there is none.
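
The difference I have in mind can be modeled roughly as below. This is
a simplified user-space sketch, not the kernel's scanning code: it
ignores ARI, assumes isolated function probing is active, and all
names (fn_info, next_fn_model, count_probed, scan_all_fns) are
hypothetical. The scan_all_fns flag stands in for the behavior with
this patch applied.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FN 7	/* functions 0..7 in a non-ARI slot */

/* Simplified stand-in for a scanned function, not struct pci_dev. */
struct fn_info {
	bool present;
	bool multifunction;
};

/*
 * Return the next function number to scan, or -1 to stop.
 * With scan_all_fns == false, a present function that is not marked
 * multifunction ends the scan of the slot (the
 * "dev && !dev->multifunction" condition).
 */
static int next_fn_model(const struct fn_info *dev, int fn, bool scan_all_fns)
{
	assert(fn >= 0 && fn <= MAX_FN);
	if (fn >= MAX_FN)
		return -1;
	if (!scan_all_fns && dev->present && !dev->multifunction)
		return -1;
	return fn + 1;
}

/* Count how many function numbers of one slot get probed. */
static int count_probed(const struct fn_info slot[MAX_FN + 1],
			bool scan_all_fns)
{
	int fn = 0, probed = 0;

	do {
		probed++;
		fn = next_fn_model(&slot[fn], fn, scan_all_fns);
	} while (fn >= 0);
	return probed;
}
```

For a slot where only function 0 is present and multifunction is not
set, this model probes one function without the flag and all eight
with it, which is where the extra config accesses to 08.1 through 08.7
would come from.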

Another thing I could imagine, especially with commit 2410e3301fcc
("PCI: loongson: Don't access non-existent devices") in mind, is that
accessing the device/vendor config space of some non-existent devices
leaves your PCIe controller in a bad state, and then the MMIOs for
the AHCI enable get lost or something. Maybe you could add debug code in
the relevant parts of drivers/pci/controller/pci-loongson.c to check
which devices get accessed with this patch vs. without it? Would it help
if I provided a debug patch for that? Though I really don't know which
part is relevant for the system you're seeing the problem on.
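
To sketch the kind of comparison I mean: record which devfns on a bus
see a config access in each boot and then look at the ones that only
show up with the patch. The code below is a stand-alone user-space
illustration, not a patch for pci-loongson.c. The access_log structure
and all function names are hypothetical, and in the kernel the
recording would more likely be a simple print in the config access
path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define DEVFN_COUNT 256	/* 32 devices x 8 functions on one bus */

/* Hypothetical log: one bit per devfn that saw a config access. */
struct access_log {
	uint8_t seen[DEVFN_COUNT / 8];
};

static void log_init(struct access_log *log)
{
	assert(log != NULL);
	memset(log->seen, 0, sizeof(log->seen));
}

static void log_access(struct access_log *log, unsigned int devfn)
{
	devfn &= DEVFN_COUNT - 1;
	log->seen[devfn / 8] |= (uint8_t)(1u << (devfn % 8));
}

static bool was_accessed(const struct access_log *log, unsigned int devfn)
{
	devfn &= DEVFN_COUNT - 1;
	return (log->seen[devfn / 8] & (1u << (devfn % 8))) != 0;
}

/* Store devfns accessed only in the patched run; return their count. */
static int only_with_patch(const struct access_log *patched,
			   const struct access_log *unpatched,
			   unsigned int out[DEVFN_COUNT])
{
	int n = 0;

	for (unsigned int devfn = 0; devfn < DEVFN_COUNT; devfn++)
		if (was_accessed(patched, devfn) &&
		    !was_accessed(unpatched, devfn))
			out[n++] = devfn;
	return n;
}
```

Any devfn returned by only_with_patch() would be a candidate for an
access that upsets the controller.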

Thanks,
Niklas
