Message-ID: <CAAhV-H5-6t=4fybED5x7bRQWSdrS_578oF3=_OY4cr5yGsxqQA@mail.gmail.com>
Date: Wed, 24 Dec 2025 17:12:00 +0800
From: Huacai Chen <chenhuacai@...nel.org>
To: Niklas Schnelle <schnelle@...ux.ibm.com>
Cc: Tianrui Zhao <zhaotianrui@...ngson.cn>, Bibo Mao <maobibo@...ngson.cn>,
Bjorn Helgaas <bhelgaas@...gle.com>, Jan Kiszka <jan.kiszka@...mens.com>,
linux-s390 <linux-s390@...r.kernel.org>, loongarch@...ts.linux.dev,
Farhan Ali <alifm@...ux.ibm.com>, Matthew Rosato <mjrosato@...ux.ibm.com>,
Gerald Schaefer <gerald.schaefer@...ux.ibm.com>, Heiko Carstens <hca@...ux.ibm.com>,
Vasily Gorbik <gor@...ux.ibm.com>, Alexander Gordeev <agordeev@...ux.ibm.com>,
Sven Schnelle <svens@...ux.ibm.com>, Christian Borntraeger <borntraeger@...ux.ibm.com>,
Gerd Bayer <gbayer@...ux.ibm.com>, linux-kernel@...r.kernel.org,
linux-pci@...r.kernel.org
Subject: Re: [PATCH v5 1/2] PCI: Fix isolated PCI function probing with ARI and SR-IOV

On Thu, Dec 18, 2025 at 8:03 PM Niklas Schnelle <schnelle@...ux.ibm.com> wrote:
>
> On Wed, 2025-12-17 at 14:55 +0800, Huacai Chen wrote:
> > On Thu, Dec 4, 2025 at 5:45 AM Niklas Schnelle <schnelle@...ux.ibm.com> wrote:
> > >
> > > On Mon, 2025-12-01 at 22:45 +0800, Huacai Chen wrote:
> > > >
> > > --- snip ---
> > > > You said that "it feels like this is just a hack to probe an odd
> > > > topology". Yes, to some extent you are right.
> > > >
> > > > 1, One of our SoCs (LS2K3000) has a special device which has func1 but
> > > > without func0. To let the PCI core scan func1 we can only make
> > > > hypervisor_isolated_pci_functions() return true.
> > > > 2, In the above case, PCI_SCAN_ALL_PCIE_DEVS does not help.
> > > > 3, Though we change hypervisor_isolated_pci_functions() to resolve the
> > > > above problem, it also lets us pass isolated PCI functions to a guest
> > > > OS instance.
> > > >
> > > > As a summary, for real machines commit a02fd05661d73a850 is a hack to
> > > > probe an odd device, for virtual machines it allows passing isolated
> > > > PCI functions.
> > >
> > > Ok, thanks for the answer. So let's see how we can debug this and get
> > > to a solution that works for both of us. Looking around a bit I see
> > > that your pci_loongson_map_bus() has some special handling for trying
> > > not to access non-existent devices added by your commit 2410e3301fcc
> > > ("PCI: loongson: Don't access non-existent devices"). I wonder if with
> > > this patch applied we're running into this same issue but with a devfn
> > > that was previously not tried and is not covered by your checks? And
> > > maybe since your root complex doesn't return 0xff for these non-
> > > existent devices we could end up trying to probe AHCI on such an empty
> > > slot misinterpreting whatever it returns as matching device/vendor?
> > Commit 2410e3301fcc seems to have no relationship with current problems.
>
> I'm not so sure. The only thing this patch is potentially supposed to
> change is which devfns get enumerated and thus config space accessed
> looking for a device. And that commit talks about accessing non
> existent devices causing a system hang so that does seem fitting in
> principle.
>
> > >
> --- snip ---
> > > Could you try redoing the test with the AHCI hang but add a print of
> > > the affected bus/device/function that AHCI thinks it is probing? Then
> > > if the above theory applies we should see it trying to probe on a
> > > device that is missing in the correctly booted case and got past your
> > > existing checks.
> > By redoing this test we found there is only one AHCI detected, and the
> > BDF is: bus=0, device=8, fun=0.
> >
> > With or without this patch, only one AHCI. But without this patch, the
> > AHCI initialization doesn't hang.
> >
>
>
> This is all very odd. Just so there is no chance of misunderstanding.
> You did check the BDF that the ahci driver is trying to probe directly?
> I.e. something like what I added as the top commit here:
> https://git.kernel.org/pub/scm/linux/kernel/git/niks/linux.git/log/?h=loongarch_debug
We checked the BDF in ahci_init_one().
With your repo, the boot log looks like this:
[ 10.454172] ahci 0000:00:08.0: ahci_enable_ahci() hung
[ 10.459292] ------------[ cut here ]------------
[ 10.463876] WARNING: drivers/ata/libahci.c:459 at
ahci_save_initial_config+0x3d8/0x448, CPU#0: kworker/0:2/253
[ 10.473824] Modules linked in:
[ 10.476856] CPU: 0 UID: 0 PID: 253 Comm: kworker/0:2 Not tainted
6.19.0-rc1+ #1 PREEMPT(full)
[ 10.485416] Hardware name: To be filled by O.E.M.To be fill To be
filled by O.E.M.To be fill/To be filled by O.E.M.To be fill, BIOS
Loongson-UDK2018-V4.0.
[ 10.499160] Workqueue: events work_for_cpu_fn
[ 10.503489] pc 900000000104e3f8 ra 900000000104e3f8 tp
900000010fb74000 sp 900000010fb77c20
[ 10.511788] a0 000000000000002a a1 90000000027223c0 a2
900000010fb77968 a3 90000000027223c8
[ 10.520089] a4 90000000027223c0 a5 900000010fb77960 a6
0000000000000001 a7 0000000000000001
[ 10.528390] t0 09c296afbf53694b t1 09c296afbf53694b t2
ffffffffffffffff t3 0000000000000001
[ 10.536693] t4 fffffffffffffffe t5 0000000000000332 t6
0000000000000005 t7 00000000000011bd
[ 10.544993] t8 0000000000000000 u0 9000000000232db0 s9
900000000226a4a8 s0 9000000100e7ea30
[ 10.553293] s1 ffff80003243d800 s2 90000001012af0b8 s3
900000010fb77cc8 s4 9000000002269fb8
[ 10.561592] s5 0000000000000005 s6 9000000002429000 s7
90000001012af0b8 s8 900000010fb77d08
[ 10.569890] ra: 900000000104e3f8 ahci_save_initial_config+0x3d8/0x448
[ 10.576551] ERA: 900000000104e3f8 ahci_save_initial_config+0x3d8/0x448
[ 10.583212] CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
[ 10.589359] PRMD: 00000004 (PPLV0 +PIE -PWE)
[ 10.593687] EUEN: 00000000 (-FPE -SXE -ASXE -BTE)
[ 10.598447] ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
[ 10.603207] ESTAT: 000c0000 [BRK] (IS= ECode=12 EsubCode=0)
[ 10.608744] PRID: 0014d010 (Loongson-64bit, Loongson-3C6000/S)
[ 10.614625] CPU: 0 UID: 0 PID: 253 Comm: kworker/0:2 Not tainted
6.19.0-rc1+ #1 PREEMPT(full)
[ 10.614628] Hardware name: To be filled by O.E.M.To be fill To be
filled by O.E.M.To be fill/To be filled by O.E.M.To be fill, BIOS
Loongson-UDK2018-V4.0.
[ 10.614630] Workqueue: events work_for_cpu_fn
[ 10.614633] Stack : 900000010fb77858 0000000000000000
900000000024371c 900000010fb74000
[ 10.614637] 900000010fb77830 900000010fb77838
0000000000000000 900000010fb77978
[ 10.614639] 900000010fb77970 900000010fb77970
900000010fb77650 0000000000000001
[ 10.614642] 0000000000000001 900000010fb77838
09c296afbf53694b 9000000100414300
[ 10.614645] 80000000ffffe349 fffffffffffffffe
000000000000034b 000000000000002f
[ 10.614648] 0000000000000063 0000000000000001
000000000672c000 900000000226a4a8
[ 10.614651] 0000000000000000 0000000000000000
90000000020fd980 9000000002429000
[ 10.614654] 00000000000001cb 0000000000000009
0000000000000002 90000001012af0b8
[ 10.614657] 900000010fb77d08 0000000000000000
9000000000243734 0000000000000000
[ 10.614659] 00000000000000b0 0000000000000004
0000000000000000 0000000000071c1d
[ 10.614662] ...
[ 10.614664] Call Trace:
[ 10.614665] [<9000000000243734>] show_stack+0x5c/0x180
[ 10.614669] [<900000000023f3c4>] dump_stack_lvl+0x6c/0x9c
[ 10.614674] [<9000000000267168>] __warn+0x90/0x128
[ 10.614679] [<90000000017e983c>] __report_bug+0x84/0x198
[ 10.614683] [<90000000017e9a4c>] report_bug+0x3c/0xc0
[ 10.614686] [<9000000001823910>] do_bp+0x2d0/0x340
[ 10.614690] [<9000000000241e00>] handle_bp+0x120/0x1c0
[ 10.614692] [<900000000104e3f8>] ahci_save_initial_config+0x3d8/0x448
[ 10.614695] [<900000000104b374>] ahci_init_one+0x354/0x1068
[ 10.614699] [<9000000000d2343c>] local_pci_probe+0x44/0xb8
[ 10.614703] [<9000000000286bf8>] work_for_cpu_fn+0x18/0x30
[ 10.614706] [<900000000028a4e0>] process_one_work+0x160/0x330
[ 10.614709] [<900000000028aee8>] worker_thread+0x338/0x468
[ 10.614712] [<9000000000295ef4>] kthread+0x11c/0x138
[ 10.614716] [<9000000001823a58>] ret_from_kernel_thread+0x28/0xd0
[ 10.614717] [<90000000002414e4>] ret_from_kernel_thread_asm+0xc/0x88
[ 10.614720]
[ 10.614720] ---[ end trace 0000000000000000 ]---
[ 10.818394] ahci 0000:00:08.0: forcing PORTS_IMPL to 0x1
[ 10.902172] ahci 0000:00:08.0: ahci_enable_ahci() hung
[ 10.907282] ------------[ cut here ]------------
[ 10.911866] WARNING: drivers/ata/libahci.c:994 at
ahci_reset_controller+0x88/0x1d8, CPU#0: kworker/0:2/253
[ 10.921466] Modules linked in:
[ 10.924496] CPU: 0 UID: 0 PID: 253 Comm: kworker/0:2 Tainted: G
W 6.19.0-rc1+ #1 PREEMPT(full)
[ 10.934611] Tainted: [W]=WARN
[ 10.937551] Hardware name: To be filled by O.E.M.To be fill To be
filled by O.E.M.To be fill/To be filled by O.E.M.To be fill, BIOS
Loongson-UDK2018-V4.0.
[ 10.951294] Workqueue: events work_for_cpu_fn
[ 10.955622] pc 900000000104e4f0 ra 900000000104e4f0 tp
900000010fb74000 sp 900000010fb77c10
[ 10.963922] a0 000000000000002a a1 90000000027223c0 a2
900000010fb77958 a3 90000000027223c8
[ 10.972221] a4 90000000027223c0 a5 900000010fb77950 a6
0000000000000001 a7 0000000000000001
[ 10.980519] t0 09c296afbf53694b t1 09c296afbf53694b t2
ffffffffffffffff t3 0000000000000001
[ 10.988818] t4 fffffffffffffffe t5 000000000000036b t6
0000000000000005 t7 000000000000233d
[ 10.997117] t8 0000000000000000 u0 9000000000232db0 s9
90000001001f0000 s0 900000010000dc80
[ 11.005416] s1 ffff80003243d800 s2 9000000100e7ea30 s3
9000000100e7ea30 s4 9000000002269fb8
[ 11.013715] s5 90000001012af0b8 s6 9000000002429000 s7
90000001012af0b8 s8 0000000000000001
[ 11.022014] ra: 900000000104e4f0 ahci_reset_controller+0x88/0x1d8
[ 11.028328] ERA: 900000000104e4f0 ahci_reset_controller+0x88/0x1d8
[ 11.034641] CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
[ 11.040787] PRMD: 00000004 (PPLV0 +PIE -PWE)
[ 11.045116] EUEN: 00000000 (-FPE -SXE -ASXE -BTE)
[ 11.049877] ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
[ 11.054636] ESTAT: 000c0000 [BRK] (IS= ECode=12 EsubCode=0)
[ 11.060173] PRID: 0014d010 (Loongson-64bit, Loongson-3C6000/S)
[ 11.066053] CPU: 0 UID: 0 PID: 253 Comm: kworker/0:2 Tainted: G
W 6.19.0-rc1+ #1 PREEMPT(full)
[ 11.066055] Tainted: [W]=WARN
[ 11.066056] Hardware name: To be filled by O.E.M.To be fill To be
filled by O.E.M.To be fill/To be filled by O.E.M.To be fill, BIOS
Loongson-UDK2018-V4.0.
[ 11.066057] Workqueue: events work_for_cpu_fn
[ 11.066060] Stack : 900000010fb77848 0000000000000000
900000000024371c 900000010fb74000
[ 11.066063] 900000010fb77820 900000010fb77828
0000000000000000 900000010fb77968
[ 11.066066] 900000010fb77960 900000010fb77960
900000010fb77640 0000000000000001
[ 11.066068] 0000000000000001 900000010fb77828
09c296afbf53694b 9000000100414300
[ 11.066071] 80000000ffffe384 fffffffffffffffe
0000000000000386 000000000000002f
[ 11.066074] 0000000000000063 0000000000000001
000000000672c000 90000001001f0000
[ 11.066077] 0000000000000000 0000000000000000
90000000020fd980 9000000002429000
[ 11.066079] 00000000000003e2 0000000000000009
0000000000000002 90000001012af0b8
[ 11.066082] 0000000000000001 0000000000000000
9000000000243734 0000000000000000
[ 11.066085] 00000000000000b0 0000000000000004
0000000000000000 0000000000071c1d
[ 11.066088] ...
[ 11.066089] Call Trace:
[ 11.066089] [<9000000000243734>] show_stack+0x5c/0x180
[ 11.066091] [<900000000023f3c4>] dump_stack_lvl+0x6c/0x9c
[ 11.066095] [<9000000000267168>] __warn+0x90/0x128
[ 11.066098] [<90000000017e983c>] __report_bug+0x84/0x198
[ 11.066102] [<90000000017e9a4c>] report_bug+0x3c/0xc0
[ 11.066105] [<9000000001823910>] do_bp+0x2d0/0x340
[ 11.066106] [<9000000000241e00>] handle_bp+0x120/0x1c0
[ 11.066108] [<900000000104e4f0>] ahci_reset_controller+0x88/0x1d8
[ 11.066111] [<900000000104a58c>] ahci_pci_reset_controller+0x2c/0xd8
[ 11.066114] [<900000000104bc10>] ahci_init_one+0xbf0/0x1068
[ 11.066116] [<9000000000d2343c>] local_pci_probe+0x44/0xb8
[ 11.066119] [<9000000000286bf8>] work_for_cpu_fn+0x18/0x30
[ 11.066122] [<900000000028a4e0>] process_one_work+0x160/0x330
[ 11.066125] [<900000000028aee8>] worker_thread+0x338/0x468
[ 11.066128] [<9000000000295ef4>] kthread+0x11c/0x138
[ 11.066131] [<9000000001823a58>] ret_from_kernel_thread+0x28/0xd0
[ 11.066132] [<90000000002414e4>] ret_from_kernel_thread_asm+0xc/0x88
[ 11.066134]
[ 11.066135] ---[ end trace 0000000000000000 ]---
[ 11.358172] ahci 0000:00:08.0: ahci_enable_ahci() hung
[ 11.363283] ------------[ cut here ]------------
[ 11.367867] WARNING: drivers/ata/libahci.c:1028 at
ahci_reset_controller+0x1cc/0x1d8, CPU#0: kworker/0:2/253
[ 11.377638] Modules linked in:
[ 11.380668] CPU: 0 UID: 0 PID: 253 Comm: kworker/0:2 Tainted: G
W 6.19.0-rc1+ #1 PREEMPT(full)
[ 11.390783] Tainted: [W]=WARN
[ 11.393723] Hardware name: To be filled by O.E.M.To be fill To be
filled by O.E.M.To be fill/To be filled by O.E.M.To be fill, BIOS
Loongson-UDK2018-V4.0.
[ 11.407466] Workqueue: events work_for_cpu_fn
[ 11.411792] pc 900000000104e634 ra 900000000104e634 tp
900000010fb74000 sp 900000010fb77c10
[ 11.420092] a0 000000000000002a a1 90000000027223c0 a2
900000010fb77958 a3 90000000027223c8
[ 11.428391] a4 90000000027223c0 a5 900000010fb77950 a6
0000000000000001 a7 0000000000000001
[ 11.436691] t0 09c296afbf53694b t1 09c296afbf53694b t2
ffffffffffffffff t3 0000000000000001
[ 11.444990] t4 fffffffffffffffe t5 00000000000003a6 t6
0000000000000005 t7 0000000000000dfd
[ 11.453288] t8 0000000000000000 u0 9000000000232db0 s9
90000001001f0000 s0 900000010000dc80
[ 11.461587] s1 ffff80003243d800 s2 9000000100e7ea30 s3
9000000100e7ea30 s4 9000000002269fb8
[ 11.469885] s5 90000001012af0b8 s6 9000000002429000 s7
90000001012af0b8 s8 0000000000000001
[ 11.478185] ra: 900000000104e634 ahci_reset_controller+0x1cc/0x1d8
[ 11.484586] ERA: 900000000104e634 ahci_reset_controller+0x1cc/0x1d8
[ 11.490987] CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
[ 11.497131] PRMD: 00000004 (PPLV0 +PIE -PWE)
[ 11.501458] EUEN: 00000000 (-FPE -SXE -ASXE -BTE)
[ 11.506218] ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
[ 11.510977] ESTAT: 000c0000 [BRK] (IS= ECode=12 EsubCode=0)
[ 11.516512] PRID: 0014d010 (Loongson-64bit, Loongson-3C6000/S)
[ 11.522393] CPU: 0 UID: 0 PID: 253 Comm: kworker/0:2 Tainted: G
W 6.19.0-rc1+ #1 PREEMPT(full)
[ 11.522395] Tainted: [W]=WARN
[ 11.522396] Hardware name: To be filled by O.E.M.To be fill To be
filled by O.E.M.To be fill/To be filled by O.E.M.To be fill, BIOS
Loongson-UDK2018-V4.0.
[ 11.522397] Workqueue: events work_for_cpu_fn
[ 11.522400] Stack : 900000010fb77848 0000000000000000
900000000024371c 900000010fb74000
[ 11.522403] 900000010fb77820 900000010fb77828
0000000000000000 900000010fb77968
[ 11.522406] 900000010fb77960 900000010fb77960
900000010fb77640 0000000000000001
[ 11.522409] 0000000000000001 900000010fb77828
09c296afbf53694b 9000000100414300
[ 11.522412] 80000000ffffe3bf fffffffffffffffe
00000000000003c1 000000000000002f
[ 11.522415] 0000000000000063 0000000000000001
000000000672c000 90000001001f0000
[ 11.522418] 0000000000000000 0000000000000000
90000000020fd980 9000000002429000
[ 11.522420] 0000000000000404 0000000000000009
0000000000000002 90000001012af0b8
[ 11.522423] 0000000000000001 0000000000000000
9000000000243734 0000000000000000
[ 11.522426] 00000000000000b0 0000000000000004
0000000000000000 0000000000071c1d
[ 11.522428] ...
[ 11.522430] Call Trace:
[ 11.522430] [<9000000000243734>] show_stack+0x5c/0x180
[ 11.522432] [<900000000023f3c4>] dump_stack_lvl+0x6c/0x9c
[ 11.522436] [<9000000000267168>] __warn+0x90/0x128
[ 11.522439] [<90000000017e983c>] __report_bug+0x84/0x198
[ 11.522442] [<90000000017e9a4c>] report_bug+0x3c/0xc0
[ 11.522445] [<9000000001823910>] do_bp+0x2d0/0x340
[ 11.522447] [<9000000000241e00>] handle_bp+0x120/0x1c0
[ 11.522448] [<900000000104e634>] ahci_reset_controller+0x1cc/0x1d8
[ 11.522451] [<900000000104a58c>] ahci_pci_reset_controller+0x2c/0xd8
[ 11.522454] [<900000000104bc10>] ahci_init_one+0xbf0/0x1068
[ 11.522456] [<9000000000d2343c>] local_pci_probe+0x44/0xb8
[ 11.522459] [<9000000000286bf8>] work_for_cpu_fn+0x18/0x30
[ 11.522462] [<900000000028a4e0>] process_one_work+0x160/0x330
[ 11.522465] [<900000000028aee8>] worker_thread+0x338/0x468
[ 11.522467] [<9000000000295ef4>] kthread+0x11c/0x138
[ 11.522470] [<9000000001823a58>] ret_from_kernel_thread+0x28/0xd0
[ 11.522471] [<90000000002414e4>] ret_from_kernel_thread_asm+0xc/0x88
[ 11.522473]
[ 11.522474] ---[ end trace 0000000000000000 ]---
[ 11.736718] ahci 0000:00:08.0: AHCI vers 0000.0000, 1 command
slots, ? Gbps, unknown mode
[ 11.744847] ahci 0000:00:08.0: 1/1 ports implemented (port mask 0x1)
[ 11.751159] ahci 0000:00:08.0: flags:
[ 11.755115] scsi host0: ahci
[ 11.758048] ata1: SATA max UDMA/133 abar m1024@...0030401800 port
0xe0030401900 irq 44 lpm-pol 1
[ 11.767294] e1000: Intel(R) PRO/1000 Network Driver
[ 11.772145] e1000: Copyright (c) 1999-2006 Intel Corporation.
[ 11.777870] e1000e: Intel(R) PRO/1000 Network Driver
[ 11.782800] e1000e: Copyright(c) 1999 - 2015 Intel Corporation.
Huacai
>
> This is because with the AHCI controller having devfn 08.0 and
> likely dev->multifunction not set this patch would make a difference in
> that it would try to enumerate 08.1 and so on while without this patch
> these would be skipped because of the dev && !dev->multifunction
> condition even though isolated function probing should look at all
> functions. And I was thinking maybe this causes us to end up trying to
> probe an AHCI controller where there is none.
>
> Another thing I could imagine, especially with commit 2410e3301fcc
> ("PCI: loongson: Don't access non-existent devices") in mind is that
> accessing the device/vendor config space for some non existent devices
> leaves your PCIe controller in some bad state and then the MMIOs for
> the AHCI enable get lost or something. Maybe you could add debug code in
> the relevant parts of drivers/pci/controller/pci-loongson.c to check
> which devices get accessed with this patch vs without it? Would it help
> if I provided a debug patch for that? Though I really don't know what
> part is relevant for the system you're seeing the problem with.
>
> Thanks,
> Niklas
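
To illustrate the enumeration difference described in the quoted text above, here is a toy user-space model of a per-slot function scan. It is only a sketch built from the description in this thread, not the kernel's pci_scan_slot() code. The function name probed_functions() and all of its parameters are hypothetical. ARI is ignored, and "patched" is assumed to mean that the multifunction check is skipped when isolated function probing is enabled.

```python
def probed_functions(present, fn0_multifunction, isolated, patched):
    """Return the function numbers whose config space the model would read.

    present           -- set of function numbers that really exist in the slot
    fn0_multifunction -- multifunction bit reported by function 0 (if present)
    isolated          -- models hypervisor_isolated_pci_functions() being true
    patched           -- assumption: skip the multifunction check when isolated
    """
    probed = []
    fn = 0
    while True:
        probed.append(fn)  # a config space read is attempted for this function
        found = fn in present
        # A device found at fn > 0 is treated as multifunction in this model.
        multifunction = found and (fn > 0 or fn0_multifunction)
        if not found and fn == 0 and not isolated:
            break  # function 0 is required outside isolated-function mode
        if fn >= 7:
            break  # no more functions without ARI
        stop_on_single = found and not multifunction
        if stop_on_single and not (patched and isolated):
            break  # the "dev && !dev->multifunction" style early exit
        fn += 1
    return probed


# AHCI at 08.0, multifunction bit assumed clear, isolated probing enabled
print(probed_functions({0}, False, True, False))  # model without the patch
print(probed_functions({0}, False, True, True))   # model with the patch
```

In this model the unpatched scan stops after function 0, while the patched scan also reads the config space of functions 1 to 7. If the root complex misbehaves on accesses to those absent functions, that alone would account for a behavior change with the patch applied.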