Message-ID: <d4084296-9d36-64ec-8a79-77d82ac6d31c@canonical.com>
Date: Mon, 13 Sep 2021 18:31:02 +1200
From: Matthew Ruffell <matthew.ruffell@...onical.com>
To: linux-pci@...r.kernel.org
Cc: lkml <linux-kernel@...r.kernel.org>, alex.williamson@...hat.com,
kvm@...r.kernel.org
Subject: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through
2x GPUs that share same pci switch via vfio
Dear PCI, KVM and VFIO Subsystem Maintainers,
I have a user who can reliably reproduce a host lockup when passing 2x GPUs to
a KVM guest via vfio-pci, where the two GPUs share the same PCI switch. If
the user passes through multiple GPUs, and selects them such that no GPU shares
the same PCI switch as any other GPU, the system is stable.
System Information:
- SuperMicro X9DRG-O(T)F
- 8x Nvidia GeForce RTX 2080 Ti GPUs
- Ubuntu 20.04 LTS
- 5.14.0 mainline kernel
- libvirt 6.0.0-0ubuntu8.10
- qemu 4.2-3ubuntu6.16
Kernel command line:
Command line: BOOT_IMAGE=/vmlinuz-5.14-051400-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro intel_iommu=on hugepagesz=1G hugepages=240 kvm.report_ignored_msrs=0 kvm.ignore_msrs=1 vfio-pci.ids=10de:1e04,10de:10f7,10de:1ad6,10de:1ad7 console=ttyS1,115200n8 ignore_loglevel crashkernel=512M
The output of lspci -vvv, run as root under kernel 5.14.0, is available in the
pastebin below, and is also attached to this message.
https://paste.ubuntu.com/p/TVNvvXC7Z9/
The output of lspci -tv, run as root, is available in the pastebin below:
https://paste.ubuntu.com/p/52Y69PbjZg/
The symptoms are:
When multiple GPUs are passed through to a KVM guest via vfio-pci, and a pair
of them shares the same PCI switch, then repeatedly starting the VM, panicking
the VM / force restarting the VM, and looping will eventually trigger the
following kernel oops on the host:
irq 31: nobody cared (try booting with the "irqpoll" option)
CPU: 23 PID: 0 Comm: swapper/23 Kdump: loaded Not tainted 5.14-051400-generic #202108310811-Ubuntu
Hardware name: Supermicro X9DRG-O(T)F/X9DRG-O(T)F, BIOS 3.3 11/27/2018
Call Trace:
<IRQ>
dump_stack_lvl+0x4a/0x5f
dump_stack+0x10/0x12
__report_bad_irq+0x3a/0xaf
note_interrupt.cold+0xb/0x60
handle_irq_event_percpu+0x72/0x80
handle_irq_event+0x3b/0x60
handle_fasteoi_irq+0x9c/0x150
__common_interrupt+0x4b/0xb0
common_interrupt+0x4a/0xa0
asm_common_interrupt+0x1e/0x40
RIP: 0010:__do_softirq+0x73/0x2ae
Code: 7b 61 4c 00 01 00 00 89 75 a8 c7 45 ac 0a 00 00 00 48 89 45 c0 48 89 45 b0 65 66 c7 05 54 c7 62 4c 00 00 fb 66 0f 1f 44 00 00 <bb> ff ff ff ff 49 c7 c7 c0 60 80 b4 41 0f bc de 83 c3 01 89 5d d4
RSP: 0018:ffffba440cc04f80 EFLAGS: 00000286
RAX: ffff93c5a0929880 RBX: 0000000000000000 RCX: 00000000000006e0
RDX: 0000000000000001 RSI: 0000000004200042 RDI: ffff93c5a1104980
RBP: ffffba440cc04fd8 R08: 0000000000000000 R09: 000000f47ad6e537
R10: 000000f47a99de21 R11: 000000f47a99dc37 R12: ffffba440c68be08
R13: 0000000000000001 R14: 0000000000000200 R15: 0000000000000000
irq_exit_rcu+0x8d/0xa0
sysvec_apic_timer_interrupt+0x7c/0x90
</IRQ>
asm_sysvec_apic_timer_interrupt+0x12/0x20
RIP: 0010:tick_nohz_idle_enter+0x47/0x50
Code: 30 4b 4d 48 83 bb b0 00 00 00 00 75 20 80 4b 4c 01 e8 5d 0c ff ff 80 4b 4c 04 48 89 43 78 e8 50 e8 f8 ff fb 66 0f 1f 44 00 00 <5b> 5d c3 0f 0b eb dc 66 90 0f 1f 44 00 00 55 48 89 e5 53 48 c7 c3
RSP: 0018:ffffba440c68beb0 EFLAGS: 00000213
RAX: 000000f5424040a4 RBX: ffff93e51fadf680 RCX: 000000000000001f
RDX: 0000000000000000 RSI: 000000002f684d00 RDI: ffe8b4bb6b90380b
RBP: ffffba440c68beb8 R08: 000000f5424040a4 R09: 0000000000000001
R10: ffffffffb4875460 R11: 0000000000000017 R12: 0000000000000093
R13: ffff93c5a0929880 R14: 0000000000000000 R15: 0000000000000000
do_idle+0x47/0x260
? do_idle+0x197/0x260
cpu_startup_entry+0x20/0x30
start_secondary+0x127/0x160
secondary_startup_64_no_verify+0xc2/0xcb
handlers:
[<00000000b16da31d>] vfio_intx_handler
Disabling IRQ #31
The IRQs on which this occurs are: 25, 27, 106, 31, 29. These correspond to the
PEX 8747 PCIe switches present in the system:
*-pci
description: PCI bridge
product: PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch
vendor: PLX Technology, Inc.
bus info: pci@...0:02:00.0
capabilities: pci msi pciexpress
configuration: driver=pcieport
resources: irq:25
*-pci
description: PCI bridge
product: PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch
vendor: PLX Technology, Inc.
bus info: pci@...0:06:00.0
capabilities: pci msi pciexpress
configuration: driver=pcieport
resources: irq:27
*-pci
description: PCI bridge
product: PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch
vendor: PLX Technology, Inc.
bus info: pci@...0:82:00.0
capabilities: pci msi pciexpress
configuration: driver=pcieport
resources: irq:29
*-pci
description: PCI bridge
product: PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch
vendor: PLX Technology, Inc.
bus info: pci@...0:86:00.0
capabilities: pci msi pciexpress
configuration: driver=pcieport
resources: irq:31
When the system hits the kernel oops, the host crashes and the crashkernel
boots, but it gets stuck initialising the IOMMU:
DMAR: Host address width 46
DMAR: DRHD base: 0x000000fbffe000 flags: 0x0
DMAR: dmar0: reg_base_addr fbffe000 ver 1:0 cap d2078c106f0466 ecap f020de
DMAR: DRHD base: 0x000000cbffc000 flags: 0x1
DMAR: dmar1: reg_base_addr cbffc000 ver 1:0 cap d2078c106f0466 ecap f020de
DMAR: RMRR base: 0x0000005f21a000 end: 0x0000005f228fff
DMAR: ATSR flags: 0x0
DMAR: RHSA base: 0x000000fbffe000 proximity domain: 0x1
DMAR: RHSA base: 0x000000cbffc000 proximity domain: 0x0
DMAR-IR: IOAPIC id 3 under DRHD base 0xfbffe000 IOMMU 0
DMAR-IR: IOAPIC id 0 under DRHD base 0xcbffc000 IOMMU 1
DMAR-IR: IOAPIC id 2 under DRHD base 0xcbffc000 IOMMU 1
DMAR-IR: HPET id 0 under DRHD base 0xcbffc000
[ 3.271530] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 3.282572] DMAR-IR: Copied IR table for dmar0 from previous kernel
[ 13.291319] DMAR-IR: Copied IR table for dmar1 from previous kernel
The crashkernel then hard locks, and the system must be manually rebooted. Note
that it took ten seconds to copy the IR table for dmar1, which is most unusual.
If we instead crash the host with sysrq-trigger, there is no ten second delay,
and the very next message is:
DMAR-IR: Enabled IRQ remapping in x2apic mode
This leads us to believe that the crashkernel is getting stuck between copying
the IR table, re-enabling the IRQ that was disabled by "nobody cared", and
globally enabling IRQ remapping.
Things we have tried:
We have tried adding vfio-pci.nointxmask=1 to the kernel command line, but then
we cannot start a VM whose GPUs share the same PCI switch; instead we get a
libvirt error:
Fails to start: vfio 0000:05:00.0: Failed to set up TRIGGER eventfd signaling for interrupt INTX-0: VFIO_DEVICE_SET_IRQS failure: Device or resource busy
Starting a VM with GPUs all from different PCI switches works just fine.
We tried adding "options snd-hda-intel enable_msi=1" to /etc/modprobe.d/snd-hda-intel.conf,
and while it did enable MSI for all PCI devices under each GPU, MSI remained
disabled on each of the PLX PCI switches, and the issue still reproduces when
GPUs share PCI switches.
We have ruled out ACS issues, as each PLX PCI switch and Nvidia GPU are
allocated their own isolated IOMMU group:
https://paste.ubuntu.com/p/9VRt2zrqRR/
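For reference, the IOMMU grouping in that paste can be reproduced with a small
script along the following lines (a sketch against the standard sysfs layout;
this is not the exact script we ran, and on a host without an active IOMMU the
listing is simply empty):

```python
#!/usr/bin/env python3
# Sketch: enumerate IOMMU groups from sysfs and list the PCI devices
# belonging to each group.
import os

GROUPS = "/sys/kernel/iommu_groups"

def iommu_groups():
    """Return a dict mapping group number -> sorted list of PCI addresses."""
    groups = {}
    if not os.path.isdir(GROUPS):  # IOMMU disabled or unsupported
        return groups
    for grp in sorted(os.listdir(GROUPS), key=int):
        devs = os.listdir(os.path.join(GROUPS, grp, "devices"))
        groups[int(grp)] = sorted(devs)
    return groups

if __name__ == "__main__":
    for grp, devs in iommu_groups().items():
        print(f"group {grp}: {' '.join(devs)}")
```

On the affected host, each PLX switch port and each GPU function shows up in
its own group, which is what rules out ACS-related grouping problems.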
Looking at the initial kernel oops, we hit __report_bad_irq() in
kernel/irq/spurious.c, which means that 99,900 of the previous 100,000
interrupts coming from the PCI switch went unhandled: vfio_intx_handler()
does not process them, likely because the PCI switch itself is not passed
through to the VM, only the VGA PCI devices are.
/*
 * If 99,900 of the previous 100,000 interrupts have not been handled
 * then assume that the IRQ is stuck in some manner. Drop a diagnostic
 * and try to turn the IRQ off.
 *
 * (The other 100-of-100,000 interrupts may have been a correctly
 *  functioning device sharing an IRQ with the failing one)
 */
static void __report_bad_irq(struct irq_desc *desc, irqreturn_t action_ret)
{
	unsigned int irq = irq_desc_get_irq(desc);
	struct irqaction *action;
	unsigned long flags;

	if (bad_action_ret(action_ret)) {
		printk(KERN_ERR "irq event %d: bogus return value %x\n",
				irq, action_ret);
	} else {
		printk(KERN_ERR "irq %d: nobody cared (try booting with "
				"the \"irqpoll\" option)\n", irq);
	}
	dump_stack();
	printk(KERN_ERR "handlers:\n");

	/*
	 * We need to take desc->lock here. note_interrupt() is called
	 * w/o desc->lock held, but IRQ_PROGRESS set. We might race
	 * with something else removing an action. It's ok to take
	 * desc->lock here. See synchronize_irq().
	 */
	raw_spin_lock_irqsave(&desc->lock, flags);
	for_each_action_of_desc(desc, action) {
		printk(KERN_ERR "[<%p>] %ps", action->handler, action->handler);
		if (action->thread_fn)
			printk(KERN_CONT " threaded [<%p>] %ps",
					action->thread_fn, action->thread_fn);
		printk(KERN_CONT "\n");
	}
	raw_spin_unlock_irqrestore(&desc->lock, flags);
}
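To make the 99,900-of-100,000 threshold concrete, here is a simplified model
of the accounting note_interrupt() performs before it calls
__report_bad_irq() and disables the line. Field names mirror struct irq_desc,
but the real kernel's timing heuristics (e.g. the SPURIOUS_DEFERRED handling)
are deliberately omitted, so this is only an illustration of the counter
logic, not the actual code path:

```python
# Simplified model of the spurious-IRQ accounting in kernel/irq/spurious.c:
# every interrupt bumps irq_count; unhandled ones also bump irqs_unhandled.
# Once irq_count reaches 100,000, the IRQ is disabled if more than 99,900
# of those interrupts went unhandled; otherwise both counters reset.

class IrqDesc:
    def __init__(self):
        self.irq_count = 0
        self.irqs_unhandled = 0
        self.disabled = False

    def note_interrupt(self, handled: bool) -> None:
        if self.disabled:
            return
        self.irq_count += 1
        if not handled:
            self.irqs_unhandled += 1
        if self.irq_count >= 100000:
            if self.irqs_unhandled > 99900:
                # corresponds to __report_bad_irq() + disabling the line
                self.disabled = True
            self.irq_count = 0
            self.irqs_unhandled = 0

desc = IrqDesc()
for _ in range(100000):
    desc.note_interrupt(handled=False)  # handler keeps returning IRQ_NONE
print("disabled:", desc.disabled)       # -> disabled: True
```

In our case the switch's INTx line keeps firing with no handler claiming it,
so the counter crosses the threshold and the host disables IRQ 31, matching
the "Disabling IRQ #31" message above.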
Any help with debugging this issue would be greatly appreciated. We are able
to gather any information requested, and can test patches or debug patches.
Thanks,
Matthew Ruffell