Message-ID: <d4084296-9d36-64ec-8a79-77d82ac6d31c@canonical.com>
Date: Mon, 13 Sep 2021 18:31:02 +1200
From: Matthew Ruffell <matthew.ruffell@...onical.com>
To: linux-pci@...r.kernel.org
Cc: lkml <linux-kernel@...r.kernel.org>, alex.williamson@...hat.com,
kvm@...r.kernel.org
Subject: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through
2x GPUs that share same pci switch via vfio
Dear PCI, KVM and VFIO Subsystem Maintainers,
I have a user who can reliably reproduce a host lockup when passing 2x GPUs to
a KVM guest via vfio-pci, where the two GPUs share the same PCI switch. If
the user passes through multiple GPUs, and selects them such that no GPU shares
the same PCI switch as any other GPU, the system is stable.
System Information:
- SuperMicro X9DRG-O(T)F
- 8x Nvidia GeForce RTX 2080 Ti GPUs
- Ubuntu 20.04 LTS
- 5.14.0 mainline kernel
- libvirt 6.0.0-0ubuntu8.10
- qemu 4.2-3ubuntu6.16
Kernel command line:
Command line: BOOT_IMAGE=/vmlinuz-5.14-051400-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro intel_iommu=on hugepagesz=1G hugepages=240 kvm.report_ignored_msrs=0 kvm.ignore_msrs=1 vfio-pci.ids=10de:1e04,10de:10f7,10de:1ad6,10de:1ad7 console=ttyS1,115200n8 ignore_loglevel crashkernel=512M
The output of lspci -vvv, run as root under kernel 5.14.0, is available in the
pastebin below, and is also attached to this message.
https://paste.ubuntu.com/p/TVNvvXC7Z9/
The output of lspci -tv, run as root, is available in the pastebin below:
https://paste.ubuntu.com/p/52Y69PbjZg/
The symptoms are:
When multiple GPUs are passed through to a KVM guest via vfio-pci, and a pair
of them shares the same PCI switch, then repeatedly starting the VM, panicking
the VM / force restarting the VM, and looping will eventually trigger the
following kernel oops on the host:
irq 31: nobody cared (try booting with the "irqpoll" option)
CPU: 23 PID: 0 Comm: swapper/23 Kdump: loaded Not tainted 5.14-051400-generic #202108310811-Ubuntu
Hardware name: Supermicro X9DRG-O(T)F/X9DRG-O(T)F, BIOS 3.3 11/27/2018
Call Trace:
<IRQ>
dump_stack_lvl+0x4a/0x5f
dump_stack+0x10/0x12
__report_bad_irq+0x3a/0xaf
note_interrupt.cold+0xb/0x60
handle_irq_event_percpu+0x72/0x80
handle_irq_event+0x3b/0x60
handle_fasteoi_irq+0x9c/0x150
__common_interrupt+0x4b/0xb0
common_interrupt+0x4a/0xa0
asm_common_interrupt+0x1e/0x40
RIP: 0010:__do_softirq+0x73/0x2ae
Code: 7b 61 4c 00 01 00 00 89 75 a8 c7 45 ac 0a 00 00 00 48 89 45 c0 48 89 45 b0 65 66 c7 05 54 c7 62 4c 00 00 fb 66 0f 1f 44 00 00 <bb> ff ff ff ff 49 c7 c7 c0 60 80 b4 41 0f bc de 83 c3 01 89 5d d4
RSP: 0018:ffffba440cc04f80 EFLAGS: 00000286
RAX: ffff93c5a0929880 RBX: 0000000000000000 RCX: 00000000000006e0
RDX: 0000000000000001 RSI: 0000000004200042 RDI: ffff93c5a1104980
RBP: ffffba440cc04fd8 R08: 0000000000000000 R09: 000000f47ad6e537
R10: 000000f47a99de21 R11: 000000f47a99dc37 R12: ffffba440c68be08
R13: 0000000000000001 R14: 0000000000000200 R15: 0000000000000000
irq_exit_rcu+0x8d/0xa0
sysvec_apic_timer_interrupt+0x7c/0x90
</IRQ>
asm_sysvec_apic_timer_interrupt+0x12/0x20
RIP: 0010:tick_nohz_idle_enter+0x47/0x50
Code: 30 4b 4d 48 83 bb b0 00 00 00 00 75 20 80 4b 4c 01 e8 5d 0c ff ff 80 4b 4c 04 48 89 43 78 e8 50 e8 f8 ff fb 66 0f 1f 44 00 00 <5b> 5d c3 0f 0b eb dc 66 90 0f 1f 44 00 00 55 48 89 e5 53 48 c7 c3
RSP: 0018:ffffba440c68beb0 EFLAGS: 00000213
RAX: 000000f5424040a4 RBX: ffff93e51fadf680 RCX: 000000000000001f
RDX: 0000000000000000 RSI: 000000002f684d00 RDI: ffe8b4bb6b90380b
RBP: ffffba440c68beb8 R08: 000000f5424040a4 R09: 0000000000000001
R10: ffffffffb4875460 R11: 0000000000000017 R12: 0000000000000093
R13: ffff93c5a0929880 R14: 0000000000000000 R15: 0000000000000000
do_idle+0x47/0x260
? do_idle+0x197/0x260
cpu_startup_entry+0x20/0x30
start_secondary+0x127/0x160
secondary_startup_64_no_verify+0xc2/0xcb
handlers:
[<00000000b16da31d>] vfio_intx_handler
Disabling IRQ #31
The IRQs on which this occurs are: 25, 27, 106, 31, 29. These correspond to the
PEX 8747 PCIe switches present in the system:
*-pci
description: PCI bridge
product: PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch
vendor: PLX Technology, Inc.
bus info: pci@...0:02:00.0
capabilities: pci msi pciexpress
configuration: driver=pcieport
resources: irq:25
*-pci
description: PCI bridge
product: PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch
vendor: PLX Technology, Inc.
bus info: pci@...0:06:00.0
capabilities: pci msi pciexpress
configuration: driver=pcieport
resources: irq:27
*-pci
description: PCI bridge
product: PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch
vendor: PLX Technology, Inc.
bus info: pci@...0:82:00.0
capabilities: pci msi pciexpress
configuration: driver=pcieport
resources: irq:29
*-pci
description: PCI bridge
product: PEX 8747 48-Lane, 5-Port PCI Express Gen 3 (8.0 GT/s) Switch
vendor: PLX Technology, Inc.
bus info: pci@...0:86:00.0
capabilities: pci msi pciexpress
configuration: driver=pcieport
resources: irq:31
When the system hits the kernel oops, the host crashes and the crashkernel
boots, but it gets stuck initialising the IOMMU:
DMAR: Host address width 46
DMAR: DRHD base: 0x000000fbffe000 flags: 0x0
DMAR: dmar0: reg_base_addr fbffe000 ver 1:0 cap d2078c106f0466 ecap f020de
DMAR: DRHD base: 0x000000cbffc000 flags: 0x1
DMAR: dmar1: reg_base_addr cbffc000 ver 1:0 cap d2078c106f0466 ecap f020de
DMAR: RMRR base: 0x0000005f21a000 end: 0x0000005f228fff
DMAR: ATSR flags: 0x0
DMAR: RHSA base: 0x000000fbffe000 proximity domain: 0x1
DMAR: RHSA base: 0x000000cbffc000 proximity domain: 0x0
DMAR-IR: IOAPIC id 3 under DRHD base 0xfbffe000 IOMMU 0
DMAR-IR: IOAPIC id 0 under DRHD base 0xcbffc000 IOMMU 1
DMAR-IR: IOAPIC id 2 under DRHD base 0xcbffc000 IOMMU 1
DMAR-IR: HPET id 0 under DRHD base 0xcbffc000
[ 3.271530] DMAR-IR: Queued invalidation will be enabled to support x2apic and Intr-remapping.
[ 3.282572] DMAR-IR: Copied IR table for dmar0 from previous kernel
[ 13.291319] DMAR-IR: Copied IR table for dmar1 from previous kernel
The crashkernel then hard locks, and the system must be manually rebooted. Note
that it took ten seconds to copy the IR table for dmar1, which is most unusual.
If we instead crash the host with sysrq-trigger, there is no ten second delay,
and the very next message is:
DMAR-IR: Enabled IRQ remapping in x2apic mode
This leads us to believe that the crashkernel is getting stuck between copying
the IR table, re-enabling the IRQ that was disabled by "nobody cared", and
globally enabling IRQ remapping.
Things we have tried:
We have tried adding vfio-pci.nointxmask=1 to the kernel command line, but then
we cannot start a VM whose GPUs share the same PCI switch; instead we get a
libvirt error:
Fails to start: vfio 0000:05:00.0: Failed to set up TRIGGER eventfd signaling for interrupt INTX-0: VFIO_DEVICE_SET_IRQS failure: Device or resource busy
Starting a VM with GPUs all from different PCI switches works just fine.
We tried adding "options snd-hda-intel enable_msi=1" to /etc/modprobe.d/snd-hda-intel.conf,
and while it did enable MSI for all PCI devices under each GPU, MSI remained
disabled on each of the PLX PCI switches, and the issue still reproduces when
GPUs share PCI switches.
We have ruled out ACS issues, as each PLX PCI switch and Nvidia GPU are
allocated their own isolated IOMMU group:
https://paste.ubuntu.com/p/9VRt2zrqRR/
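For reference, the IOMMU grouping in that paste can be reproduced with a small
script along the following lines (a sketch against the standard sysfs layout;
this is not the exact script we ran, and on a host without an active IOMMU the
listing is simply empty):

```python
#!/usr/bin/env python3
# Sketch: enumerate IOMMU groups from sysfs and list the PCI devices
# belonging to each group.
import os

GROUPS = "/sys/kernel/iommu_groups"

def iommu_groups():
    """Return a dict mapping group number -> sorted list of PCI addresses."""
    groups = {}
    if not os.path.isdir(GROUPS):  # IOMMU disabled or unsupported
        return groups
    for grp in sorted(os.listdir(GROUPS), key=int):
        devs = os.listdir(os.path.join(GROUPS, grp, "devices"))
        groups[int(grp)] = sorted(devs)
    return groups

if __name__ == "__main__":
    for grp, devs in iommu_groups().items():
        print(f"group {grp}: {' '.join(devs)}")
```

On the affected host, each PLX switch port and each GPU function shows up in
its own group, which is what rules out ACS-related grouping problems.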
Looking at the initial kernel oops, we hit __report_bad_irq() in
kernel/irq/spurious.c, which means that 99,900 of the previous 100,000
interrupts coming from the PCI switch went unhandled: vfio_intx_handler()
does not process them, likely because the PCI switch itself is not passed
through to the VM, only the VGA PCI devices are.
/*
 * If 99,900 of the previous 100,000 interrupts have not been handled
 * then assume that the IRQ is stuck in some manner. Drop a diagnostic
 * and try to turn the IRQ off.
 *
 * (The other 100-of-100,000 interrupts may have been a correctly
 *  functioning device sharing an IRQ with the failing one)
 */
static void __report_bad_irq(struct irq_desc *desc, irqreturn_t action_ret)
{
	unsigned int irq = irq_desc_get_irq(desc);
	struct irqaction *action;
	unsigned long flags;

	if (bad_action_ret(action_ret)) {
		printk(KERN_ERR "irq event %d: bogus return value %x\n",
				irq, action_ret);
	} else {
		printk(KERN_ERR "irq %d: nobody cared (try booting with "
				"the \"irqpoll\" option)\n", irq);
	}
	dump_stack();
	printk(KERN_ERR "handlers:\n");

	/*
	 * We need to take desc->lock here. note_interrupt() is called
	 * w/o desc->lock held, but IRQ_PROGRESS set. We might race
	 * with something else removing an action. It's ok to take
	 * desc->lock here. See synchronize_irq().
	 */
	raw_spin_lock_irqsave(&desc->lock, flags);
	for_each_action_of_desc(desc, action) {
		printk(KERN_ERR "[<%p>] %ps", action->handler, action->handler);
		if (action->thread_fn)
			printk(KERN_CONT " threaded [<%p>] %ps",
					action->thread_fn, action->thread_fn);
		printk(KERN_CONT "\n");
	}
	raw_spin_unlock_irqrestore(&desc->lock, flags);
}
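To make the 99,900-of-100,000 threshold concrete, here is a simplified model
of the accounting note_interrupt() performs before it calls
__report_bad_irq() and disables the line. Field names mirror struct irq_desc,
but the real kernel's timing heuristics (e.g. the SPURIOUS_DEFERRED handling)
are deliberately omitted, so this is only an illustration of the counter
logic, not the actual code path:

```python
# Simplified model of the spurious-IRQ accounting in kernel/irq/spurious.c:
# every interrupt bumps irq_count; unhandled ones also bump irqs_unhandled.
# Once irq_count reaches 100,000, the IRQ is disabled if more than 99,900
# of those interrupts went unhandled; otherwise both counters reset.

class IrqDesc:
    def __init__(self):
        self.irq_count = 0
        self.irqs_unhandled = 0
        self.disabled = False

    def note_interrupt(self, handled: bool) -> None:
        if self.disabled:
            return
        self.irq_count += 1
        if not handled:
            self.irqs_unhandled += 1
        if self.irq_count >= 100000:
            if self.irqs_unhandled > 99900:
                # corresponds to __report_bad_irq() + disabling the line
                self.disabled = True
            self.irq_count = 0
            self.irqs_unhandled = 0

desc = IrqDesc()
for _ in range(100000):
    desc.note_interrupt(handled=False)  # handler keeps returning IRQ_NONE
print("disabled:", desc.disabled)       # -> disabled: True
```

In our case the switch's INTx line keeps firing with no handler claiming it,
so the counter crosses the threshold and the host disables IRQ 31, matching
the "Disabling IRQ #31" message above.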
Any help with debugging this issue would be greatly appreciated. We are able
to gather any information requested, and can test patches or debug patches.
Thanks,
Matthew Ruffell