linux-kernel - Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing through 2x GPUs that share same pci switch via vfio

Message-ID: <CAKAwkKsoKELnR=--06sRZL3S6_rQVi5J_Kcv6iRQ6w2tY71WCQ@mail.gmail.com>
Date:   Mon, 1 Nov 2021 17:35:04 +1300
From:   Matthew Ruffell <matthew.ruffell@...onical.com>
To:     Alex Williamson <alex.williamson@...hat.com>
Cc:     linux-pci@...r.kernel.org, lkml <linux-kernel@...r.kernel.org>,
        kvm@...r.kernel.org, nathan.langford@...lesunifiedtechnologies.com
Subject: Re: [PROBLEM] Frequently get "irq 31: nobody cared" when passing
 through 2x GPUs that share same pci switch via vfio

Hi Alex,

Nathan has been running a workload on the 5.14 kernel + the test patch, and has
run into some interesting softlockups and hardlockups.

The first happened on a secondary server running a Windows VM, with 7 (of 10)
1080TI GPUs passed through.

Full dmesg:
https://paste.ubuntu.com/p/Wx5hCBBXKb/

There aren't any "irq x: nobody cared" messages, and the crashkernel gets stuck
in the usual copying of IR tables from dmar, which suggests an ongoing interrupt
storm.

Nathan disabled the "kernel.hardlockup_panic = 1" sysctl, and managed to
reproduce the issue again, suggesting that we get stuck in kernel space for too
long without the ability for interrupts to be serviced.

It starts with the NIC hitting a tx queue timeout, followed by an NMI to unwind
the stack of each CPU, although the stacks don't appear to indicate where things
are stuck. The server then remains softlocked, and keeps unwinding stacks every
26 seconds or so, until it eventually hard locks up.

Full dmesg:
https://people.canonical.com/~mruffell/sf314568/1080TI_hardlockup.txt

The next interesting thing to report is when Nathan started the same Windows VM
on the primary host we have been debugging on, with the 8x 2080TI GPUs. Nathan
experienced a stuck VM, with the host responding just fine. When Nathan reset
the VM, he got 4x "irq xx: nobody cared" messages on IRQs 25, 27, 29 and 31,
which at the time corresponded to the PEX 8747 upstream PCI switches.
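
For anyone trying to pull the affected IRQ numbers out of a long dmesg, a
minimal sketch follows. It assumes the kernel message has the form
"irq <N>: nobody cared", as quoted in the subject line; the function name
nobody_cared_irqs is hypothetical and not part of any existing tool.

```python
import re
import sys

# Matches the kernel's spurious-interrupt report, e.g. "irq 31: nobody cared".
NOBODY_CARED = re.compile(r"irq (\d+): nobody cared")


def nobody_cared_irqs(dmesg_text):
    """Return the sorted, de-duplicated IRQ numbers reported in dmesg_text."""
    return sorted({int(m.group(1)) for m in NOBODY_CARED.finditer(dmesg_text)})


if __name__ == "__main__":
    # Assumed usage: dmesg | python3 nobody_cared.py
    print(nobody_cared_irqs(sys.stdin.read()))
```

Fed a saved copy of the dmesg, this would print the list of IRQ numbers, which
can then be matched against /proc/interrupts or lspci -v output.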

Interestingly, Nathan also observed 2x GPU Audio devices sharing the same IRQ
line as the upstream PCI switch, although Nathan mentioned this only occurred
very briefly, and the GPU audio devices were re-assigned different IRQs shortly
afterward.
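
Because the sharing was only visible briefly, it may help to snapshot
/proc/interrupts repeatedly and flag any line with more than one handler. The
sketch below is one way to do that, assuming the usual layout where handlers
sharing a line are listed comma-separated at the end of the row and handler
names contain no spaces. The function name shared_irqs is hypothetical.

```python
def shared_irqs(interrupts_text):
    """Map IRQ number -> handler names, for IRQs with more than one handler.

    Expects the text of /proc/interrupts. Rows without a numeric IRQ
    (the CPU header, NMI, LOC, ...) are skipped.
    """
    shared = {}
    for line in interrupts_text.splitlines():
        irq, sep, rest = line.partition(":")
        if not sep or not irq.strip().isdigit():
            continue
        if ", " not in rest:
            continue  # only one handler registered on this line
        head, *others = rest.split(", ")
        # The first handler is the last whitespace-separated field before
        # the first comma; the rest follow, one per comma.
        shared[int(irq)] = [head.split()[-1]] + [o.strip() for o in others]
    return shared


if __name__ == "__main__":
    with open("/proc/interrupts") as f:
        for irq, handlers in sorted(shared_irqs(f.read()).items()):
            print(irq, handlers)
```

Running this in a loop while resetting the VM would show whether the GPU audio
functions ever land on the same line as the switch upstream ports.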

Full dmesg:
https://paste.ubuntu.com/p/C2V4CY3yjZ/

Output showing upstream ports belonging to those IRQs:
https://paste.ubuntu.com/p/6fkSbyFNWT/

Full lspci:
https://paste.ubuntu.com/p/CTX5kbjpRP/

Let us know if you would like any additional debug information. As always, we
are happy to test patches out.

Thanks,
Matthew