Message-ID: <jGQc86Npv2BVcA61A7EPFQYcclIuxb07m-UqU0w22FA8_o3-0_xc6OQPp_CHDBZhId9acH4hyiOqki9w7Q0-WmuoVqsCoQfefaHNdfcV2ww=@ranguvar.io>
Date: Sat, 14 Dec 2024 06:32:57 +0000
From: Ranguvar <ranguvar@...guvar.io>
To: Peter Zijlstra <peterz@...radead.org>, "regressions@...ts.linux.dev" <regressions@...ts.linux.dev>
Cc: "regressions@...mhuis.info" <regressions@...mhuis.info>, "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>, "kvm@...r.kernel.org" <kvm@...r.kernel.org>
Subject: [REGRESSION][BISECTED] from bd9bbc96e835: cannot boot Win11 KVM guest

Hello, all,

Any assistance with proper format and process is appreciated as I am new to these lists.
After the commit bd9bbc96e835 "sched: Rework dl_server" I am no longer able to boot my Windows 11 23H2 guest using pinned/exclusive CPU cores and passing a PCIe graphics card.
This setup has worked for me since at least 5.10, likely earlier, with minimal changes.

Most or all cores assigned to the guest VM report 100% usage, and many tasks on the host hang indefinitely (10+ minutes) until the guest is forcibly stopped.
This happens only once the Windows kernel begins loading - its spinner appears and freezes.

Still broken on 6.13-rc2, as well as 6.12.4 from Arch's repository.
When testing these, the failure is similar, but tasks on the host are slow to execute instead of stalling indefinitely, and hung tasks are not reported in dmesg. Only one guest core may show 100% utilization instead of many or all of them. This seems to be due to a separate regression which also impacts my use case [0].
After patching it [1], I then find the same behavior as bd9bbc96e835, with hung tasks on host.

git bisect log: [2]
dmesg from 6.11.0-rc1-1-git-00057-gbd9bbc96e835, with decoded hung task backtraces: [3]
dmesg from arch 6.12.4: [4]
dmesg from arch 6.12.4 patched for svm.c regression, has hung tasks, backtraces could not be decoded: [5]
config for 6.11.0-rc1-1-git-00057-gbd9bbc96e835: [6]
config for arch 6.12.4: [7]

If it helps, my host uses an AMD Ryzen 5950X CPU with latest UEFI and AMD WX 5100 (Polaris, GCN 4.0) PCIe graphics.
I use libvirt 10.10 and qemu 9.1.2, and I am passing three PCIe devices each from dedicated IOMMU groups: NVIDIA RTX 3090 graphics, a Renesas uPD720201 USB controller, and a Samsung 970 EVO NVMe disk.

I have in kernel cmdline `iommu=pt isolcpus=1-7,17-23 rcu_nocbs=1-7,17-23 nohz_full=1-7,17-23`.
Removing iommu=pt does not produce a change, and dropping the core isolation freezes the host on VM startup.
Enabling/disabling kvm_amd.nested or kvm.enable_virt_at_load did not produce a change.
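
For anyone reading the CPU lists in the cmdline above, here is a minimal Python sketch that expands "1-7,17-23" into individual CPU numbers and lists what is left for the host. The helper name expand_cpulist and the total of 32 logical CPUs (5950X with SMT enabled) are assumptions for illustration, not something taken from the kernel or from my logs. It handles only plain numbers and "a-b" ranges, not every form the kernel accepts.

```python
def expand_cpulist(spec: str) -> list[int]:
    """Expand a simple CPU list such as "1-7,17-23" into sorted CPU numbers.

    Hypothetical helper: only plain numbers and "a-b" ranges are handled.
    """
    cpus: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            cpus.update(range(start, end + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


# The list used for isolcpus=, rcu_nocbs= and nohz_full= in the report.
isolated = expand_cpulist("1-7,17-23")

# Assumption: 32 logical CPUs (16 cores with SMT enabled), numbered 0-31.
TOTAL_CPUS = 32
housekeeping = [cpu for cpu in range(TOTAL_CPUS) if cpu not in isolated]

print("isolated:", isolated)
print("housekeeping:", housekeeping)
```

Under that assumption, 14 logical CPUs are isolated and 18 remain for the host.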

Thank you for your attention.
- Devin

#regzbot introduced: bd9bbc96e8356886971317f57994247ca491dbf1

[0]: https://lore.kernel.org/regressions/52914da7-a97b-45ad-86a0-affdf8266c61@mailbox.org/
[1]: https://lore.kernel.org/regressions/376c445a-9437-4bdd-9b67-e7ce786ae2c4@mailbox.org/
[2]: https://ranguvar.io/pub/paste/linux-6.12-vm-regression/bisect.log
[3]: https://ranguvar.io/pub/paste/linux-6.12-vm-regression/dmesg-6.11.0-rc1-1-git-00057-gbd9bbc96e835-decoded.log
[4]: https://ranguvar.io/pub/paste/linux-6.12-vm-regression/dmesg-6.12.4-arch1-1.log
[5]: https://ranguvar.io/pub/paste/linux-6.12-vm-regression/dmesg-6.12.4-arch1-1-patched.log
[6]: https://ranguvar.io/pub/paste/linux-6.12-vm-regression/config-6.11.0-rc1-1-git-00057-gbd9bbc96e835
[7]: https://gitlab.archlinux.org/archlinux/packaging/packages/linux/-/raw/6.12.4.arch1-1/config
