lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKAwkKvmdKxRRA4cR=jJEdyadon6uKXe+aFXaGSe=PNSgwDf9g@mail.gmail.com>
Date: Fri, 19 Sep 2025 15:52:33 +1200
From: Matthew Ruffell <matthew.ruffell@...onical.com>
To: mario.limonciello@....com, "bhelgaas@...gle.com" <bhelgaas@...gle.com>
Cc: linux-pci@...r.kernel.org, lkml <linux-kernel@...r.kernel.org>, 
	Jay Vosburgh <jay.vosburgh@...onical.com>
Subject: [PROBLEM] c5.metal on AWS fails to kexec after "PCI: Explicitly put
 devices into D0 when initializing"

Hi Mario, Bjorn,

I am debugging a kexec regression, and I could use some help please.

The AWS "c5.metal" instance type fails to kexec into another kernel, and gets
stuck during boot trying to mount the rootfs from the NVME drive, and then moves
at a glacier pace and never actually boots:

[   79.172085] EXT4-fs (nvme0n1p1): orphan cleanup on readonly fs
[   79.193407] EXT4-fs (nvme0n1p1): mounted filesystem
a4f7c460-5723-4ed1-9e86-04496bd66119 ro with ordered data mode. Quota
mode: none.
[  109.606598] systemd[1]: Inserted module 'autofs4'
[  139.786021] systemd[1]: systemd 257.9-0ubuntu1 running in system
mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +IPE +SMACK +SECCOMP +GCRYPT
-GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC
+KMOD +LIBCRYPTSETUP +LIBCRYPTSETUP_PLUGINS +LIBFDISK +PCRE2
+PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD
+BPF_FRAMEWORK +BTF -XKBCOMMON -UTMP +SYSVINIT +LIBARCHIVE)
[  139.943485] systemd[1]: Detected architecture x86-64.
[  169.994695] systemd[1]: Hostname set to <ip-172-31-48-167>.
[  170.102479] systemd[1]: bpf-restrict-fs: BPF LSM hook not enabled
in the kernel, BPF LSM not supported.
[  200.503000] systemd[1]: Queued start job for default target graphical.target.
[  200.550056] systemd[1]: Created slice system-modprobe.slice - Slice
/system/modprobe.
[  230.922947] systemd[1]: Created slice system-serial\x2dgetty.slice
- Slice /system/serial-getty.
[  261.131318] systemd[1]: Created slice system-systemd\x2dfsck.slice
- Slice /system/systemd-fsck.
[  291.338906] systemd[1]: Created slice user.slice - User and Session Slice.
[  321.546200] systemd[1]: Started systemd-ask-password-wall.path -
Forward Password Requests to Wall Directory Watch.

I bisected the issue, and the behaviour starts with:

commit 4d4c10f763d7808fbade28d83d237411603bca05
Author: Mario Limonciello <mario.limonciello@....com>
Date:  Wed Apr 23 23:31:32 2025 -0500
Subject: PCI: Explicitly put devices into D0 when initializing
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4d4c10f763d7808fbade28d83d237411603bca05

I also tried the follow up commit:

commit 907a7a2e5bf40c6a359b2f6cc53d6fdca04009e0
Author: Mario Limonciello <mario.limonciello@....com>
Date:  Wed Jun 11 18:31:16 2025 -0500
Subject: PCI/PM: Set up runtime PM even for devices without PCI PM
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=907a7a2e5bf40c6a359b2f6cc53d6fdca04009e0

and the behaviour still exists.

If I revert both from 6.17-rc3, as well as the downstream Ubuntu stable kernels,
the system kexec's successfully as normal.

lspci -vvv as root (nvme device)
https://paste.ubuntu.com/p/x7Zyjp8Brr/

lscpi -vvv as root (full output)
https://paste.ubuntu.com/p/NTdbByTqjR/

Strangely, the behaviour works like this:

Kernel without 4d4c10f76 -> kernel without 4d4c10f76 = success
Kernel without 4d4c10f76 -> kernel with 4d4c10f76 = success
Kernel with 4d4c10f76 -> kernel without 4d4c10f76 = failure
Kernel with 4d4c10f76 -> kernel with 4d4c10f76 = failure

Steps to reproduce:
1) On AWS, Launch a c5.metal instance type
2) Install a kernel with 4d4c10f76, note it might need AWS specific patches,
perhaps try a recent downstream distro kernel such as 6.17.0-1001-aws in Ubuntu
Questing with AMI ami-069b93def587ece0f
(ubuntu/images-testing/hvm-ssd-gp3/ubuntu-questing-daily-amd64-server-20250822)
with a full apt update && apt upgrade
3) sudo reboot, to get a fresh full boot. Note, this takes approx 17 minutes.
4) sudo apt install kexec-tools
5) kernel=6.17.0-1001-aws
kexec -l -t bzImage /boot/vmlinuz-$kernel
--initrd=/boot/initrd.img-$kernel --reuse-cmdline
kexec -e
6) On EC2 console, Actions > Monitor and troubleshoot > EC2 serial console,
and watch progress.

I am more than happy to try any patches / debug printk's etc.

Thanks,
Matthew

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ