Message-ID: <CAKAwkKvmdKxRRA4cR=jJEdyadon6uKXe+aFXaGSe=PNSgwDf9g@mail.gmail.com>
Date: Fri, 19 Sep 2025 15:52:33 +1200
From: Matthew Ruffell <matthew.ruffell@...onical.com>
To: mario.limonciello@....com, "bhelgaas@...gle.com" <bhelgaas@...gle.com>
Cc: linux-pci@...r.kernel.org, lkml <linux-kernel@...r.kernel.org>,
Jay Vosburgh <jay.vosburgh@...onical.com>
Subject: [PROBLEM] c5.metal on AWS fails to kexec after "PCI: Explicitly put
devices into D0 when initializing"
Hi Mario, Bjorn,
I am debugging a kexec regression, and I could use some help please.
The AWS "c5.metal" instance type fails to kexec into another kernel. It gets
stuck during boot while mounting the rootfs from the NVMe drive, then proceeds
at a glacial pace and never finishes booting:
[ 79.172085] EXT4-fs (nvme0n1p1): orphan cleanup on readonly fs
[ 79.193407] EXT4-fs (nvme0n1p1): mounted filesystem
a4f7c460-5723-4ed1-9e86-04496bd66119 ro with ordered data mode. Quota
mode: none.
[ 109.606598] systemd[1]: Inserted module 'autofs4'
[ 139.786021] systemd[1]: systemd 257.9-0ubuntu1 running in system
mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +IPE +SMACK +SECCOMP +GCRYPT
-GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC
+KMOD +LIBCRYPTSETUP +LIBCRYPTSETUP_PLUGINS +LIBFDISK +PCRE2
+PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD
+BPF_FRAMEWORK +BTF -XKBCOMMON -UTMP +SYSVINIT +LIBARCHIVE)
[ 139.943485] systemd[1]: Detected architecture x86-64.
[ 169.994695] systemd[1]: Hostname set to <ip-172-31-48-167>.
[ 170.102479] systemd[1]: bpf-restrict-fs: BPF LSM hook not enabled
in the kernel, BPF LSM not supported.
[ 200.503000] systemd[1]: Queued start job for default target graphical.target.
[ 200.550056] systemd[1]: Created slice system-modprobe.slice - Slice
/system/modprobe.
[ 230.922947] systemd[1]: Created slice system-serial\x2dgetty.slice
- Slice /system/serial-getty.
[ 261.131318] systemd[1]: Created slice system-systemd\x2dfsck.slice
- Slice /system/systemd-fsck.
[ 291.338906] systemd[1]: Created slice user.slice - User and Session Slice.
[ 321.546200] systemd[1]: Started systemd-ask-password-wall.path -
Forward Password Requests to Wall Directory Watch.
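The timestamps in the log above advance in steps of roughly 30 seconds between
bursts of messages. A minimal sketch for pulling those gaps out of a saved
console log follows; the regex, the function name timestamp_gaps and the
10 second threshold are my own choices for illustration, not part of any
existing tool, and the sample lines are copied from the log above:

```python
import re

# Matches the leading "[ seconds.microseconds]" kernel log timestamp.
TIMESTAMP_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")


def timestamp_gaps(lines, min_gap=10.0):
    """Return (gap_seconds, line) for each stamped line that arrives at
    least min_gap seconds after the previous stamped line."""
    gaps = []
    previous = None
    for line in lines:
        match = TIMESTAMP_RE.match(line)
        if match is None:
            continue  # wrapped continuation line without a timestamp
        current = float(match.group(1))
        if previous is not None and current - previous >= min_gap:
            gaps.append((round(current - previous, 1), line.strip()))
        previous = current
    return gaps


# A few lines taken from the console output above.
sample = [
    "[ 79.193407] EXT4-fs (nvme0n1p1): mounted filesystem",
    "[ 109.606598] systemd[1]: Inserted module 'autofs4'",
    "[ 139.786021] systemd[1]: systemd 257.9-0ubuntu1 running in system",
    "[ 139.943485] systemd[1]: Detected architecture x86-64.",
    "[ 169.994695] systemd[1]: Hostname set to <ip-172-31-48-167>.",
]

for gap, line in timestamp_gaps(sample):
    print(f"{gap:5.1f}s before: {line}")
```

On the sample lines this reports three stalls of about 30 seconds each.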
I bisected the issue, and the behaviour starts with:
commit 4d4c10f763d7808fbade28d83d237411603bca05
Author: Mario Limonciello <mario.limonciello@....com>
Date: Wed Apr 23 23:31:32 2025 -0500
Subject: PCI: Explicitly put devices into D0 when initializing
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4d4c10f763d7808fbade28d83d237411603bca05
I also tried the follow up commit:
commit 907a7a2e5bf40c6a359b2f6cc53d6fdca04009e0
Author: Mario Limonciello <mario.limonciello@....com>
Date: Wed Jun 11 18:31:16 2025 -0500
Subject: PCI/PM: Set up runtime PM even for devices without PCI PM
Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=907a7a2e5bf40c6a359b2f6cc53d6fdca04009e0
and the behaviour still exists.
If I revert both commits, from 6.17-rc3 as well as from the downstream Ubuntu
stable kernels, the system kexecs successfully as normal.
lspci -vvv as root (nvme device)
https://paste.ubuntu.com/p/x7Zyjp8Brr/
lspci -vvv as root (full output)
https://paste.ubuntu.com/p/NTdbByTqjR/
Strangely, the outcome depends on the kernel combination
(running kernel -> kexec'd kernel) like this:
Kernel without 4d4c10f76 -> kernel without 4d4c10f76 = success
Kernel without 4d4c10f76 -> kernel with 4d4c10f76 = success
Kernel with 4d4c10f76 -> kernel without 4d4c10f76 = failure
Kernel with 4d4c10f76 -> kernel with 4d4c10f76 = failure
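Put differently, in these four runs the result tracks only the kernel that
performs the kexec, not the kernel being booted. A small sketch that encodes
the table and checks this reading (the dictionary layout and the function name
are hypothetical, made up for illustration):

```python
# Key: (outgoing kernel has 4d4c10f76, incoming kernel has 4d4c10f76)
# Value: True if the kexec succeeded, as listed in the table above.
observations = {
    (False, False): True,
    (False, True): True,
    (True, False): False,
    (True, True): False,
}


def fails_iff_outgoing_has_commit(results):
    """True if every run succeeded exactly when the outgoing kernel
    lacked the commit, regardless of the incoming kernel."""
    return all(
        succeeded == (not outgoing_has_commit)
        for (outgoing_has_commit, _incoming), succeeded in results.items()
    )


print(fails_iff_outgoing_has_commit(observations))
```

This only restates the four observations; it says nothing about why the
outgoing kernel matters.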
Steps to reproduce:
1) On AWS, launch a c5.metal instance type
2) Install a kernel with 4d4c10f76, note it might need AWS specific patches,
perhaps try a recent downstream distro kernel such as 6.17.0-1001-aws in Ubuntu
Questing with AMI ami-069b93def587ece0f
(ubuntu/images-testing/hvm-ssd-gp3/ubuntu-questing-daily-amd64-server-20250822)
with a full apt update && apt upgrade
3) sudo reboot, to get a fresh full boot. Note, this takes approx 17 minutes.
4) sudo apt install kexec-tools
5) kernel=6.17.0-1001-aws
kexec -l -t bzImage /boot/vmlinuz-$kernel \
    --initrd=/boot/initrd.img-$kernel --reuse-cmdline
kexec -e
6) On the EC2 console, go to Actions > Monitor and troubleshoot > EC2 serial
console, and watch progress.
I am more than happy to try any patches / debug printk's etc.
Thanks,
Matthew