Message-ID: <cecdf440-ec7b-4d7f-9121-cf44332702d4@amd.com>
Date: Fri, 19 Sep 2025 00:02:22 -0500
From: Mario Limonciello <mario.limonciello@....com>
To: Matthew Ruffell <matthew.ruffell@...onical.com>,
"bhelgaas@...gle.com" <bhelgaas@...gle.com>
Cc: linux-pci@...r.kernel.org, lkml <linux-kernel@...r.kernel.org>,
Jay Vosburgh <jay.vosburgh@...onical.com>
Subject: Re: [PROBLEM] c5.metal on AWS fails to kexec after "PCI: Explicitly
put devices into D0 when initializing"
On 9/18/2025 10:52 PM, Matthew Ruffell wrote:
> Hi Mario, Bjorn,
>
> I am debugging a kexec regression, and I could use some help please.
>
> The AWS "c5.metal" instance type fails to kexec into another kernel, and gets
> stuck during boot trying to mount the rootfs from the NVMe drive, and then moves
> at a glacial pace and never actually boots:
>
> [ 79.172085] EXT4-fs (nvme0n1p1): orphan cleanup on readonly fs
> [ 79.193407] EXT4-fs (nvme0n1p1): mounted filesystem
> a4f7c460-5723-4ed1-9e86-04496bd66119 ro with ordered data mode. Quota
> mode: none.
> [ 109.606598] systemd[1]: Inserted module 'autofs4'
> [ 139.786021] systemd[1]: systemd 257.9-0ubuntu1 running in system
> mode (+PAM +AUDIT +SELINUX +APPARMOR +IMA +IPE +SMACK +SECCOMP +GCRYPT
> -GNUTLS +OPENSSL +ACL +BLKID +CURL +ELFUTILS +FIDO2 +IDN2 -IDN +IPTC
> +KMOD +LIBCRYPTSETUP +LIBCRYPTSETUP_PLUGINS +LIBFDISK +PCRE2
> +PWQUALITY +P11KIT +QRENCODE +TPM2 +BZIP2 +LZ4 +XZ +ZLIB +ZSTD
> +BPF_FRAMEWORK +BTF -XKBCOMMON -UTMP +SYSVINIT +LIBARCHIVE)
> [ 139.943485] systemd[1]: Detected architecture x86-64.
> [ 169.994695] systemd[1]: Hostname set to <ip-172-31-48-167>.
> [ 170.102479] systemd[1]: bpf-restrict-fs: BPF LSM hook not enabled
> in the kernel, BPF LSM not supported.
> [ 200.503000] systemd[1]: Queued start job for default target graphical.target.
> [ 200.550056] systemd[1]: Created slice system-modprobe.slice - Slice
> /system/modprobe.
> [ 230.922947] systemd[1]: Created slice system-serial\x2dgetty.slice
> - Slice /system/serial-getty.
> [ 261.131318] systemd[1]: Created slice system-systemd\x2dfsck.slice
> - Slice /system/systemd-fsck.
> [ 291.338906] systemd[1]: Created slice user.slice - User and Session Slice.
> [ 321.546200] systemd[1]: Started systemd-ask-password-wall.path -
> Forward Password Requests to Wall Directory Watch.
>
> I bisected the issue, and the behaviour starts with:
>
> commit 4d4c10f763d7808fbade28d83d237411603bca05
> Author: Mario Limonciello <mario.limonciello@....com>
> Date: Wed Apr 23 23:31:32 2025 -0500
> Subject: PCI: Explicitly put devices into D0 when initializing
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4d4c10f763d7808fbade28d83d237411603bca05
>
> I also tried the follow up commit:
>
> commit 907a7a2e5bf40c6a359b2f6cc53d6fdca04009e0
> Author: Mario Limonciello <mario.limonciello@....com>
> Date: Wed Jun 11 18:31:16 2025 -0500
> Subject: PCI/PM: Set up runtime PM even for devices without PCI PM
> Link: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=907a7a2e5bf40c6a359b2f6cc53d6fdca04009e0
>
> and the behaviour still exists.
>
> If I revert both from 6.17-rc3, as well as the downstream Ubuntu stable kernels,
> the system kexec's successfully as normal.
>
> lspci -vvv as root (nvme device)
> https://paste.ubuntu.com/p/x7Zyjp8Brr/
>
> lspci -vvv as root (full output)
> https://paste.ubuntu.com/p/NTdbByTqjR/
>
> Strangely, the behaviour works like this:
>
> Kernel without 4d4c10f76 -> kernel without 4d4c10f76 = success
> Kernel without 4d4c10f76 -> kernel with 4d4c10f76 = success
> Kernel with 4d4c10f76 -> kernel without 4d4c10f76 = failure
> Kernel with 4d4c10f76 -> kernel with 4d4c10f76 = failure
>
> Steps to reproduce:
> 1) On AWS, Launch a c5.metal instance type
> 2) Install a kernel with 4d4c10f76, note it might need AWS specific patches,
> perhaps try a recent downstream distro kernel such as 6.17.0-1001-aws in Ubuntu
> Questing with AMI ami-069b93def587ece0f
> (ubuntu/images-testing/hvm-ssd-gp3/ubuntu-questing-daily-amd64-server-20250822)
> with a full apt update && apt upgrade
> 3) sudo reboot, to get a fresh full boot. Note, this takes approx 17 minutes.
> 4) sudo apt install kexec-tools
> 5) kernel=6.17.0-1001-aws
> kexec -l -t bzImage /boot/vmlinuz-$kernel
> --initrd=/boot/initrd.img-$kernel --reuse-cmdline
> kexec -e
> 6) On EC2 console, Actions > Monitor and troubleshoot > EC2 serial console,
> and watch progress.
>
> I am more than happy to try any patches / debug printk's etc.
>
> Thanks,
> Matthew
When you say AWS-specific patches, can you be more specific? What is
missing from a mainline kernel to use this hardware? I.e., how do I know
there aren't Ubuntu-specific patches *causing* this issue?
I just glanced through an Ubuntu kernel tree log and there are a ton of
"UBUNTU: SAUCE: PCI" patches. I didn't investigate any of these beyond
a cursory look at the subsystem though, so I have no idea whether they
have any bearing on this issue.
I remember a while back Ubuntu carried a patch that could break a
regular shutdown and never made it upstream. I don't know what happened
with that either.
So I don't doubt you when you say
4d4c10f763d7808fbade28d83d237411603bca05 and
907a7a2e5bf40c6a359b2f6cc53d6fdca04009e0 caused an issue, but I just
want to rule out a bad interaction with other patches. If it would be
possible to reproduce this issue on a mainline kernel (say 6.17-rc6), it
might be easier for Bjorn or me to look at.
Now I've never used AWS - do you have an opportunity to do "regular"
reboots, or only kexec reboots?
This issue only happens with a kexec reboot, right?
The first thing that jumps out at me is the code in
pci_device_shutdown() that clears bus mastering for a kexec reboot.
If you comment that out, what happens?
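For reference, the hunk I mean is at the end of pci_device_shutdown() in
drivers/pci/pci-driver.c; from memory it is roughly this shape (please
double check against your tree):

static void pci_device_shutdown(struct device *dev)
{
	struct pci_dev *pci_dev = to_pci_dev(dev);
	struct pci_driver *drv = pci_dev->driver;

	pm_runtime_resume(dev);

	if (drv && drv->shutdown)
		drv->shutdown(pci_dev);

	/*
	 * If this is a kexec reboot, turn off Bus Master so the device
	 * stops doing DMA.  This is the part to try commenting out.
	 */
	if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
		pci_clear_master(pci_dev);
}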
The next thing I would wonder is whether you're compiling with
CONFIG_KEXEC_JUMP and whether that has an impact on your issue. When
this is defined, there is a device suspend sequence in kernel_kexec()
that runs various suspend-related callbacks. Maybe the issue is actually
in one of those callbacks.
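From memory, that path in kernel/kexec_core.c is shaped roughly like
this (note that in the trees I remember it is also gated on the image's
preserve_context flag, so it may not even be taken for a plain
kexec -l / kexec -e; worth confirming on your build):

int kernel_kexec(void)
{
	/* ... */
#ifdef CONFIG_KEXEC_JUMP
	if (kexec_image->preserve_context) {
		pm_prepare_console();
		error = freeze_processes();
		/* ... */
		error = dpm_suspend_start(PMSG_FREEZE);	/* device suspend callbacks */
		/* ... */
		error = dpm_suspend_end(PMSG_FREEZE);
		/* ... */
	} else
#endif
	{
		kexec_in_progress = true;
		kernel_restart_prepare("kexec reboot");
		/* ... */
		machine_shutdown();
	}

	machine_kexec(kexec_image);
	/* ... */
}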
A possible way to determine this would be to run rtcwake to suspend and
resume and see if the drive survives. If it doesn't, it's a hint that
there is something going on with power management in this drive or the
bridge it's connected to. Maybe one of them isn't handling D3 very well.
If there is a power management problem with the disk (or the bridge),
you can try adding PCI_DEV_FLAGS_NO_D3 to the NVMe disk.
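A minimal sketch of what I mean, assuming you drop it into
drivers/pci/quirks.c (the 0x1d0f/0xcd01 IDs below are placeholders;
substitute the vendor/device IDs of the affected controller from your
lspci output):

/* Test quirk: keep this NVMe controller out of D3. */
static void quirk_nvme_no_d3(struct pci_dev *dev)
{
	dev->dev_flags |= PCI_DEV_FLAGS_NO_D3;
	pci_info(dev, "test quirk: disallowing D3\n");
}
DECLARE_PCI_FIXUP_FINAL(0x1d0f, 0xcd01, quirk_nvme_no_d3);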