Message-ID: <222da706-19c5-485c-be90-2ebda20c1142@amd.com>
Date: Wed, 3 Dec 2025 23:29:42 -0600
From: Mario Limonciello <mario.limonciello@....com>
To: Matthew Ruffell <matthew.ruffell@...onical.com>
Cc: "bhelgaas@...gle.com" <bhelgaas@...gle.com>, linux-pci@...r.kernel.org,
lkml <linux-kernel@...r.kernel.org>,
Jay Vosburgh <jay.vosburgh@...onical.com>
Subject: Re: [PROBLEM] c5.metal on AWS fails to kexec after "PCI: Explicitly put devices into D0 when initializing"

On 12/3/2025 11:04 PM, Matthew Ruffell wrote:
> Hi Mario,
>
> I thank you for your prompt reply, and apologise for my delayed reply.
> Answers inline.
>
>> When you say AWS specific patches, can you be more specific? What is
>> missing from a mainline kernel to use this hardware? I.e., how do I know
>> there aren't Ubuntu specific patches *causing* this issue?
>
> I can reproduce the issue with the current HEAD of Linus's tree, with no
> additional patches applied. My current HEAD for testing is the 6.19 merge
> window, commit 51ab33fc0a8bef9454849371ef897a1241911b37.
> To get the mainline build to work on c5.metal on AWS I needed to edit a few
> config parameters, and I have attached the config I used.
>
>> Now I've never used AWS - do you have an opportunity to do "regular"
>> reboots, or only kexec reboots?
>>
>> This issue only happens with a kexec reboot, right?
>
> We can do regular and kexec reboots with the c5.metal instance type. The issue
> only happens with a kexec reboot.
>
>> The first thing that jumps out at me is the code in
>> pci_device_shutdown() that clears bus mastering for a kexec reboot.
>> If you comment that out what happens?
>
> I commented out the code that clears bus mastering, diff below, and kexec boots
> correctly now, and the NVME drive appears just as it did before
> "4d4c10f PCI: Explicitly put devices into D0 when initializing".
>
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 302d61783f6c..0cb14ff32475 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -517,8 +517,9 @@ static void pci_device_shutdown(struct device *dev)
>  	 * If it is not a kexec reboot, firmware will hit the PCI
>  	 * devices with big hammer and stop their DMA any way.
>  	 */
> -	if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
> -		pci_clear_master(pci_dev);
> +/*	if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
> + *		pci_clear_master(pci_dev);
> + */
>  }
>
> #ifdef CONFIG_PM_SLEEP
>
> Since this works, does that mean that the bus master bit isn't being set on the
> NVME device on the other side of kexec?

That's at least what it seems like. And I guess trying to set D0
without bus mastering enabled is causing a problem.
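
For reference, the bit under discussion is Bus Master Enable in the PCI
Command register. The following is only a user-space sketch of what
pci_set_master() and pci_clear_master() amount to at the register level.
The register offset and bit value match include/uapi/linux/pci_regs.h;
the helper names (cmd_bus_master_enabled, cmd_set_bus_master,
cfg_read_command) are hypothetical and are not kernel APIs.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in include/uapi/linux/pci_regs.h. */
#define PCI_COMMAND        0x04 /* offset of the 16-bit Command register */
#define PCI_COMMAND_MASTER 0x4  /* Bus Master Enable bit */

/* Hypothetical helper: is Bus Master Enable set in this Command value? */
static inline bool cmd_bus_master_enabled(uint16_t cmd)
{
	return (cmd & PCI_COMMAND_MASTER) != 0;
}

/*
 * Hypothetical helper: models the register-level effect of
 * pci_set_master() (enable = true) or pci_clear_master() (enable = false).
 * All other Command bits are left untouched.
 */
static inline uint16_t cmd_set_bus_master(uint16_t cmd, bool enable)
{
	return enable ? (uint16_t)(cmd | PCI_COMMAND_MASTER)
		      : (uint16_t)(cmd & ~PCI_COMMAND_MASTER);
}

/*
 * Hypothetical helper: extract the Command register from a raw dump of
 * config space. Config space is little-endian.
 */
static inline uint16_t cfg_read_command(const uint8_t *cfg)
{
	return (uint16_t)(cfg[PCI_COMMAND] | (cfg[PCI_COMMAND + 1] << 8));
}
```

Assuming the first bytes of /sys/bus/pci/devices/<BDF>/config are read
into a buffer in the kexec'ed kernel, cfg_read_command() followed by
cmd_bus_master_enabled() would show whether the bit survived the handover.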

Could you try adding a pci_set_master() call to pci_power_up()? This is
what I have in mind (only compile tested):

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b14dd064006c..68661e333032 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1323,6 +1323,7 @@ int pci_power_up(struct pci_dev *dev)
 		return -EIO;
 	}
+	pci_set_master(dev);
 	pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
 	if (PCI_POSSIBLE_ERROR(pmcsr)) {
 		pci_err(dev, "Unable to change power state from %s to D0, device inaccessible\n",
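
To make the context lines of that hunk easier to follow: the read that
comes right after the proposed call fetches the PM Control/Status
register (PMCSR) and treats an all-ones value as "device inaccessible".
Below is a user-space sketch of that decoding only. The state mask
matches PCI_PM_CTRL_STATE_MASK in pci_regs.h; the helper names are
hypothetical, and the all-ones test is an assumption that mirrors what
PCI_POSSIBLE_ERROR() does for a 16-bit value.

```c
#include <stdbool.h>
#include <stdint.h>

/* From include/uapi/linux/pci_regs.h: low two PMCSR bits hold D0..D3hot. */
#define PCI_PM_CTRL_STATE_MASK 0x0003

/*
 * Hypothetical helper: a config read that returns all ones usually means
 * the device did not respond (assumed equivalent of PCI_POSSIBLE_ERROR()
 * for a u16).
 */
static inline bool pmcsr_possible_error(uint16_t pmcsr)
{
	return pmcsr == 0xffff;
}

/* Hypothetical helper: power state encoded in PMCSR, 0 (D0) .. 3 (D3hot). */
static inline int pmcsr_power_state(uint16_t pmcsr)
{
	return pmcsr & PCI_PM_CTRL_STATE_MASK;
}

/* Hypothetical helper: printable name for the state encoded in PMCSR. */
static inline const char *pmcsr_state_name(uint16_t pmcsr)
{
	static const char *const names[] = { "D0", "D1", "D2", "D3hot" };

	if (pmcsr_possible_error(pmcsr))
		return "inaccessible";
	return names[pmcsr_power_state(pmcsr)];
}
```

If the debug output on the failing boot shows PMCSR reading back as
0xffff, that would point at the device or its upstream bridge not
responding at all, rather than at the device sitting in a low-power state.
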
>
>> The next thing I would wonder is whether you're compiling with
>> CONFIG_KEXEC_JUMP and whether that has an impact on your issue. When this
>> is defined, kernel_kexec() runs a device suspend sequence that invokes
>> various suspend-related callbacks. Maybe the issue is actually in one of
>> those callbacks.
>
> Yes, Ubuntu kernels set CONFIG_KEXEC_JUMP=y. I did a build with
> CONFIG_KEXEC_JUMP=n and it has the same symptoms.
>
>> A possible way to determine this would be to run rtcwake to suspend and
>> resume and see if the drive survives. If it doesn't, it's a hint that
>> there is something going on with power management in this drive or the
>> bridge it's connected to. Maybe one of them isn't handling D3 very well.
>
> Unfortunately, this c5.metal instance type doesn't support rtcwake with mode mem
> or disk, as hibernation is disabled on these instance types. But since
> CONFIG_KEXEC_JUMP=n doesn't help,
>
> I'm going to add some debug statements to pci_device_shutdown() to see what
> state the NVME device is in with and without
> "4d4c10f PCI: Explicitly put devices into D0 when initializing".
>
> Thanks,
> Matthew

Thanks for the updates.

I have a relatively ignorant question: can you reproduce with kdump and
a crash too?

I don't actually know the answer to this: if you configure kdump and then
crash the kernel (say, with the magic SysRq key), does pci_device_shutdown()
get called in order to do the kexec? Or, because the kernel is already in a
crash state, is there just a jump into the crash kernel image location?