linux-kernel - Re: [PROBLEM] c5.metal on AWS fails to kexec after "PCI: Explicitly put devices into D0 when initializing"


Message-ID: <222da706-19c5-485c-be90-2ebda20c1142@amd.com>
Date: Wed, 3 Dec 2025 23:29:42 -0600
From: Mario Limonciello <mario.limonciello@....com>
To: Matthew Ruffell <matthew.ruffell@...onical.com>
Cc: "bhelgaas@...gle.com" <bhelgaas@...gle.com>, linux-pci@...r.kernel.org,
 lkml <linux-kernel@...r.kernel.org>,
 Jay Vosburgh <jay.vosburgh@...onical.com>
Subject: Re: [PROBLEM] c5.metal on AWS fails to kexec after "PCI: Explicitly
 put devices into D0 when initializing"



On 12/3/2025 11:04 PM, Matthew Ruffell wrote:
> Hi Mario,
> 
> I thank you for your prompt reply, and apologise for my delayed reply.
> Answers inline.
> 
>> When you say AWS specific patches, can you be more specific?  What is
>> missing from a mainline kernel to use this hardware?  IE; how do I know
>> there aren't Ubuntu specific patches *causing* this issue.
> 
> I can reproduce the issue with the current HEAD of Linus's tree, with no
> additional patches applied. My current HEAD for testing is the 6.19 merge
> window, commit 51ab33fc0a8bef9454849371ef897a1241911b37.
> To get the mainline build to work on c5.metal on AWS I needed to edit a few
> config parameters, and I have attached the config I used.
> 
>> Now I've never used AWS - do you have an opportunity to do "regular"
>> reboots, or only kexec reboots?
>>
>> This issue only happens with a kexec reboot, right?
> 
> We can do regular and kexec reboots with the c5.metal instance type. The issue
> only happens with a kexec reboot.
> 
>> The first thing that jumps out at me is the code in
>> pci_device_shutdown() that clears bus mastering for a kexec reboot.
>> If you comment that out what happens?
> 
> I commented out the code that clears bus mastering, diff below, and kexec boots
> correctly now, and the NVME drive appears just as it did before
> "4d4c10f PCI: Explicitly put devices into D0 when initializing".
> 
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 302d61783f6c..0cb14ff32475 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -517,8 +517,9 @@ static void pci_device_shutdown(struct device *dev)
>           * If it is not a kexec reboot, firmware will hit the PCI
>           * devices with big hammer and stop their DMA any way.
>           */
> -       if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
> -               pci_clear_master(pci_dev);
> +/*     if (kexec_in_progress && (pci_dev->current_state <= PCI_D3hot))
> + *             pci_clear_master(pci_dev);
> + */
>   }
> 
>   #ifdef CONFIG_PM_SLEEP
> 
> Since this works, does that mean that the bus master bit isn't being set on the
> NVME device on the other side of kexec?

That's at least what it seems like.  And I guess trying to set D0 
without bus mastering enabled is causing a problem.
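
If you can get a shell in the kexec'd kernel, one way to confirm this 
from user space would be to read the Command register out of the 
device's sysfs config file (/sys/bus/pci/devices/<BDF>/config) and look 
at the Bus Master Enable bit.  A rough sketch, untested on c5.metal; the 
helper names read_pci_command() and bus_master_enabled() are made up for 
illustration and are not kernel or library API:

```c
#include <stdint.h>
#include <stdio.h>

#define PCI_COMMAND        0x04  /* offset of the 16-bit Command register */
#define PCI_COMMAND_MASTER 0x4   /* Bus Master Enable bit */

/*
 * Read the Command register from a PCI config space dump such as
 * /sys/bus/pci/devices/<BDF>/config.
 * Returns 0 on success, -1 if the file cannot be opened or is too short.
 */
static int read_pci_command(const char *config_path, uint16_t *cmd)
{
	unsigned char buf[2];
	FILE *f = fopen(config_path, "rb");

	if (!f)
		return -1;
	if (fseek(f, PCI_COMMAND, SEEK_SET) != 0 ||
	    fread(buf, 1, sizeof(buf), f) != sizeof(buf)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	/* Config space is little-endian. */
	*cmd = (uint16_t)(buf[0] | (buf[1] << 8));
	return 0;
}

/* Nonzero if Bus Master Enable is set in the given Command value. */
static int bus_master_enabled(uint16_t cmd)
{
	return (cmd & PCI_COMMAND_MASTER) != 0;
}
```

Comparing the result for the NVME device after a regular reboot and 
after a kexec reboot should show whether the bit really stays clear.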

Could you try adding a pci_set_master() call to pci_power_up()?  This is 
what I have in mind (only compile tested):

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b14dd064006c..68661e333032 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1323,6 +1323,7 @@ int pci_power_up(struct pci_dev *dev)
                 return -EIO;
         }

+       pci_set_master(dev);
         pci_read_config_word(dev, dev->pm_cap + PCI_PM_CTRL, &pmcsr);
         if (PCI_POSSIBLE_ERROR(pmcsr)) {
                 pci_err(dev, "Unable to change power state from %s to D0, device inaccessible\n",
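
To spell out the theory, here is a small user-space model of the 
sequence I suspect.  It is only a sketch: struct model_dev and the 
model_*() functions are hypothetical stand-ins, not kernel code, and it 
assumes the device keeps its Command register contents across the kexec 
(no reset in between).

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_COMMAND_MASTER 0x4   /* Bus Master Enable bit */

/* Power states, in the same order as the kernel's D0..D3cold values. */
enum model_state { MODEL_D0, MODEL_D1, MODEL_D2, MODEL_D3HOT, MODEL_D3COLD };

/* Hypothetical stand-in for struct pci_dev, for illustration only. */
struct model_dev {
	uint16_t command;               /* PCI Command register contents */
	enum model_state current_state;
};

/* Mirrors the check in pci_device_shutdown() quoted earlier. */
static void model_shutdown(struct model_dev *dev, bool kexec_in_progress)
{
	if (kexec_in_progress && dev->current_state <= MODEL_D3HOT)
		dev->command &= (uint16_t)~PCI_COMMAND_MASTER;
}

/*
 * Models the power-up path in the new kernel. With set_master == false
 * the bit cleared at shutdown stays clear; with set_master == true it is
 * set again before the device goes to D0, as in the diff above.
 */
static void model_power_up(struct model_dev *dev, bool set_master)
{
	if (set_master)
		dev->command |= PCI_COMMAND_MASTER;
	dev->current_state = MODEL_D0;
}
```

In this model a device shut down with kexec_in_progress set comes back 
in D0 with bus mastering still disabled unless the power-up path sets 
it, which is the situation the pci_set_master() call is meant to avoid.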

> 
>> The next thing I would wonder is if you're compiling with
>> CONFIG_KEXEC_JUMP and if that has an impact to your issue.  When this is
>> defined there is a device suspend sequence in kernel_kexec() that is run
>> which will run various suspend related callbacks.  Maybe the issue is
>> actually in one of those callbacks.
> 
> Yes, Ubuntu kernels set CONFIG_KEXEC_JUMP=y. I did a build with
> CONFIG_KEXEC_JUMP=n and it has the same symptoms.
> 
>> A possible way to determine this would be to run rtcwake to suspend and
>> resume and see if the drive survives.  If it doesn't, it's a hint that
>> there is something going on with power management in this drive or the
>> bridge it's connected to.  Maybe one of them isn't handling D3 very well.
> 
> Unfortunately, this c5.metal instance type doesn't support rtcwake with mode mem
> or disk, as hibernation is disabled on these instance types. But since
> CONFIG_KEXEC_JUMP=n doesn't help,
> 
> I'm going to add some debug statements to pci_device_shutdown() to see what
> state the NVME device is in with and without
> "4d4c10f PCI: Explicitly put devices into D0 when initializing".
> 
> Thanks,
> Matthew

Thanks for the updates.

I have a relatively ignorant question.  Can you reproduce this with kdump 
and a crash too?

I don't actually know: if you configure kdump and then crash the kernel 
(say, with the magic SysRq key), does pci_device_shutdown() get called as 
part of the kexec?  Or, because the kernel is already in a crash state, is 
there just a jump straight into the crash kernel image?