linux-kernel - Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully initialized

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <aPKANja_k1gogTAU@google.com>
Date: Fri, 17 Oct 2025 10:43:18 -0700
From: Brian Norris <briannorris@...omium.org>
To: Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>
Cc: Lukas Wunner <lukas@...ner.de>, Bjorn Helgaas <bhelgaas@...gle.com>,
	LKML <linux-kernel@...r.kernel.org>, linux-pm@...r.kernel.org,
	"Rafael J . Wysocki" <rafael@...nel.org>, linux-pci@...r.kernel.org
Subject: Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully
 initialized

Hi Ilpo and Lukas,

I'll reply to both of you inline:

On Fri, Oct 17, 2025 at 02:49:35PM +0300, Ilpo Järvinen wrote:
> On Fri, 17 Oct 2025, Lukas Wunner wrote:
> 
> > [cc += Ilpo]
> > 
> > On Thu, Oct 16, 2025 at 03:53:35PM -0700, Brian Norris wrote:
> > > PCI devices are created via pci_scan_slot() and similar, and are
> > > promptly configured for runtime PM (pci_pm_init()). They are initially
> > > prevented from suspending by way of pm_runtime_forbid(); however, it's
> > > expected that user space may override this via sysfs [1].
> 
> Is this true as pm_runtime_forbid() also increases PM usage count?

Yes it's true. See below.

> "void pm_runtime_forbid(struct device *dev);
> 
> unset the power.runtime_auto flag for the device and increase its 
> usage counter (used by the /sys/devices/.../power/control interface to 
> effectively prevent the device from being power managed at run time)"

Right, but sysfs `echo auto > .../power/control` performs the inverse --
pm_runtime_allow() -- which decrements that count.

> > > Now, sometime after initial scan, a PCI device receives its BAR
> > > configuration (pci_assign_unassigned_bus_resources(), etc.).
> > > 
> > > If a PCI device is allowed to suspend between pci_scan_slot() and
> > > pci_assign_unassigned_bus_resources(), then pci-driver.c will
> > > save/restore incorrect BAR configuration for the device, and the device
> > > may cease to function.
> > > 
> > > This behavior races with user space, since user space may enable runtime
> > > PM [1] as soon as it sees the device, which may be before BAR
> > > configuration.
> > > 
> > > Prevent suspending in this intermediate state by holding a runtime PM
> > > reference until the device is fully initialized and ready for probe().
> >
> > Not sure if that is comprehensible by everybody.

Yeah, thanks for trying to clarify. After getting too far into the weeds
on a bug, I sometimes don't spend the appropriate time on writing a
simple problem description. Maybe I can do better on a v2.

> > The point is that
> > unbound devices are left in D0 but are nevertheless allowed to
> > (logically) runtime suspend.  And pci_pm_runtime_suspend() may call
> > pci_save_state() while config space isn't fully initialized yet,
> > or pci_pm_runtime_resume() may call pci_restore_state() (via
> > pci_pm_default_resume_early()) and overwrite initialized config space
> > with uninitialized data.

Ack.

> > Have you actually seen this happen in practice?

Yes, that's why I spent my time debugging and submitting this patch :)

> > Normally enumeration
> > happens during subsys_initcall time, when user space isn't running yet.
> > Hotplug may be an exception though.

Hotplug, rescan (e.g., when pwrctrl is in use, power may be stablished
later on, and it triggers a bus rescan; pwrctrl drivers can be modules),
or PCI controller drivers built as modules.

I happen to be using both pwrctrl and controller drivers as modules, so
I hit it that way.

> Adding that pm_runtime_get_noresume() doesn't look useful given 
> pm_runtime_forbid() already increases PM usage count. If this problem is 
> actually seen in practice, there could a bug elsewhere where something 
> decrements usage count too early so this change "helps" by double 
> incrementing the usage count.
> 
> To find more information what's going on, one could try to trace events 
> for the PM usage count (though last time I looked not all paths that 
> change PM usage count were covered by the event and adding the 
> trace_event() calls into the header turned out too much magic for me to 
> figure out so I couldn't solve the problem).

See above. forbid() is not a guaranteed blocker, because user space can
undo it.

> > Patch LGTM in principle, but adding Ilpo to cc who is refactoring PCI
> > resource allocation and may judge whether this can actually happen.
> 
> I can see there could be other failure modes besides just saving wrong 
> config if devices get suspended underneath the resource assignment 
> algorithm.
> 
> Besides hotplug, also BAR resize does changes the resources and BARs.
> This case is not helped by this patch.

Is that the 'resource_N_resize' sysfs attributes? Becuase that holds PM
references (pci_config_pm_runtime_{get,put}()) and therefore should not
generally have the same problem. pci-driver.c's runtime suspend will
save a new copy of the registers the next time we suspend after resize.

(Now, some drivers could have problems if they try to stash a static
copy via pci_store_saved_state()/pci_load_saved_state(), but that
invites plenty of its own problems anyway.)

> I also recently learned some DT platforms do the "actual" scan for the bus 
> later on Link Up event through irq which could perhaps occur late enough, 
> I dunno for sure.

Sure, but that'd be covered by my patch, as those (re)scans would
discover new devices in the same scan+add flow.

> > I think the code comments you're adding are a little verbose and a simple
> > /* acquired in pci_pm_init() */ in pci_bus_add_device() may be sufficient.
> 
> I'm also not entirely convinced these always pair

That's a very valid question. There are so many variations of scan+add
that it's been hard for me to tell. I've studied the code pretty
closely, and tested what I could, but I don't have hotplug systems on
hand, and I definitely could miss something.

FWIW, Rafael suggested/implied an alternative, where I could simply
delay pm_runtime_enable() until pci_bus_add_device(). That would dodge
the pairing questions, I think.

> or if the pci_dev may 
> get removed before pci_bus_add_device(), see e.g., enable_slot() in 
> hotplug/acpiphp_glue.c that does acpiphp_sanitize_bus() before 
> pci_bus_add_devices() (which could have other bugs too as is).

I believe it should be OK if a device is removed before the
pm_runtime_put_noidle() half of the pair. It just means the device gets
destroyed with a nonzero PM usage count, which is legal.

> > Also, I think it is neither necessary nor useful to actually cc the e-mail
> > to stable@...r.kernel.org if you include a stable designation in the
> > patch.  I believe stable maintainers only pick up backports from that list,
> > not patches intended for upstream.
> 
> Probably some tool will auto-insert the Cc: tags as receipients so it 
> might be non-trivial to get rid of it.

Yeah, git-send-email applies "Cc:" lines automatically, so I expect it's
very common for people to do that. I use U-Boot's patman, which wraps
git-send-email. I just assume stable@ folks expect that or are at least
used to it, because ... well, "Cc:" usually means "copy this on the
email" ...

git-send-email has --suppress-cc, and maybe I can convince my tools to
do that. (e.g., `git config sendemail.suppresscc cc`)

Brian