[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <08976178-298f-79d9-1d63-cff5a4e56cc3@linux.intel.com>
Date: Mon, 20 Oct 2025 18:56:41 +0300 (EEST)
From: Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>
To: Brian Norris <briannorris@...omium.org>
cc: Lukas Wunner <lukas@...ner.de>, Bjorn Helgaas <bhelgaas@...gle.com>,
LKML <linux-kernel@...r.kernel.org>, linux-pm@...r.kernel.org,
"Rafael J . Wysocki" <rafael@...nel.org>, linux-pci@...r.kernel.org
Subject: Re: [PATCH] PCI/PM: Prevent runtime suspend before devices are fully
initialized
On Fri, 17 Oct 2025, Brian Norris wrote:
> Hi Ilpo and Lukas,
>
> I'll reply to both of you inline:
I see you posted v2 but I'm answering here to have the context available.
Mostly things seem okay after your explanation, I think the only question
mark is a driver calling pci_resize_resource() directly to resize BARs.
> On Fri, Oct 17, 2025 at 02:49:35PM +0300, Ilpo Järvinen wrote:
> > On Fri, 17 Oct 2025, Lukas Wunner wrote:
> >
> > > [cc += Ilpo]
> > >
> > > On Thu, Oct 16, 2025 at 03:53:35PM -0700, Brian Norris wrote:
> > > > PCI devices are created via pci_scan_slot() and similar, and are
> > > > promptly configured for runtime PM (pci_pm_init()). They are initially
> > > > prevented from suspending by way of pm_runtime_forbid(); however, it's
> > > > expected that user space may override this via sysfs [1].
> >
> > Is this true as pm_runtime_forbid() also increases PM usage count?
>
> Yes it's true. See below.
>
> > "void pm_runtime_forbid(struct device *dev);
> >
> > unset the power.runtime_auto flag for the device and increase its
> > usage counter (used by the /sys/devices/.../power/control interface to
> > effectively prevent the device from being power managed at run time)"
>
> Right, but sysfs `echo auto > .../power/control` performs the inverse --
> pm_runtime_allow() -- which decrements that count.
Fair enough, I didn't check what it does.
IMO, the details about how the usage count behaves should be part of the
changelog as that documentation I quoted sounded like user control is
prevented when forbidden. I see you've put this part of the explanation
into the v2 as well so I suggest you explain the usage count in the change
so it is recorded in the commit if somebody has to look at this commit
years from now.
> > > > Now, sometime after initial scan, a PCI device receives its BAR
> > > > configuration (pci_assign_unassigned_bus_resources(), etc.).
> > > >
> > > > If a PCI device is allowed to suspend between pci_scan_slot() and
> > > > pci_assign_unassigned_bus_resources(), then pci-driver.c will
> > > > save/restore incorrect BAR configuration for the device, and the device
> > > > may cease to function.
> > > >
> > > > This behavior races with user space, since user space may enable runtime
> > > > PM [1] as soon as it sees the device, which may be before BAR
> > > > configuration.
> > > >
> > > > Prevent suspending in this intermediate state by holding a runtime PM
> > > > reference until the device is fully initialized and ready for probe().
> > >
> > > Not sure if that is comprehensible by everybody.
>
> Yeah, thanks for trying to clarify. After getting too far into the weeds
> on a bug, I sometimes don't spend the appropriate time on writing a
> simple problem description. Maybe I can do better on a v2.
>
> > > The point is that
> > > unbound devices are left in D0 but are nevertheless allowed to
> > > (logically) runtime suspend. And pci_pm_runtime_suspend() may call
> > > pci_save_state() while config space isn't fully initialized yet,
> > > or pci_pm_runtime_resume() may call pci_restore_state() (via
> > > pci_pm_default_resume_early()) and overwrite initialized config space
> > > with uninitialized data.
>
> Ack.
>
> > > Have you actually seen this happen in practice?
>
> Yes, that's why I spent my time debugging and submitting this patch :)
Thanks for doing it! :-)
> > > Normally enumeration
> > > happens during subsys_initcall time, when user space isn't running yet.
> > > Hotplug may be an exception though.
>
> Hotplug, rescan (e.g., when pwrctrl is in use, power may be stablished
> later on, and it triggers a bus rescan; pwrctrl drivers can be modules),
> or PCI controller drivers built as modules.
>
> I happen to be using both pwrctrl and controller drivers as modules, so
> I hit it that way.
>
> > Adding that pm_runtime_get_noresume() doesn't look useful given
> > pm_runtime_forbid() already increases PM usage count. If this problem is
> > actually seen in practice, there could a bug elsewhere where something
> > decrements usage count too early so this change "helps" by double
> > incrementing the usage count.
> >
> > To find more information what's going on, one could try to trace events
> > for the PM usage count (though last time I looked not all paths that
> > change PM usage count were covered by the event and adding the
> > trace_event() calls into the header turned out too much magic for me to
> > figure out so I couldn't solve the problem).
>
> See above. forbid() is not a guaranteed blocker, because user space can
> undo it.
>
> > > Patch LGTM in principle, but adding Ilpo to cc who is refactoring PCI
> > > resource allocation and may judge whether this can actually happen.
> >
> > I can see there could be other failure modes besides just saving wrong
> > config if devices get suspended underneath the resource assignment
> > algorithm.
> >
> > Besides hotplug, also BAR resize does changes the resources and BARs.
> > This case is not helped by this patch.
>
> Is that the 'resource_N_resize' sysfs attributes? Becuase that holds PM
> references (pci_config_pm_runtime_{get,put}()) and therefore should not
> generally have the same problem.
Okay, seem fine for the PCI core part.
Driver's can also trigger BAR resize by calling pci_resize_resource()
directly but I've no idea how the usage counts behave (TBH, PM isn't my
strongest forte even if Lukas pulled me in to comment).
> pci-driver.c's runtime suspend will
> save a new copy of the registers the next time we suspend after resize.
>
> (Now, some drivers could have problems if they try to stash a static
> copy via pci_store_saved_state()/pci_load_saved_state(), but that
> invites plenty of its own problems anyway.)
>
> > I also recently learned some DT platforms do the "actual" scan for the bus
> > later on Link Up event through irq which could perhaps occur late enough,
> > I dunno for sure.
>
> Sure, but that'd be covered by my patch, as those (re)scans would
> discover new devices in the same scan+add flow.
Okay, maybe it's fine like the rest. I was mostly trying to think
non-subsys_initcall() cases I knew of without contextualizing them back to
this patch.
> > > I think the code comments you're adding are a little verbose and a simple
> > > /* acquired in pci_pm_init() */ in pci_bus_add_device() may be sufficient.
> >
> > I'm also not entirely convinced these always pair
>
> That's a very valid question. There are so many variations of scan+add
> that it's been hard for me to tell.
I've noticed... unfortunately I find myself often in the same boat. :-/
> I've studied the code pretty
> closely, and tested what I could, but I don't have hotplug systems on
> hand, and I definitely could miss something.
>
> FWIW, Rafael suggested/implied an alternative, where I could simply
> delay pm_runtime_enable() until pci_bus_add_device(). That would dodge
> the pairing questions, I think.
Yes.
> > or if the pci_dev may
> > get removed before pci_bus_add_device(), see e.g., enable_slot() in
> > hotplug/acpiphp_glue.c that does acpiphp_sanitize_bus() before
> > pci_bus_add_devices() (which could have other bugs too as is).
>
> I believe it should be OK if a device is removed before the
> pm_runtime_put_noidle() half of the pair. It just means the device gets
> destroyed with a nonzero PM usage count, which is legal.
Ah yes, I think I was too attached to the "pairing" thought at that point
to see this isn't like e.g. a lock/unlock pair.
--
i.
Powered by blists - more mailing lists