linux-kernel - Re: [PATCH v4] PCI/PM: Prevent runtime suspend before devices are fully initialized

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <0e35a4e1-894a-47c1-9528-fc5ffbafd9e2@samsung.com>
Date: Wed, 14 Jan 2026 10:46:41 +0100
From: Marek Szyprowski <m.szyprowski@...sung.com>
To: Bjorn Helgaas <helgaas@...nel.org>, Brian Norris
	<briannorris@...omium.org>
Cc: Bjorn Helgaas <bhelgaas@...gle.com>, Lukas Wunner <lukas@...ner.de>,
	linux-kernel@...r.kernel.org, linux-pci@...r.kernel.org,
	linux-pm@...r.kernel.org, "Rafael J . Wysocki" <rafael@...nel.org>,
	Ilpo Järvinen <ilpo.jarvinen@...ux.intel.com>
Subject: Re: [PATCH v4] PCI/PM: Prevent runtime suspend before devices are
 fully initialized

On 06.01.2026 23:27, Bjorn Helgaas wrote:
> On Thu, Oct 23, 2025 at 02:09:01PM -0700, Brian Norris wrote:
>> Today, it's possible for a PCI device to be created and
>> runtime-suspended before it is fully initialized. When that happens, the
>> device will remain in D0, but the suspend process may save an
>> intermediate version of that device's state -- for example, without
>> appropriate BAR configuration. When the device later resumes, we'll
>> restore invalid PCI state and the device may not function.
>>
>> Prevent runtime suspend for PCI devices by deferring pm_runtime_enable()
>> until we've fully initialized the device.
>>
>> More details on how exactly this may occur:
>>
>> 1. PCI device is created by pci_scan_slot() or similar
>> 2. As part of pci_scan_slot(), pci_pm_init() enables runtime PM; the
>>     device starts "active" and we initially prevent (pm_runtime_forbid())
>>     suspend -- but see [*] footnote
>> 3. Underlying 'struct device' is added to the system (device_add());
>>     runtime PM can now be configured by user space
>> 4. PCI device receives BAR configuration
>>     (pci_assign_unassigned_bus_resources(), etc.)
>> 5. PCI device is added to the system in pci_bus_add_device()
>>
>> The device may potentially suspend between #3 and #4.
>>
>> [*] By default, pm_runtime_forbid() prevents suspending a device; but by
>> design [**], this can be overridden by user space policy via
>>
>>    echo auto > /sys/bus/pci/devices/.../power/control
>>
>> Thus, the above #3/#4 sequence is racy with user space (udev or
>> similar).
>>
>> Notably, many PCI devices are enumerated at subsys_initcall time and so
>> will not race with user space. However, there are several scenarios
>> where PCI devices are created later on, such as with hotplug or when
>> drivers (pwrctrl or controller drivers) are built as modules.
>>
>>    ---
>>
>> [**] The relationship between pm_runtime_forbid(), pm_runtime_allow(),
>> /sys/.../power/control, and the runtime PM usage counter can be subtle.
>> It appears that the intention of pm_runtime_forbid() /
>> pm_runtime_allow() is twofold:
>>
>> 1. Allow the user to disable runtime_pm (force device to always be
>>     powered on) through sysfs.
>> 2. Allow the driver to start with runtime_pm disabled (device forced
>>     on) and user space could later enable runtime_pm.
>>
>> This conclusion comes from reading `Documentation/power/runtime_pm.rst`,
>> specifically the section starting "The user space can effectively
>> disallow".
>>
>> This means that while pm_runtime_forbid() does technically increase the
>> runtime PM usage counter, this usage counter is not a guarantee of
>> functional correctness, because sysfs can decrease that count again.
>>
>>    ---
>>
>> Note that we also move pm_runtime_set_active(), but leave
>> pm_runtime_forbid() in place earlier in the initialization sequence, to
>> avoid confusing user space. From Documentation/power/runtime_pm.rst:
>>
>>    "It should be noted, however, that if the user space has already
>>    intentionally changed the value of /sys/devices/.../power/control to
>>    "auto" to allow the driver to power manage the device at run time, the
>>    driver may confuse it by using pm_runtime_forbid() this way."
>>
>> Thus, we should ensure pm_runtime_forbid() is called before the device
>> is available to user space.
>>
>> Link: https://lore.kernel.org/all/20251016155335.1.I60a53c170a8596661883bd2b4ef475155c7aa72b@changeid/
>> Signed-off-by: Brian Norris <briannorris@...omium.org>
>> Cc: <stable@...r.kernel.org>
>> Reviewed-by: Rafael J. Wysocki (Intel) <rafael@...nel.org>
> Applied to pci/pm for v6.20, thanks!  I tried to simplify the commit
> log so the issue isn't hidden by details.  Happy to restore things if
> I trimmed too much:
>
>      PCI/PM: Prevent runtime suspend until devices are fully initialized
>
>      Previously, it was possible for a PCI device to be runtime-suspended before
>      it was fully initialized. When that happened, the suspend process could
>      save invalid device state, for example, before BAR assignment. Restoring
>      the invalid state during resume may leave the device non-functional.
>
>      Prevent runtime suspend for PCI devices until they are fully initialized by
>      deferring pm_runtime_enable().
>
>      More details on how exactly this may occur:
>
>        1. PCI device is created by pci_scan_slot() or similar
>
>        2. As part of pci_scan_slot(), pci_pm_init() puts the device in D0 and
>           prevents runtime suspend prevented via pm_runtime_forbid()
>
>        3. pci_device_add() adds the underlying 'struct device' via device_add(),
>           which means user space can allow runtime suspend, e.g.,
>
>             echo auto > /sys/bus/pci/devices/.../power/control
>
>        4. PCI device receives BAR configuration
>           (pci_assign_unassigned_bus_resources(), etc.)
>
>        5. pci_bus_add_device() applies final fixups, saves device state, and
>           tries to attach a driver
>
>      The device may potentially be suspended between #3 and #5, so this is racy
>      with user space (udev or similar).
>
>      Many PCI devices are enumerated at subsys_initcall time and so will not
>      race with user space, but devices created later by hotplug or modular
>      pwrctrl or host controller drivers are susceptible to this race.
>
>      More runtime PM details at the first Link: below.
>
>      Signed-off-by: Brian Norris <briannorris@...omium.org>
>      [bhelgaas: simplify commit log]
>      Signed-off-by: Bjorn Helgaas <bhelgaas@...gle.com>
>      Reviewed-by: Rafael J. Wysocki (Intel) <rafael@...nel.org>
>      Cc: stable@...r.kernel.org
>      Link: https://lore.kernel.org/all/20251016155335.1.I60a53c170a8596661883bd2b4ef475155c7aa72b@changeid/
>      Link: https://patch.msgid.link/20251023140901.v4.1.I60a53c170a8596661883bd2b4ef475155c7aa72b@changeid

This patch landed recently in linux-next as commit c796513dc54e 
("PCI/PM: Prevent runtime suspend until devices are fully initialized"). 
In my tests I found that it sometimes causes the "pci 0000:01:00.0: 
runtime PM trying to activate child device 0000:01:00.0 but parent 
(0000:00:00.0) is not active" warning on Qualcomm Robotics RB5 board 
(arch/arm64/boot/dts/qcom/qrb5165-rb5.dts). This in turn causes a 
lockdep warning about console lock, but this is just a consequence of 
the runtime pm warning. Reverting $subject patch on top of current 
linux-next hides this warning.

Here is a kernel log:

pci 0000:01:00.0: [17cb:1101] type 00 class 0xff0000 PCIe Endpoint
pci 0000:01:00.0: BAR 0 [mem 0x00000000-0x000fffff 64bit]
pci 0000:01:00.0: PME# supported from D0 D3hot D3cold
pci 0000:01:00.0: 4.000 Gb/s available PCIe bandwidth, limited by 5.0 
GT/s PCIe x1 link at 0000:00:00.0 (capable of 7.876 Gb/s with 8.0 GT/s 
PCIe x1 link)
pci 0000:01:00.0: Adding to iommu group 13
pci 0000:01:00.0: ASPM: default states L0s L1
pcieport 0000:00:00.0: bridge window [mem 0x60400000-0x604fffff]: assigned
pci 0000:01:00.0: BAR 0 [mem 0x60400000-0x604fffff 64bit]: assigned
pci 0000:01:00.0: runtime PM trying to activate child device 
0000:01:00.0 but parent (0000:00:00.0) is not active

======================================================
WARNING: possible circular locking dependency detected
6.19.0-rc1+ #16398 Not tainted
------------------------------------------------------
kworker/3:0/33 is trying to acquire lock:
ffffcd182ff1ae98 (console_owner){..-.}-{0:0}, at: 
console_lock_spinning_enable+0x44/0x78

but task is already holding lock:
ffff0000835c5238 (&dev->power.lock/1){....}-{3:3}, at: 
__pm_runtime_set_status+0x240/0x384

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #3 (&dev->power.lock/1){....}-{3:3}:
        _raw_spin_lock_nested+0x44/0x5c
        __pm_runtime_set_status+0x240/0x384
        arm_smmu_device_probe+0xbe0/0xe5c
        platform_probe+0x5c/0xac
        really_probe+0xbc/0x298
        __driver_probe_device+0x78/0x12c
        driver_probe_device+0x40/0x164
        __driver_attach+0x9c/0x1ac
        bus_for_each_dev+0x74/0xd0
        driver_attach+0x24/0x30
        bus_add_driver+0xe4/0x208
        driver_register+0x60/0x128
        __platform_driver_register+0x24/0x30
        arm_smmu_driver_init+0x20/0x2c
        do_one_initcall+0x64/0x308
        kernel_init_freeable+0x284/0x500
        kernel_init+0x20/0x1d8
        ret_from_fork+0x10/0x20

-> #2 (&dev->power.lock){-...}-{3:3}:
        _raw_spin_lock_irqsave+0x60/0x88
        __pm_runtime_resume+0x4c/0xbc
        __uart_start+0x4c/0x114
        uart_write+0x98/0x278
        n_tty_write+0x1dc/0x4f0
        file_tty_write.constprop.0+0x12c/0x2bc
        redirected_tty_write+0xc0/0x108
        do_iter_readv_writev+0xdc/0x1c0
        vfs_writev+0xf0/0x280
        do_writev+0x74/0x13c
        __arm64_sys_writev+0x20/0x2c
        invoke_syscall+0x48/0x10c
        el0_svc_common.constprop.0+0x40/0xe8
        do_el0_svc+0x20/0x2c
        el0_svc+0x50/0x2e8
        el0t_64_sync_handler+0xa0/0xe4
        el0t_64_sync+0x198/0x19c

-> #1 (&port_lock_key){-.-.}-{3:3}:
        _raw_spin_lock_irqsave+0x60/0x88
        qcom_geni_serial_console_write+0x50/0x344
        console_flush_one_record+0x33c/0x474
        console_unlock+0x80/0x14c
        vprintk_emit+0x258/0x3d0
        vprintk_default+0x38/0x44
        vprintk+0x28/0x34
        _printk+0x5c/0x84
        register_console+0x278/0x510
        serial_core_register_port+0x6cc/0x79c
        serial_ctrl_register_port+0x10/0x1c
        uart_add_one_port+0x10/0x1c
        qcom_geni_serial_probe+0x2c0/0x448
        platform_probe+0x5c/0xac
        really_probe+0xbc/0x298
        __driver_probe_device+0x78/0x12c
        driver_probe_device+0x40/0x164
        __device_attach_driver+0xb8/0x138
        bus_for_each_drv+0x80/0xdc
        __device_attach+0xa8/0x1b0
        device_initial_probe+0x50/0x54
        bus_probe_device+0x38/0xa8
        device_add+0x540/0x720
        of_device_add+0x44/0x60
        of_platform_device_create_pdata+0x90/0x11c
        of_platform_bus_create+0x17c/0x394
        of_platform_populate+0x58/0xf8
        devm_of_platform_populate+0x58/0xbc
        geni_se_probe+0xdc/0x164
        platform_probe+0x5c/0xac
        really_probe+0xbc/0x298
        __driver_probe_device+0x78/0x12c
        driver_probe_device+0x40/0x164
        __device_attach_driver+0xb8/0x138
        bus_for_each_drv+0x80/0xdc
        __device_attach+0xa8/0x1b0
        device_initial_probe+0x50/0x54
        bus_probe_device+0x38/0xa8
        deferred_probe_work_func+0x8c/0xc8
        process_one_work+0x208/0x604
        worker_thread+0x244/0x388
        kthread+0x150/0x228
        ret_from_fork+0x10/0x20

-> #0 (console_owner){..-.}-{0:0}:
        __lock_acquire+0x1408/0x2254
        lock_acquire+0x1c8/0x354
        console_lock_spinning_enable+0x68/0x78
        console_flush_one_record+0x300/0x474
        console_unlock+0x80/0x14c
        vprintk_emit+0x258/0x3d0
        dev_vprintk_emit+0xd8/0x1a0
        dev_printk_emit+0x58/0x80
        __dev_printk+0x3c/0x88
        _dev_err+0x60/0x88
        __pm_runtime_set_status+0x28c/0x384
        pci_bus_add_device+0xa4/0x18c
        pci_bus_add_devices+0x3c/0x88
        pci_bus_add_devices+0x68/0x88
        pci_rescan_bus+0x30/0x44
        rescan_work_func+0x28/0x3c
        process_one_work+0x208/0x604
        worker_thread+0x244/0x388
        kthread+0x150/0x228
        ret_from_fork+0x10/0x20

other info that might help us debug this:

Chain exists of:
   console_owner --> &dev->power.lock --> &dev->power.lock/1

  Possible unsafe locking scenario:

        CPU0                    CPU1
        ----                    ----
   lock(&dev->power.lock/1);
                                lock(&dev->power.lock);
lock(&dev->power.lock/1);
   lock(console_owner);

  *** DEADLOCK ***

7 locks held by kworker/3:0/33:
  #0: ffff00008000d948 ((wq_completion)events){+.+.}-{0:0}, at: 
process_one_work+0x18c/0x604
  #1: ffff8000802a3de0 ((work_completion)(&pwrctrl->work)){+.+.}-{0:0}, 
at: process_one_work+0x1b4/0x604
  #2: ffffcd18301138e8 (pci_rescan_remove_lock){+.+.}-{4:4}, at: 
pci_lock_rescan_remove+0x1c/0x28
  #3: ffff00008ac8a238 (&dev->power.lock){-...}-{3:3}, at: 
__pm_runtime_set_status+0x1d4/0x384
  #4: ffff0000835c5238 (&dev->power.lock/1){....}-{3:3}, at: 
__pm_runtime_set_status+0x240/0x384
  #5: ffffcd182ff1ac78 (console_lock){+.+.}-{0:0}, at: 
dev_vprintk_emit+0xd8/0x1a0
  #6: ffffcd182ff1acd8 (console_srcu){....}-{0:0}, at: 
console_flush_one_record+0x0/0x474

stack backtrace:
CPU: 3 UID: 0 PID: 33 Comm: kworker/3:0 Not tainted 6.19.0-rc1+ #16398 
PREEMPT
Hardware name: Qualcomm Technologies, Inc. Robotics RB5 (DT)
Workqueue: events rescan_work_func
Call trace:
  show_stack+0x18/0x24 (C)
  dump_stack_lvl+0x90/0xd0
  dump_stack+0x18/0x24
  print_circular_bug+0x298/0x37c
  check_noncircular+0x15c/0x170
  __lock_acquire+0x1408/0x2254
  lock_acquire+0x1c8/0x354
  console_lock_spinning_enable+0x68/0x78
  console_flush_one_record+0x300/0x474
  console_unlock+0x80/0x14c
  vprintk_emit+0x258/0x3d0
  dev_vprintk_emit+0xd8/0x1a0
  dev_printk_emit+0x58/0x80
  __dev_printk+0x3c/0x88
  _dev_err+0x60/0x88
  __pm_runtime_set_status+0x28c/0x384
  pci_bus_add_device+0xa4/0x18c
  pci_bus_add_devices+0x3c/0x88
  pci_bus_add_devices+0x68/0x88
  pci_rescan_bus+0x30/0x44
  rescan_work_func+0x28/0x3c
  process_one_work+0x208/0x604
  worker_thread+0x244/0x388
  kthread+0x150/0x228
  ret_from_fork+0x10/0x20

This looks a bit similar to the issue reported some time ago on a 
different board:

https://lore.kernel.org/all/6d438995-4d6d-4a21-9ad2-8a0352482d44@samsung.com/

Let me know if I can somehow help debugging this.

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland