linux-kernel - Re: [PATCH v4 0/8] PCI/pwrctrl: Major rework to integrate pwrctrl devices with controller drivers

Message-ID: <ef5d5fdc-be08-4859-a625-cdd1ae0c46c2@seco.com>
Date: Thu, 15 Jan 2026 14:26:32 -0500
From: Sean Anderson <sean.anderson@...o.com>
To: Manivannan Sadhasivam <mani@...nel.org>
Cc: manivannan.sadhasivam@....qualcomm.com,
 Lorenzo Pieralisi <lpieralisi@...nel.org>,
 Krzysztof Wilczyński <kwilczynski@...nel.org>,
 Rob Herring <robh@...nel.org>, Bjorn Helgaas <bhelgaas@...gle.com>,
 Bartosz Golaszewski <brgl@...ev.pl>, linux-pci@...r.kernel.org,
 linux-arm-msm@...r.kernel.org, linux-kernel@...r.kernel.org,
 Chen-Yu Tsai <wens@...nel.org>, Brian Norris <briannorris@...omium.org>,
 Krishna Chaitanya Chundru <krishna.chundru@....qualcomm.com>,
 Niklas Cassel <cassel@...nel.org>, Alex Elder <elder@...cstar.com>,
 Bartosz Golaszewski <bartosz.golaszewski@....qualcomm.com>,
 Chen-Yu Tsai <wenst@...omium.org>,
 Bartosz Golaszewski <bartosz.golaszewski@...aro.org>
Subject: Re: [PATCH v4 0/8] PCI/pwrctrl: Major rework to integrate pwrctrl
 devices with controller drivers

On 1/14/26 03:48, Manivannan Sadhasivam wrote:
> On Tue, Jan 13, 2026 at 12:15:01PM -0500, Sean Anderson wrote:
>> On 1/5/26 08:55, Manivannan Sadhasivam via B4 Relay wrote:
>> > Hi,
>> 
>> I asked substantially similar questions on v2, but since I never got a
>> response I want to reiterate them on v4 to make sure they don't get
>> lost.
>> 
> 
> I did respond to your queries in v2, but lost your last reply in that thread:
> https://cas5-0-urlprotect.trendmicro.com:443/wis/clicktime/v1/query?url=https%3a%2f%2flore.kernel.org%2flinux%2dpci%2f8269249f%2d48a9%2d4136%2da326%2d23f5076be487%40linux.dev%2f&umid=db5ea813-d162-4dc2-9847-b6f01a3e22ce&rct=1768380513&auth=d807158c60b7d2502abde8a2fc01f40662980862-8453843623be88c725b7c9a8baf78220575003f2
> 
> Apologies!
> 
>> > This series provides a major rework for the PCI power control (pwrctrl)
>> > framework to enable the pwrctrl devices to be controlled by the PCI controller
>> > drivers.
>> > 
>> > Problem Statement
>> > =================
>> > 
>> > Currently, the pwrctrl framework faces two major issues:
>> > 
>> > 1. Missing PERST# integration
>> > 2. Inability to properly handle bus extenders such as PCIe switch devices
>> > 
>> > First issue arises from the disconnect between the PCI controller drivers and
>> > pwrctrl framework. At present, the pwrctrl framework just operates on its own
>> > with the help of the PCI core. The pwrctrl devices are created by the PCI core
>> > during initial bus scan and the pwrctrl drivers once bind, just power on the
>> > PCI devices during their probe(). This design conflicts with the PCI Express
>> > Card Electromechanical Specification requirements for PERST# timing. The reason
>> > is, PERST# signals are mostly handled by the controller drivers and often
>> > deasserted even before the pwrctrl drivers probe. According to the spec, PERST#
>> > should be deasserted only after power and reference clock to the device are
>> > stable, within predefined timing parameters.
>> > 
>> > The second issue stems from the PCI bus scan completing before pwrctrl drivers
>> > probe. This poses a significant problem for PCI bus extenders like switches
>> > because the PCI core allocates upstream bridge resources during the initial
>> > scan. If the upstream bridge is not hotplug capable, resources are allocated
>> > only for the number of downstream buses detected at scan time, which might be
>> > just one if the switch was not powered and enumerated at that time. Later, when
>> > the pwrctrl driver powers on and enumerates the switch, enumeration fails due to
>> > insufficient upstream bridge resources.
>> 
>> OK, so to clarify the problem is an architecture like
>> 
>>     RP
>>     |-- Bridge 1 (automatic)
>>     |   |-- Device 1
>>     |   `-- Bridge 2 (needs pwrseq)
>>     |       `-- Device 2
>>     `-- Bridge 3 (automatic)
>>         `-- Device 3
>> 
> 
> This topology is not possible with PCIe. A single Root Port can only connect to
> a single bridge. But applies to PCI.

OK, well imagine it like

     RP
     `-- Host Bridge (automatic)
         |-- Bridge 1 (automatic)
         |   |-- Device 1
         |   `-- Bridge 2 (needs pwrseq)
         |       `-- Device 2
         `-- Bridge 3 (automatic)
             `-- Device 3

You raised the problem, so what I am asking is: is this such a
problematic topology? And if not, please describe one.

>> where Bridge 2 has a devicetree node with a pwrseq binding? So we do the
>> initial scan and allocate resources for bridge/devices 1 and 3 with the
>> resources for bridge 3 immediately above those for bridge 1. Then when
>> bridge 2 shows up we can't resize bridge 1's windows since bridge 3's
>> windows are in the way?
>> 
> 
> It is not a problem with resizing, it is the problem with how much you can
> resize. And also if that bridge 2 is a switch and if it exposes multiple
> downstream busses, then the upstream bridge 1 will run out of resources.

OK, but what I am saying is that I don't believe Bridge 2 can need
pwrseq if Bridge 1 doesn't. So I don't think the topology as-illustrated
can exist.

It's possible that there could be a problem with multiple levels of
bridges all needing pwrseq, but does such a system exist?

> If bridge 2 is a hotplug bridge, then no issues. But I was only referring to
> non-hotplug capable switches.
> 
>> But is it even valid to have a pwrseq node on bridge 2 without one on
>> bridge 1? If bridge 1 is automatically controlled, then I would expect
>> bridge 2 to be as well. E.g. I would expect bridge 2's reset sequence to
>> be controlled by the secondary bus reset bit in bridge 1's bridge
>> control register.
>> 
> 
> Technically it is possible for Bridge 2 to have a pwrctrl requirement. What is
> limiting from spec PoV?

If this is the case then we need to be able to handle the resource
constraint problem. But if such a system doesn't exist then there is no
problem with the existing architecture. Only a design where a pwrseq
bridge sits below an automatic one has resource problems, while most
designs like

     RP
     `-- Bridge 1 (pwrseq)
         |-- Bridge 2 (automatic)
         |   |-- Device 1
         |   |-- Device 2
         `-- Bridge 3 (automatic)
             `-- Device 3

have no resource problems even with the current subsystem.

>> And a very similar architecture like
>> 
>>     RP
>>     |-- Bridge 4 (pwrseq)
>>     |   |-- Device 4
>>     `-- Bridge 5 (automatic)
>>         `-- Device 5
>> 
>> has no problems since the resources for bridge 4 can be allocated above
>> those for bridge 5 whenever it shows up.
>> 
> 
> Again, if bridge 4 is not hotplug capable and if it is a switch, the problem is
> still applicable.

This doesn't apply even if bridge 4 is not hotplug capable. It will show
up after bridge 5 gets probed and just grab the next available
resources.
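
As a rough illustration of the difference, here is a toy C model (not
kernel code; struct parent, window_alloc() and window_grow() are made-up
names) in which child windows are handed out back to back. Growing a
window in place only works if no sibling sits directly above it, while a
late-arriving bridge can still append as long as space remains.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WINDOWS 8

/* Toy model: a parent range carved into contiguous child windows. */
struct window {
	unsigned int start;
	unsigned int size;
};

struct parent {
	unsigned int size;      /* total space available */
	unsigned int next_free; /* first address not yet handed out */
	struct window win[MAX_WINDOWS];
	size_t count;
};

/* Append a new window at the next free address; returns index or -1. */
int window_alloc(struct parent *p, unsigned int size)
{
	assert(p != NULL);
	if (p->count == MAX_WINDOWS || size > p->size - p->next_free)
		return -1;
	p->win[p->count].start = p->next_free;
	p->win[p->count].size = size;
	p->next_free += size;
	return (int)p->count++;
}

/* Grow window idx in place; fails if a sibling sits directly above it. */
bool window_grow(struct parent *p, size_t idx, unsigned int extra)
{
	assert(p != NULL);
	if (idx >= p->count)
		return false;
	if (p->win[idx].start + p->win[idx].size != p->next_free)
		return false; /* another window is in the way */
	if (extra > p->size - p->next_free)
		return false; /* parent range exhausted */
	p->win[idx].size += extra;
	p->next_free += extra;
	return true;
}
```

With windows for bridge 1 and bridge 3 allocated first, growing bridge 1
fails, but a new window for a late bridge 4 still gets the next free
range.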

>> These problems seem very similar to what hotplug bridges have to handle
>> (except that we (usually) only need to do one hotplug per boot). So
>> maybe we should set is_hotplug_bridge on bridges with a pwrseq node.
>> That way they'll get resources distributed for when the downstream port
>> shows up. As an optimization, we could then release those resources once
>> the downstream port is scanned.
>> 
> 
> That would be incorrect. You cannot set 'is_hotplug_bridge' to 'true' for a
> non-hotplug capable bridge. You can call it as a hack, but there is no place
> for that in upstream.

Then introduce a new boolean called 'is_pwrseq_bridge' and check for it
when allocating resources.
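
A standalone sketch of that check, under stated assumptions:
is_hotplug_bridge is an existing flag on struct pci_dev, while
is_pwrseq_bridge, struct toy_bridge and buses_to_reserve() are
hypothetical names. The real sizing logic in the PCI core is more
involved than this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the flags a bridge would carry; not struct pci_dev. */
struct toy_bridge {
	bool is_hotplug_bridge;   /* bridge supports hotplug */
	bool is_pwrseq_bridge;    /* hypothetical: child is powered on later */
	unsigned int buses_found; /* downstream buses seen at scan time */
};

/*
 * Return how many bus numbers to reserve below the bridge. Bridges whose
 * children may appear after the initial scan get 'extra' on top of what
 * was found.
 */
unsigned int buses_to_reserve(const struct toy_bridge *b, unsigned int extra)
{
	unsigned int n;

	assert(b != NULL);
	n = b->buses_found;
	if (b->is_hotplug_bridge || b->is_pwrseq_bridge)
		n += extra;
	return n;
}
```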

>> > Proposal
>> > ========
>> > 
>> > This series addresses both issues by introducing new individual APIs for pwrctrl
>> > device creation, destruction, power on, and power off operations. Controller
>> > drivers are expected to invoke these APIs during their probe(), remove(),
>> > suspend(), and resume() operations.
>> 
>> (just for the record)
>> 
>> I think the existing design is quite elegant, since the operations
>> associated with the bridge correspond directly to device lifecycle
>> operations. It also avoids problems related to the root port trying to
>> look up its own child (possibly missing a driver) during probe.
>> 
> 
> I agree with you that it is elegant and I even was very reluctant to move out of
> it [1]. But lately, I understood that we cannot scale the pwrctrl framework if we
> do not give flexibility to the controller drivers [2].
> 
> [1] https://cas5-0-urlprotect.trendmicro.com:443/wis/clicktime/v1/query?url=https%3a%2f%2flore.kernel.org%2flinux%2dpci%2feix65qdwtk5ocd7lj6sw2lslidivauzyn6h5cc4mc2nnci52im%40qfmbmwy2zjbe%2f&umid=db5ea813-d162-4dc2-9847-b6f01a3e22ce&rct=1768380513&auth=d807158c60b7d2502abde8a2fc01f40662980862-377ad79c69a5ff9c69de76d9fcf5f030d066027a
> [2] https://cas5-0-urlprotect.trendmicro.com:443/wis/clicktime/v1/query?url=https%3a%2f%2flore.kernel.org%2flinux%2dpci%2faG3IWdZIhnk01t2A%40google.com%2f&umid=db5ea813-d162-4dc2-9847-b6f01a3e22ce&rct=1768380513&auth=d807158c60b7d2502abde8a2fc01f40662980862-9a33d827cf703f2827fca86fd99acf563ca26bd9
> 
>> > This integration allows better coordination
>> > between controller drivers and the pwrctrl framework, enabling enhanced features
>> > such as D3Cold support.
>> 
>> 
>> I think this should be handled by the power sequencing driver,
>> especially as there are timing requirements for the other resources
>> referenced to PERST? If we are going to touch each driver, it would
>> be much better to consolidate things by removing the ad-hoc PERST
>> support.
>> 
>> Different drivers control PERST in various ways, but I think this can
>> be abstracted behind a GPIO controller (if necessary for e.g. MMIO-based
>> control). If there's no reset-gpios property in the pwrseq node then we
>> could automatically look up the GPIO on the root port.
>> 
> 
> Not at all. We cannot model PERST# as a GPIO in all the cases. Some drivers
> implement PERST# as a set of MMIO operations in the Root Complex MMIO space and
> that space belongs to the controller driver.

That's what I mean. Implement a GPIO driver with one GPIO and perform
the MMIO operations as requested.

Or we can invert things and add a reset op to pci_ops. If present then
call it, and if absent use the PERST GPIO on the bridge.
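
A minimal userspace sketch of that dispatch, assuming made-up types:
struct slot, struct host_bridge, struct bridge_ops and all the helpers
below are hypothetical stand-ins, not the kernel's pci_ops or
pci_host_bridge. The point is only that the controller hook wins when
present and the GPIO is the fallback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct slot {
	bool has_reset_gpio; /* a reset GPIO was found for this slot */
	bool perst_asserted; /* toy state of that GPIO */
};

struct host_bridge;

struct bridge_ops {
	/* Optional: controller-specific PERST# control (e.g. via MMIO). */
	int (*assert_reset)(struct host_bridge *bridge, struct slot *slot);
};

struct host_bridge {
	const struct bridge_ops *ops;
	unsigned int mmio_reset_reg; /* toy register, bit 0 = PERST# */
};

/* Fallback path: drive the slot's PERST# GPIO. */
int gpio_assert_reset(struct slot *slot)
{
	assert(slot != NULL);
	if (!slot->has_reset_gpio)
		return -1; /* hypothetical error code: nothing to drive */
	slot->perst_asserted = true;
	return 0;
}

/* Example controller hook: set a bit in the toy MMIO register. */
int mmio_assert_reset(struct host_bridge *bridge, struct slot *slot)
{
	(void)slot;
	bridge->mmio_reset_reg |= 1u;
	return 0;
}

/* Use the controller's op if it has one, else fall back to the GPIO. */
int slot_assert_reset(struct host_bridge *bridge, struct slot *slot)
{
	if (bridge && bridge->ops && bridge->ops->assert_reset)
		return bridge->ops->assert_reset(bridge, slot);
	return gpio_assert_reset(slot);
}
```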

> FYI, I did try something similar before:
> https://cas5-0-urlprotect.trendmicro.com:443/wis/clicktime/v1/query?url=https%3a%2f%2flore.kernel.org%2flinux%2dpci%2f20250707%2dpci%2dpwrctrl%2dperst%2dv1%2d0%2dc3c7e513e312%40kernel.org%2f&umid=db5ea813-d162-4dc2-9847-b6f01a3e22ce&rct=1768380513&auth=d807158c60b7d2502abde8a2fc01f40662980862-e06652b06144d91b37cae1f9289747fe7cbe0762
>> > The original design aimed to avoid modifying controller drivers for pwrctrl
>> > integration. However, this approach lacked scalability because different
>> > controllers have varying requirements for when devices should be powered on. For
>> > example, controller drivers require devices to be powered on early for
>> > successful PHY initialization.
>> 
>> Can you elaborate on this? Previously you said
>> 
>> | Some platforms do LTSSM during phy_init(), so they will fail if the
>> | device is not powered ON at that time.
>> 
>> What do you mean by "do LTSSM during phy_init()"? Do you have a specific
>> driver in mind?
>> 
> 
> I believe the Mediatek PCIe controller driver used in Chromebooks exhibit this
> behavior. Chen talked about it in his LPC session:
> https://cas5-0-urlprotect.trendmicro.com:443/wis/clicktime/v1/query?url=https%3a%2f%2flpc.events%2fevent%2f19%2fcontributions%2f2023%2f&umid=db5ea813-d162-4dc2-9847-b6f01a3e22ce&rct=1768380513&auth=d807158c60b7d2502abde8a2fc01f40662980862-59ecd8a94baa970f1f962febb6fe20f15058ef42

I went through 

mediatek/phy-mtk-pcie.c
mediatek/phy-mtk-tphy.c
mediatek/phy-mtk-xsphy.c
ralink/phy-mt7621-pci.c

and didn't see anything where they wait for the link to come up or check
the link state and fail.

The mtk PCIe drivers may check for this, but I'm saying that we
*shouldn't* do that in probe.

>> I would expect that the LTSSM would remain in the Detect state until the
>> pwrseq driver is being probed.
>> 
> 
> True, but if the API (phy_init()) expects the LTSSM to move to L0, then it will
> fail, right? It might be what's happening with above mentioned platform.

How can the API expect this?

I'm not saying that such a situation cannot exist, but I don't think
it's a common case.

>> > By using these explicit APIs, controller drivers gain fine grained control over
>> > their associated pwrctrl devices.
>> > 
>> > This series modified the pcie-qcom driver (only consumer of pwrctrl framework)
>> > to adopt to these APIs and also removed the old pwrctrl code from PCI core. This
>> > could be used as a reference to add pwrctrl support for other controller drivers
>> > also.
>> > 
>> > For example, to control the 3.3v supply to the PCI slot where the NVMe device is
>> > connected, below modifications are required:
>> > 
>> > Devicetree
>> > ----------
>> > 
>> > 	// In SoC dtsi:
>> > 
>> > 	pci@...8000 { // controller node
>> > 		...
>> > 		pcie1_port0: pcie@0 { // PCI Root Port node
>> > 			compatible = "pciclass,0604"; // required for pwrctrl
>> > 							 driver bind
>> > 			...
>> > 		};
>> > 	};
>> > 
>> > 	// In board dts:
>> > 
>> > 	&pcie1_port0 {
>> > 		reset-gpios = <&tlmm 152 GPIO_ACTIVE_LOW>; // optional
>> > 		vpcie3v3-supply = <&vreg_nvme>; // NVMe power supply
>> > 	};
>> > 
>> > Controller driver
>> > -----------------
>> > 
>> > 	// Select PCI_PWRCTRL_SLOT in controller Kconfig
>> > 
>> > 	probe() {
>> > 		...
>> > 		// Initialize controller resources
>> > 		pci_pwrctrl_create_devices(&pdev->dev);
>> > 		pci_pwrctrl_power_on_devices(&pdev->dev);
>> > 		// Deassert PERST# (optional)
>> > 		...
>> > 		pci_host_probe(); // Allocate host bridge and start bus scan
>> > 	}
>> > 
>> > 	suspend {
>> > 		// PME_Turn_Off broadcast
>> > 		// Assert PERST# (optional)
>> > 		pci_pwrctrl_power_off_devices(&pdev->dev);
>> > 		...
>> > 	}
>> > 
>> > 	resume {
>> > 		...
>> > 		pci_pwrctrl_power_on_devices(&pdev->dev);
>> > 		// Deassert PERST# (optional)
>> > 	}
>> > 
>> > I will add a documentation for the pwrctrl framework in the coming days to make
>> > it easier to use.
>> > 
>> > Testing
>> > =======
>> > 
>> > This series is tested on the Lenovo Thinkpad T14s laptop based on Qcom X1E
>> > chipset and RB3Gen2 development board with TC9563 switch based on Qcom QCS6490
>> > chipset.
>> > 
>> > **NOTE**: With this series, the controller driver may undergo multiple probe
>> > deferral if the pwrctrl driver was not available during the controller driver
>> > probe. This is pretty much required to avoid the resource allocation issue. I
>> > plan to replace probe deferral with blocking wait in the coming days.
>> 
>> You can only do a blocking wait after deferring at least once, since the
>> root port may be probed synchronously during boot. I really think this
>> is rather messy and something we should avoid architecturally while we
>> have the chance.
>> 
> 
> By blocking wait I meant that the controller probe itself will do a blocking
> wait until the pwrctrl drivers gets bound. Since this happens way before the PCI
> bus scan, there won't be any Root Port probed synchronously.

You can't do that because the pwrctrl driver may *never* be loaded. And
this may deadlock the boot sequence because the initial probe is
performed synchronously from the initcall. i.e.

do_initcalls
  my_driver_init
    driver_register
      bus_add_driver
        driver_attach
          driver_probe_device

If the PCI controller is probed before the device that holds the pwrctrl
module, you will deadlock! So you can only sleep indefinitely if you are
being probed asynchronously.

-----

Maybe the best way to address this is to add assert_reset/link_up/
link_down callbacks to pci_ops. Then pwrctrl_slot probe could look like

    bridge = to_pci_host_bridge(dev->parent);
    of_regulator_bulk_get_all();
    regulator_bulk_enable();
    devm_clk_get_optional_enabled();
    devm_gpiod_get_optional(/* "reset" */);
    if (bridge && bridge->ops->assert_reset)
        ret = bridge->ops->assert_reset(bridge, slot);
    else
        ret = assert_reset_gpio(slot);

    /* Wait 100 ms unless PERST was already asserted */
    if (ret != ALREADY_ASSERTED)
        fsleep(100000);

    /* Deassert PERST and bring the link up */
    if (bridge && bridge->ops->link_up)
        bridge->ops->link_up(bridge, slot);
    else
        slot_deassert_reset(slot);

    devm_add_action_or_reset(link_down);
    pci_pwrctrl_init();
    devm_pci_pwrctrl_device_set_ready();

--Sean