Message-ID: <Z_6kZ7x7gnoH-P7x@google.com>
Date: Tue, 15 Apr 2025 11:24:39 -0700
From: Brian Norris <briannorris@...omium.org>
To: Manivannan Sadhasivam <manivannan.sadhasivam@...aro.org>
Cc: Bartosz Golaszewski <brgl@...ev.pl>,
	Bjorn Helgaas <bhelgaas@...gle.com>,
	Jingoo Han <jingoohan1@...il.com>,
	Lorenzo Pieralisi <lpieralisi@...nel.org>,
	Krzysztof Wilczyński <kw@...ux.com>,
	Krishna Chaitanya Chundru <krishna.chundru@....qualcomm.com>,
	Rob Herring <robh@...nel.org>, linux-pci@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-arm-msm@...r.kernel.org,
	dmitry.baryshkov@...aro.org, Tsai Sung-Fu <danielsftsai@...gle.com>
Subject: Re: [RFC] PCI: pwrctrl and link-up dependencies

Hi,

On Tue, Apr 15, 2025 at 11:02:14AM +0530, Manivannan Sadhasivam wrote:
> On Mon, Apr 14, 2025 at 04:24:38PM -0700, Brian Norris wrote:
> > On Mon, Apr 14, 2025 at 04:27:35PM +0530, Manivannan Sadhasivam wrote:
> > > On Tue, Apr 08, 2025 at 12:59:36PM -0700, Brian Norris wrote:
> > > > For link startup, pcie-designware-host.c currently
> > > > (a) starts the link via platform-specific means (dw_pcie::ops::start_link()) and
> > > > (b) waits for the link training to complete.
> > > > 
> > > > However, (b) will fail if the other end of the link is not powered up --
> > > > e.g., if the appropriate pwrctrl driver has not yet loaded, or its
> > > > device hasn't finished probing. Today, this can mean the designware
> > > > driver will either fail to probe,
> > > 
> > > This is not correct.
> > 
> > That depends on the implementation of start_link(). But I suppose the
> > intention is that start_link() only "starts" and doesn't care where
> > things go from there. (IOW, my local start_link() implementation is
> > probably wrong at the moment, as it performs some poll/retry steps too.)
> > 
> 
> The callback is supposed to just start
> the link and not wait for anything else.

Ack, thanks. I've learned something.

> > > > or at least waste time for a condition
> > > > that we can't achieve (link up), depending on the HW/driver
> > > > implementation.
> > > > 
> > > 
> > > Unfortunately we cannot avoid this waiting time as we don't know if a device is
> > > attached to the bus or not. The 1s wait time predates my involvement with DWC
> > > drivers.
> > 
> > I don't really love that answer. It means that any DWC-based platform
> > that needs pwrctrl and doesn't set use_link_irq==true will waste 1
> > second per PCIe controller. While it's hard to make guarantees about old
> > and/or unloved drivers, I'd like to think I can do better on new ones.
> > 
> 
> Even I'd like to avoid the 1s delay. But the problem is: how would you know
> whether the device is attached to the bus or not? The delay is to account for
> the fact that the link may take up to 1s to come up after starting LTSSM. So if
> we do not wait for that period, there is a chance that we would report a
> false-negative status and the enumeration would fail.

I understand there are cases where we won't know, if we don't have
hotplug/presence-detect wiring. But for the cases where we do know, I
think it's a cop-out to say "we can't handle it." See below.

> > Consider
> > the lesser statement that was paired along with it: always wasting 1
> > second per controller polling for something that will never happen. It
> > feels backwards and wasteful.
> > 
> 
> Again, I do get your point. But tell me how a controller can reliably detect
> that there is a device attached to the bus. Only on your Android setup do you
> know for sure that the device won't be there during probe. So you consider the
> 1s wait a waste of time, and that is fair. But what if the same controller is
> used in another platform which is not Android, or the endpoint device is
> powered on during probe itself without relying on userspace?

This has nothing to do with Android.

IMO, we have 3 main categories of setups that we should primarily care
about:

(1) hotplug is supported, and PRSNT1#/PRSNT2# are wired
(2) hotplug is not supported, but a device is present and is already
    powered.
(3) hotplug is not supported, but a device is present. The device
    requires external power (i.e., pwrctrl / "subdevice regulators" /
    etc. should be involved)

AFAICT, we don't have much of (1). But if we did, we should also be able
to avoid initial delays, as we can reliably detect presence, and only
wait for training when we know it should succeed. (Or even better,
handle it async via an interrupt.)
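To make (1) concrete, here's roughly the check I have in mind -- a
sketch only, not actual designware code. The DBI accessors are the
existing DWC helpers; the policy around them is made up:

static bool dw_pcie_card_present(struct dw_pcie *pci)
{
	u8 cap = dw_pcie_find_capability(pci, PCI_CAP_ID_EXP);
	u16 flags, sltsta;

	if (!cap)
		return true;	/* no PCIe cap: assume present, poll as today */

	flags = dw_pcie_readw_dbi(pci, cap + PCI_EXP_FLAGS);
	if (!(flags & PCI_EXP_FLAGS_SLOT))
		return true;	/* no slot registers: can't tell, poll as today */

	sltsta = dw_pcie_readw_dbi(pci, cap + PCI_EXP_SLTSTA);
	return sltsta & PCI_EXP_SLTSTA_PDS;	/* Presence Detect State */
}

If that returns false, there's no point waiting for training at all.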

For (2), we're also OK. The initial polling delay is likely to be much
less than 1 second.

For (3) ... all non-pwrctrl drivers (pcie-brcmstb.c, pcie-tegra.c,
pcie-tegra194.c, pcie-rockchip-host.c, ...) power things up before they
configure ports, start LTSSM, and have any expectation of detecting a
link. If a device is there, they again should commonly find it in much
less than 1 second.
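
Schematically (all foo_* names are made up; each driver differs in
detail, but the shape is the same):

static int foo_pcie_host_init(struct foo_pcie *pcie)
{
	int ret;

	/* Power is stable _before_ PERST# deassert and LTSSM start ... */
	ret = regulator_bulk_enable(pcie->num_supplies, pcie->supplies);
	if (ret)
		return ret;

	msleep(100);			/* TPVPERL: power stable -> PERST# */
	gpiod_set_value_cansleep(pcie->reset_gpio, 0);	/* deassert PERST# */

	foo_pcie_start_ltssm(pcie);

	/* ... so this poll usually succeeds in well under 1 second. */
	return foo_pcie_wait_for_link(pcie);
}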

However, when *using* pwrctrl, we have the ordering all wrong (IMO), and
so we eat needless delay. We *will not* successfully bring the link up,
and we *won't* find the device. This smells like a design problem, where
we have failed to plumb the information we already have available.

I think you're too worried about a case (4): that hotplug is not
supported, and a device is not present.

IMO, (4) should mostly be handled by simply disabling the unused
controller in device tree, or living with the timeouts. If a platform
doesn't support hotplug, then you can't expect optimal behavior for
unplugged devices.

I'm not complaining about (4). I'm complaining about (3) with pwrctrl.

> > One of my key questions: if I don't have a link-up IRQ, how can I avoid
> > this waste? pcie-brcmstb avoids that waste today (for the common case
> > where there is, in fact, a device connected), and it would be a
> > regression for it to start using pwrctrl tomorrow.
> > 
> 
> Why are you tying pwrctrl to this designware driver behavior? The two are
> unrelated. Even if you don't use pwrctrl and use the controller driver to bring
> up the device, the 1s delay would be applicable (if there is no device).

We might be talking past each other. Per above, I think we can do better
with (1)-(3). But you're bringing up (4). Problem (3) exists for all
drivers, although it's more acute with DWC, and I happen to be using it.
I also think it's indicative of larger design and ordering problems in
pwrctrl.

> The pcie-brcmstb driver indeed wastes time. It is not 1s but just 100ms. But
> that driver is for only one vendor; in the case of DWC, the driver has to work
> with multiple vendors. But again, I do not know how this 1s delay came up. Maybe
> we could try to reduce it to 500ms or so, but for that I need confirmation from
> someone like Lorenzo who knows the history.
> 
> > (Side note: I also just noticed pcie-tegra194.c does the same.)
> > 
> > > > My guess is that (1) is the case, and specifically that the relevant folks are
> > > > using the pcie-qcom.c, with its "global" IRQ used for link-up events.
> > > > 
> > > 
> > > We only recently added support for 'Link Up' event through 'global_irq' in the
> > > controller driver. And this was done to avoid waiting for link up during probe
> > 
> > You're kind of reinforcing my question: you don't like the waste, so
> > you're adding link-up IRQ support -- is that really the only way?
> > 
> 
> I don't know. But so far I haven't seen any other sensible way which is generic.
> 
> > (My initial thought: no, it's not. We know when pwrctrl has done its
> > thing -- why should we bother polling for link state before that? But
> > that's easier said than done, when pwrctrl is optional and highly
> > abstracted away from the DWC driver...)
> > 
> 
> Oh well... this is where you got it wrong. pwrctrl drivers are only probed just
> before enumeration by design, which is way after starting the link. As of
> v6.15-rc1, before we try to enumerate any device, we check if there is any
> device defined in DT which requires a power supply. If so, we create a platform
> device (or pwrctrl device) and let the pwrctrl driver bind to it and power up
> the device. In that case, we also do not proceed to scan the bus further and
> skip the hierarchy, because the pwrctrl driver will rescan the bus once it has
> finished powering up the device.

It sounds like you're saying "it's the way that it is, because of the
way that it is." I understand how it is currently structured, but I'm
saying that I think pwrctrl is hooked in at the wrong place. It looks
cute and clean, but it has the ordering wrong.

IMO, we should allow pwrctrl to power things up earlier, so that
controller drivers have a better chance of hitting the optimal cases
(case (3) above) properly. (That's also how every pre-pwrctrl driver
does things, and I think it's for good reason.)

That would also resolve my PERST# and other timing questions, because
the controller driver would better know when pwrctrl is finished, and so
can better handle PERST# and any necessary delays.

I agree this might be more difficult to do in a "generic" way (per your
above language), depending on your definition of generic. But IMO, it's
important to prioritize doing things correctly, even if it's slightly
less cute.

As an example of a less-cute way of doing pwrctrl: expose a wrapped version
of pci_pwrctrl_create_device() such that drivers can call it earlier. If
there is a pwrctrl device created, that means a driver should not yet
wait for link-up -- it should defer that until the relevant pwrctrl is
marked "ready". (There are likely other problems to solve in here too,
but this is just an initial sketch. And to be clear, I suspect this
doesn't fit your notion of "generic", because it requires each driver to
adapt to it.)
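
In sketch form (everything below is hypothetical -- the "early" wrapper,
the foo_* helpers, and the deferral flag don't exist today; only the
idea is what I'm proposing):

static int foo_pcie_probe(struct platform_device *pdev)
{
	struct foo_pcie *pcie = foo_pcie_alloc(pdev);	/* setup elided */
	struct platform_device *pwrctrl;

	/* Create the pwrctrl device *before* we care about the link. */
	pwrctrl = pci_pwrctrl_create_device_early(&pdev->dev, pcie->port_np);

	foo_pcie_start_ltssm(pcie);

	if (!pwrctrl)
		return foo_pcie_wait_for_link(pcie);	/* cases (1)/(2) */

	/*
	 * Case (3): the endpoint isn't powered yet, so polling now is
	 * pointless. Defer PERST#/link-wait until pwrctrl says "ready".
	 */
	pcie->link_wait_deferred = true;
	return 0;
}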

> > Regarding the controller design: frankly, I don't think my controller
> > does anything all that revolutionary in this space [0]. All of my
> > questions today can be asked (from what I can tell) of existing upstream
> > controller drivers. I'm mostly trying to understand the expected driver
> > design here, and that includes teasing apart what is "stuff done in
> > 'old' drivers, but isn't recommended", what is "currently
> > unimplemented in new stuff" (like pwrctrl [1]), and where my
> > expectations fit in between.
> > 
> > For instance, poking around a bit I come up with this question: when
> > using pci/pwrctrl, how does one ensure timing requirements around, e.g.,
> > power stability vs PERST# deassertion are met? When looking at a pwrctrl
> > driver like drivers/pci/pwrctrl/slot.c, the process looks too simple:
> > 
> > (0) host bridge probably already started its LTSSM, deasserted PERST#
> > (1) slot.c powers the slot
> > (2) pci_pwrctrl_device_set_ready() -> rescan_work_func() rescans the bus
> > 
> > Notably, there's no enforced delay between (1) and (2).
> > 
> > Reading the PCIe CEM, it seems we're violating some specification bits,
> > like:
> > 
> >   2.2. PERST# Signal
> >   [...] On power-up, the de-assertion of PERST# is delayed 100 ms
> >   (TPVPERL) from the power rails achieving specified operating limits.
> >   [...]
> > 
> > There are references to this in various implementations (e.g.,
> > tegra_pcie_enable_slot_regulators() and brcm_pcie_start_link() --
> > although I suspect the latter is applying the wrong ordering).
> > 
> > Additionally, CEM seems to suggest we have the PERST# ordering wrong. It
> > should come between (1) and (2), not at (0).
> > 
> 
> You are absolutely right! Currently, we follow the timing requirement while
> deasserting PERST# in the controller drivers. But once we power on the slot,
> we do not touch PERST# again and it just happens to work.
> 
> We may need to introduce another callback that toggles PERST# so that we can use
> it while powering up the device.
> 
> > And finally (for now), I don't understand how we have any guarantee that
> > step (2) is useful. Even if we've already started the LTSSM in (0), we
> > have no idea if the link is actually Active by the time we hit (2), and
> > so rescanning may not actually discover the device. And if that scan
> > fails ... then when do we trigger another pci_rescan_bus()? Only if the
> > implementation has a "link-up" IRQ?
> > 
> 
> As I said above, we do not enumerate the device if it has a devicetree node
> with supplies. That's why we need (2). Otherwise, the device won't be enumerated
> at all unless userspace does a rescan (which defeats the purpose of pwrctrl).

But you haven't addressed one of the concerns in my paragraph: how do we
know the link is up by the time we hit the pwrctrl-initiated
pci_rescan_bus()? We haven't gone back to ask the host bridge if it's up
yet. We're just hoping we get lucky.

IOW, the pwrctrl sequence should be something like:

 (1) power up the slot
 (1)(a) delay, per spec
 (1)(b) deassert PERST#
 (1)(c) wait for link up
 (2) rescan bus

Currently, we skip all of (1)(a)-(c). We're probably lucky that (1)(b)'s
ordering doesn't always matter, as long as we did it earlier. And we're
lucky that there are natural delays in software such that the lack of
(1)(a) and (1)(c) isn't significant.
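
As a sketch of that sequence (the ->deassert_perst() and
->wait_for_link() bridge hooks are invented here; PCIE_T_PVPERL_MS is
the existing 100ms define; and this would run only after the pwrctrl
driver has powered the slot in (1)):

static void pwrctrl_ready_work(struct work_struct *work)
{
	struct pci_pwrctrl *pwrctrl = container_of(work, struct pci_pwrctrl,
						   work);
	struct pci_host_bridge *bridge = pwrctrl_to_bridge(pwrctrl); /* lookup elided */

	msleep(PCIE_T_PVPERL_MS);		/* (1)(a): TPVPERL */
	bridge->deassert_perst(bridge);		/* (1)(b): hypothetical hook */
	bridge->wait_for_link(bridge);		/* (1)(c): hypothetical hook */

	/* (2): only now can the rescan reliably find the device */
	pci_lock_rescan_remove();
	pci_rescan_bus(bridge->bus);
	pci_unlock_rescan_remove();
}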

> > Unless I'm misunderstanding, these concerns all suggest we need some
> > host-bridge hook in between (1) and (2), and existing pwrctrl users are
> > operating somewhat outside the specification, or are relying on
> > something more subtle that I'm missing.
> > 
> 
> Yes, feel free to submit patches for toggling PERST#. Or let me know otherwise,
> I'll do it.

Per the above, I think pwrctrl has multiple ordering design problems, so
I haven't yet decided what kind of patch(es) I might submit. Feel free
to CC me if you tackle it before I do though.

Regards,
Brian
