lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <aycyodp7ay5ruceoafly5xs4aopfbp7itylnvlqz7d7pkttsrr@ra466kihripk>
Date: Fri, 9 Jan 2026 12:48:26 -0600
From: Bjorn Andersson <andersson@...nel.org>
To: Stephan Gerhold <stephan.gerhold@...aro.org>
Cc: Manivannan Sadhasivam <manivannan.sadhasivam@....qualcomm.com>, 
	Konrad Dybcio <konrad.dybcio@....qualcomm.com>, Krishna Chaitanya Chundru <krishna.chundru@....qualcomm.com>, 
	Michael Turquette <mturquette@...libre.com>, Stephen Boyd <sboyd@...nel.org>, 
	Taniya Das <quic_tdas@...cinc.com>, Konrad Dybcio <konradybcio@...nel.org>, 
	Bartosz Golaszewski <brgl@...nel.org>, Shazad Hussain <quic_shazhuss@...cinc.com>, 
	Sibi Sankar <sibi.sankar@....qualcomm.com>, Bryan O'Donoghue <bryan.odonoghue@...aro.org>, 
	Melody Olvera <quic_molvera@...cinc.com>, Dmitry Baryshkov <lumag@...nel.org>, 
	Taniya Das <taniya.das@....qualcomm.com>, Dmitry Baryshkov <dmitry.baryshkov@....qualcomm.com>, 
	Imran Shaik <quic_imrashai@...cinc.com>, Abel Vesa <abelvesa@...nel.org>, linux-arm-msm@...r.kernel.org, 
	linux-clk@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Rajendra Nayak <quic_rjendra@...cinc.com>, stable@...r.kernel.org
Subject: Re: [PATCH 0/7] clk: qcom: gcc: Do not turn off PCIe GDSCs during
 gdsc_disable()

On Fri, Jan 09, 2026 at 06:33:18PM +0100, Stephan Gerhold wrote:
> On Fri, Jan 09, 2026 at 09:49:52AM -0600, Bjorn Andersson wrote:
> > On Mon, Jan 05, 2026 at 10:47:29AM +0100, Stephan Gerhold wrote:
> > > On Mon, Jan 05, 2026 at 10:44:39AM +0530, Manivannan Sadhasivam wrote:
> > > > On Fri, Jan 02, 2026 at 02:57:56PM +0100, Konrad Dybcio wrote:
> > > > > On 1/2/26 2:19 PM, Krishna Chaitanya Chundru wrote:
> > > > > > On 1/2/2026 5:09 PM, Konrad Dybcio wrote:
> > > > > >> On 1/2/26 12:36 PM, Krishna Chaitanya Chundru wrote:
> > > > > >>> On 1/2/2026 5:04 PM, Konrad Dybcio wrote:
> > > > > >>>> On 1/2/26 10:43 AM, Krishna Chaitanya Chundru wrote:
> > > > > >>>>> With PWRSTS_OFF_ON, PCIe GDSCs are turned off during gdsc_disable(). This
> > > > > >>>>> can happen during scenarios such as system suspend and breaks the resume
> > > > > >>>>> of PCIe controllers from suspend.
> > > > > >>>> Isn't turning the GDSCs off what we want though? At least during system
> > > > > >>>> suspend?
> > > > > >>> If we are keeping link in D3cold it makes sense, but currently we are not keeping in D3cold
> > > > > >>> so we don't expect them to get off.
> > > > > >> Since we seem to be tackling that in parallel, it seems to make sense
> > > > > >> that adding a mechanism to let the PCIe driver select "on" vs "ret" vs
> > > > > >> "off" could be useful for us
> > > > > > At least I am not aware of such API where we can tell genpd not to turn off gdsc
> > > > > > at runtime if we are keeping the device in D3cold state.
> > > > > > But anyway the PCIe gdsc supports Retention, in that case adding this flag here makes
> > > > > > more sense as it represents HW.
> > > > > > sm8450,sm8650 also had similar problem which are fixed by mani[1].
> > > > > 
> > > > > Perhaps I should ask for a clarification - is retention superior to
> > > > > powering the GDSC off? Does it have any power costs?
> > > > > 
> > > > 
> > > > In terms of power saving it is not superior, but that's not the only factor we
> > > > should consider here. If we keep GDSCs PWRSTS_OFF_ON, then the devices (PCIe)
> > > > need to be be in D3Cold. Sure we can change that using the new genpd API
> > > > dev_pm_genpd_rpm_always_on() dynamically, but I would prefer to avoid doing
> > > > that.
> > > > 
> > > > In my POV, GDSCs default state should be retention, so that the GDSCs will stay
> > > > ON if the rentention is not entered in hw and enter retention otherwise. This
> > > > requires no extra modification in the genpd client drivers. One more benefit is,
> > > > the hw can enter low power state even when the device is not in D3Cold state
> > > > i.e., during s2idle (provided we unvote other resources).
> > > > 
> > > 
> > > What about PCIe instances that are completely unused? The boot firmware
> > > on X1E for example is notorious for powering on completely unused PCIe
> > > links and powering them down in some half-baked off state (the &pcie3
> > > instance, in particular). I'm not sure if the GDSC remains on, but if it
> > > does then the unused PD cleanup would also only put them in retention
> > > state. I can't think of a good reason to keep those on at all.
> > > 
> > 
> > Conceptually I agree, but do we have any data indicating that there's
> > practical benefit to this complication?
> > 
> 
> No, I also suggested this only from the conceptual perspective. It would
> be interesting to test this, but unfortunately I don't have a suitable
> device for testing this anymore.
> 

I doubt that we're anywhere close to the power levels we need to measure
this with any confidence today as well...

> > > The implementation of PWRSTS_RET_ON essentially makes the PD power_off()
> > > callback a no-op. Everything in Linux (sysfs, debugfs, ...) will tell
> > > you that the power domain has been shut down, but at the end it will
> > > remain fully powered until you manage to reach a retention state for the
> > > parent power domain. Due to other consumers, that will likely happen
> > > only if you reach VDDmin or some equivalent SoC-wide low-power state,
> > > something barely any (or none?) of the platforms supported upstream is
> > > capable of today.
> > > 
> > 
> > Yes, PWRSTS_RET_ON effectively means that Linux has "dropped its vote"
> > on the GDSC and its parents. But with the caveat that we assume when
> > going to ON again some state will have been retained.
> > 
> > > PWRSTS_RET_ON is actually pretty close to setting GENPD_FLAG_ALWAYS_ON,
> > > the only advantage of PWRSTS_RET_ON I can think of is that unused GDSCs
> > > remain off iff you are lucky enough that the boot firmware has not
> > > already turned them on.
> > > 
> > 
> > Doesn't GENPD_FLAG_ALWAYS_ON imply that the parent will also be always
> > on?
> > 
> 
> It probably does, but isn't that exactly what you want? If the parent
> (or the GDSC itself) would actually *power off* (as in "pull the plug"),
> then you would still lose registers even if the GDSC remains on. The
> fact that PWRSTS_RET_ON works without keeping the parent on is probably
> just because the hardware keeps the parent domain always-on?
> 

No, that's not what we want.

As part of entering and exiting CXPC, the system will switch the
resources related to the GDSC between CX and some other rail (generally
some form of MX). In other words, the hardware will "reparent" the
resources under the hood. So we do want to leave the GDSC in a "active,
but according to software inactive" state, and we need to let go of the
votes on the parents.


We typically list e.g. CX as the parent of the GDSC, giving us the two
side effects: 1) an active GDSC casts an active vote on the CX, 2) a
performance_state vote on the GDSC will trickle up to CX.

In particular the latter allow us to describe devices with a single
power-domain, and let them through that relationship cast votes on CX.
See e.g. &usb_1_ss0 in hamoa.dtsi; the GDSC doesn't have a concept of
performance states, but to sustain the given assigned-clock-rates we
need to keep CX at NOM.

> > > IMHO, for GDSCs that support OFF state in the hardware, PWRSTS_RET_ON is
> > > a hack to workaround limitations in the consumer drivers. They should
> > > either save/restore registers and handle the power collapse or they
> > > should vote for the power domain to stay on. That way, sysfs/debugfs
> > > will show the real votes held by Linux and you won't be mislead when
> > > looking at those while trying to optimize power consumption.
> > > 
> > 
> > No, it's not working around limitations in the consumer drivers.
> > 
> > It does work around a limitation in the API, in that the consumer
> > drivers can't indicate in which cases they would be willing to restore
> > and in which cases they would prefer retention. This is something the
> > downstream solution has had, but we don't have a sensible and generic
> > way to provide this.
> 
> I might be missing something obvious, but mapping this to the existing
> pmdomain API feels pretty straightforward to me:
> 
>  - Power on/power off means "pull the plug", i.e. if you vote for a
>    pmdomain to power off you should expect that registers get lost.
>    That's exactly what will typically happen if the hardware actually
>    removes power completely from the power domain.
> 

I think you should rather see it as a question of does the requester
need the power to be on. In many/most cases you have some form of logic
or reference counting between the consumer and the actual hardware
switch.

So a client doesn't vote for a resource to be turned off, it rather
removes its vote for it to be on. The layers below will then aggregate
votes, implement sleep timeouts etc, and might turn off the power.

>  - If you want to preserve registers (retention), you need to tell the
>    hardware to keep the pmdomain powered on at a minimum voltage
>    (= performance state). In fact, the PCIe GDSC already inherits
>    support for RPMH_REGULATOR_LEVEL_RETENTION from its parent domain.
>    (If RPMH_REGULATOR_LEVEL_RETENTION happens to be higher than the
>     rentention state we are talking about here you could also just vote
>     for 0 performance state...)
> 

This world view does makes sense, but instead of keeping CX as a
"retention" level, the Qualcomm platform retain state through some other
power rail, allowing CX to collapse fully - bringing with it all those
parts that hasn't been built or configured to switch to retention power.

> With this, the only additional feature you need from the pmdomain API is
> to disable its sometimes inconvenient feature to automatically disable
> all pmdomains during system suspend (independent of the votes made by
> drivers). I believe this exists already in different forms. Back when
> I needed something like this for cpufreq on MSM8909, Ulf suggested using
> device_set_awake_path(), see commit d6048a19a710 ("cpufreq: qcom-nvmem:
> Preserve PM domain votes in system suspend"). I'm not entirely up to
> date what is the best way currently to do this, but letting a driver
> preserve its votes across system suspend feels like a common enough
> requirement that should be supported by the pmdomain API.
> 

In the downstream Qualcomm kernels you can find various mechanism for
client drivers to dynamically change this behavior for both clocks and
GDSCs. There has been "corner cases" through history where in certain
use cases you want retention and in other don't.

No such APIs exists in the PM or clock APIs.

> > 
> > Keeping GDSCs in retention is a huge gain when it comes to the time it
> > takes to resume the system after being in low power. PCIe is a good
> > example of this, where the GDSC certainly support entering OFF, at the
> > cost of tearing link and all down.
> > 
> 
> I don't doubt that. My point is that the PCIe driver should make that
> decision and not the (semi-)generic power domain driver that does not
> exactly know who (or if anyone) is going to consume its power domain.
> Especially because this decision is encoded in SoC-specific data and we
> had plenty of patches already changing PWRSTS_OFF_ON to PWRSTS_RET_ON
> due to suspend issues initially unnoticed on some SoCs (or vice-versa to
> save power).
> 

That idea that the responsibility for making such decisions would better
lie with the client I agree with.

But on the other hand, extending the APIs to allow such fine grained
tweaks makes the system more integrated and complex. Conceptually I
would prefer us moving towards the Linux power model, with individual
devices having an abstract power state, rather than sprinkling tweaks
throughout.


Ultimately, I don't think this it's worth exposing such controls. We're
orders of magnitude away from decent power consumption in sleep, and we
still can't boot the system reliably without clk_ignore_unused. Once
we're past that, I'd be happy to be proven wrong by some measurements.

Regards,
Bjorn

> Thanks,
> Stephan

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ