Message-ID: <CAPDyKFr3WMZQxFgzn7E7mOtecu-mnfoZ-D051pgGhPV5Eeb5BQ@mail.gmail.com>
Date: Mon, 4 Dec 2023 11:59:15 +0100
From: Ulf Hansson <ulf.hansson@...aro.org>
To: Stephan Gerhold <stephan@...hold.net>
Cc: Stephan Gerhold <stephan.gerhold@...nkonzept.com>,
Viresh Kumar <viresh.kumar@...aro.org>,
Andy Gross <agross@...nel.org>,
Bjorn Andersson <andersson@...nel.org>,
Konrad Dybcio <konrad.dybcio@...aro.org>,
Ilia Lin <ilia.lin@...nel.org>,
"Rafael J. Wysocki" <rafael@...nel.org>,
Rob Herring <robh+dt@...nel.org>,
Krzysztof Kozlowski <krzysztof.kozlowski+dt@...aro.org>,
Conor Dooley <conor+dt@...nel.org>, linux-pm@...r.kernel.org,
linux-arm-msm@...r.kernel.org, linux-kernel@...r.kernel.org,
devicetree@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH 1/4] cpufreq: qcom-nvmem: Handling multiple power domains
On Tue, 28 Nov 2023 at 13:42, Stephan Gerhold <stephan@...hold.net> wrote:
>
> Hi Uffe,
>
> On Mon, Oct 16, 2023 at 04:47:52PM +0200, Ulf Hansson wrote:
> > [...]
> > > > > - MSM8916 (CPR+RPMPD):
> > > > > https://github.com/msm8916-mainline/linux/commit/8880f39108206d7a60a0a8351c0373bddf58657c
> > > >
> > > > This looks a bit odd to me. Does a CPU really have four different
> > > > power-domains, where three of them are performance-domains?
> > > >
> > >
> > > Good question. I think we're largely entering "uncharted territory" with
> > > these questions, so I can only try to answer them as best I can from the
> > > limited documentation and knowledge I have. :)
> > >
> > > The CPU does indeed use four different power domains. There also seem to
> > > be additional power switches that gate power for some components without
> > > having to turn off the entire supply.
> > >
> > > I'll list them twice from two points of view: Once mapping component ->
> > > power domain, then again showing each power domain separately to make it
> > > more clear. At the end I also want to make clear that MSM8909 (with the
> > > "single" power domain) is actually exactly the same SoC design, just
> > > with different regulators supplying the power domains.
> > >
> > > It's totally fine if you just skim over it. I'm listing it in detail
> > > also as reference for myself. :D
> > >
> > > # Components
> > > - SoC
> > >   - CPU subsystem ("APPS")
> > >     - CPU cluster
> > >       - 4x CPU core (logic and L1 cache) -> VDD_APC
> > >       - Shared L2 cache
> > >         - Logic -> VDD_APC
> > >         - Memory -> VDD_MX
> > >     - CPU clock controller (logic) -> VDD_CX
> > >       - Provides CPU frequency from different clock sources
> > >       - L2 cache runs at 1/2 of CPU frequency
> > >       => Both VDD_APC and VDD_MX must be scaled based on frequency
> > >     - CPU PLL clock source
> > >       - Generates the higher (GHz) CPU frequencies
> > >       - Logic (?, unsure) -> VDD_CX
> > >       - ??? -> VDD_SR2_APPS_PLL
> > >       => VDD_CX must be scaled based on PLL frequency
> > >
> > > # Power Domains
> > > ## VDD_APC
> > > - dedicated for CPU
> > > - powered off completely in deepest cluster cpuidle state
> > >
> > > - per-core power switch (per-core cpuidle)
> > > - CPU logic
> > > - L1 cache controller/logic and maybe memory(?, unsure)
> > > - shared L2 cache controller/logic
> > >
> > > => must be scaled based on CPU frequency
> > >
> > > ## VDD_MX
> > > - global SoC power domain for "on-chip memories"
> > > - always on, reduced to minimal voltage when entire SoC is idle
> > >
> > > - power switch (controlled by deepest cluster cpuidle state?, unsure)
> > > - L2 cache memory
> > >
> > > => must be scaled based on L2 frequency (=> 1/2 CPU frequency)
> > >
> > > ## VDD_CX
> > > - global SoC power domain for "digital logic"
> > > - always on, reduced to minimal voltage when entire SoC is idle
> > > - voting for VDD_CX in the RPM firmware also affects VDD_MX performance
> > > state (firmware implicitly sets VDD_MX >= VDD_CX)
> > >
> > > - CPU clock controller logic, CPU PLL logic(?, unsure)
> > >
> > > => must be scaled based on CPU PLL frequency
> > >
> > > ## VDD_SR2_APPS_PLL
> > > - global SoC power domain for CPU clock PLLs
> > > - on MSM8916: always on with constant voltage
> > >
> > > => ignored in Linux at the moment
> > >
> > > # Power Domain Regulators
> > > These power domains are literally input pins on the SoC chip. In theory
> > > one could connect any suitable regulator to each of those. In practice
> > > there are just a couple of standard reference designs that everyone
> > > uses:
> > >
> > > ## MSM8916 (SoC) + PM8916 (PMIC)
> > > We need to scale 3 power domains together with cpufreq:
> > >
> > > - VDD_APC (CPU logic) = &pm8916_spmi_s2 (via CPR)
> > > - VDD_MX (L2 memory) = &pm8916_l3 (via RPMPD: MSM8916_VDDMX)
> > > - VDD_CX (CPU PLL) = &pm8916_s1 (via RPMPD: MSM8916_VDDCX)
> > >
> > > ## MSM8909 (SoC) + PM8909 (PMIC)
> > > We need to scale 1 power domain together with cpufreq:
> > >
> > > - VDD_APC = VDD_CX = &pm8909_s1 (via RPMPD: MSM8909_VDDCX)
> > > (CPU logic, L2 logic and CPU PLL)
> > > (- VDD_MX (L2 memory) = &pm8909_l3 (RPM firmware enforces VDD_MX >= VDD_CX))
> > >
> > > There is implicit magic in the RPM firmware here that saves us from
> > > scaling VDD_MX. VDD_CX/APC are the same power rail.
> > >
> > > ## MSM8909 (SoC) + PM8916 (PMIC)
> > > When MSM8909 is paired with PM8916 instead of PM8909, the setup is
> > > identical to MSM8916+PM8916. We need to scale 3 power domains.
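(Side note for anyone following the thread: "scaling 3 power domains together
with cpufreq" means the cpufreq driver first has to attach each of the CPU's
power domains explicitly before it can scale them. Below is a rough,
hypothetical sketch of that attach step; the domain names and function name
are made up and this is not the actual qcom-nvmem cpufreq code. How the
performance states are then scaled depends on how the OPP tables and
required-opps are wired up, as discussed further down.)

#include <linux/device.h>
#include <linux/err.h>
#include <linux/kernel.h>
#include <linux/pm_domain.h>

/* Hypothetical domain names, just for illustration */
static const char * const example_pd_names[] = { "perf", "mx", "cx" };

static int example_attach_cpu_domains(struct device *cpu_dev)
{
        struct device *virt_dev;
        int i;

        for (i = 0; i < ARRAY_SIZE(example_pd_names); i++) {
                virt_dev = dev_pm_domain_attach_by_name(cpu_dev,
                                                        example_pd_names[i]);
                if (IS_ERR_OR_NULL(virt_dev))
                        return virt_dev ? PTR_ERR(virt_dev) : -ENODEV;

                /* Keep each domain powered while the CPU device is active */
                if (!device_link_add(cpu_dev, virt_dev, DL_FLAG_STATELESS |
                                     DL_FLAG_PM_RUNTIME | DL_FLAG_RPM_ACTIVE))
                        return -ENODEV;
        }

        return 0;
}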
> > >
> > > > In a way it sounds like an option could be to hook up the cpr to the
> > > > rpmpd:s instead (possibly even set it as a child-domains to the
> > > > rpmpd:s), assuming that is a better description of the HW, which it
> > > > may not be, of course.
> > >
> > > Hm. It's definitely an option. I must admit I haven't really looked
> > > much at child-domains so far, so off the top of my head I'm not sure about
> > > the implications, for both the abstract hardware description and
> > > the implementation.
> > >
> > > There does indeed seem to be some kind of relation between MX <=> CX/APC:
> > >
> > > - When voting for CX in the RPM firmware, it will always implicitly
> > > adjust the MX performance state to be MX >= CX.
> > >
> > > - When scaling APC up, we must increase MX before APC.
> > > - When scaling APC down, we must decrease MX after APC.
> > > => Clearly MX >= APC. Not in terms of raw voltage, but at least for the
> > > abstract performance state.
> > >
> > > Is this some kind of parent-child relationship between MX <=> CX and
> > > MX <=> APC?
> >
> > Thanks for sharing the above. Yes, to me, it looks like there is a
> > parent/child-domain relationship that could be worth describing/using.
> >
> > >
> > > If yes, maybe we could indeed bind MX to the CPR genpd somehow. They use
> > > different performance state numbering, so we need some kind of
> > > translation. I'm not entirely sure how that would be described.
> >
> > Both the power-domain and the required-opps DT bindings
> > (Documentation/devicetree/bindings/opp/opp-v2-base.yaml) already allow
> > us to describe these kinds of hierarchical
> > dependencies/layouts.
> >
> > In other words, to scale performance for a child domain, the child may
> > rely on the parent domain's performance being scaled too. This is
> > already supported by genpd and through the opp library - so it should
> > just work. :-)
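To illustrate what that means from the consumer's point of view: once the
child/parent layout and the required-opps are described in DT, the consumer
only requests a performance state for its own domain and genpd takes care of
translating and propagating it to the parent. A minimal sketch (illustrative
only, not code from this series):

#include <linux/pm_domain.h>

static int example_set_child_state(struct device *dev, unsigned int state)
{
        /* genpd raises/lowers the matching parent state internally */
        return dev_pm_genpd_set_performance_state(dev, state);
}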
> >
>
> I'm getting back to the "multiple power domains" case of MSM8916 now, as
> discussed above. I've tried modelling MX as parent genpd of CPR, to
> avoid having to scale multiple power domains as part of cpufreq.
>
> Basically, it looks like the following:
>
> cpr: power-controller@...8000 {
>         compatible = "qcom,msm8916-cpr", "qcom,cpr";
>         reg = <0x0b018000 0x1000>;
>         /* ... */
>         #power-domain-cells = <0>;
>         operating-points-v2 = <&cpr_opp_table>;
>         /* Supposed to be parent domain, not consumer */
>         power-domains = <&rpmpd MSM8916_VDDMX_AO>;
>
>         cpr_opp_table: opp-table {
>                 compatible = "operating-points-v2-qcom-level";
>
>                 cpr_opp1: opp1 {
>                         opp-level = <1>;
>                         qcom,opp-fuse-level = <1>;
>                         required-opps = <&rpmpd_opp_svs_soc>;
>                 };
>                 cpr_opp2: opp2 {
>                         opp-level = <2>;
>                         qcom,opp-fuse-level = <2>;
>                         required-opps = <&rpmpd_opp_nom>;
>                 };
>                 cpr_opp3: opp3 {
>                         opp-level = <3>;
>                         qcom,opp-fuse-level = <3>;
>                         required-opps = <&rpmpd_opp_super_turbo>;
>                 };
>         };
> };
>
> As already discussed [1] it's a bit annoying that the genpd core
> attaches the power domain as a consumer by default, but I work around this
> by calling of_genpd_add_subdomain() followed by dev_pm_domain_detach()
> in the CPR driver.
Yep, that seems reasonable to me.
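For other readers following along, that workaround could look roughly like
the sketch below (illustrative only, not the actual CPR driver code; the
function name is made up and error handling is simplified). The idea is to
parse the "power-domains" phandle, register it as a parent domain of the CPR
genpd, and then drop the consumer attachment that the core made at probe time:

#include <linux/of.h>
#include <linux/platform_device.h>
#include <linux/pm_domain.h>

static int cpr_link_parent_domain(struct platform_device *pdev)
{
        struct device_node *np = pdev->dev.of_node;
        struct of_phandle_args child = { .np = np };
        struct of_phandle_args parent;
        int ret;

        /* The domain referenced by "power-domains" is meant as a parent */
        ret = of_parse_phandle_with_args(np, "power-domains",
                                         "#power-domain-cells", 0, &parent);
        if (ret)
                return ret;

        /* CPR itself has #power-domain-cells = <0>, so no args needed */
        ret = of_genpd_add_subdomain(&parent, &child);
        of_node_put(parent.np);
        if (ret)
                return ret;

        /* Undo the automatic consumer attachment done by the genpd core */
        dev_pm_domain_detach(&pdev->dev, false);

        return 0;
}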
>
> The actual scaling works fine: performance states of the MX power domain
> are updated when the CPR performance state changes. I added some debug
> prints and it
> looks e.g. as follows (CPR is the power-controller@):
>
> [ 24.498218] PM: mx_ao set performance state 6
> [ 24.498788] PM: power-controller@...8000 set performance state 3
> [ 24.511025] PM: mx_ao set performance state 3
> [ 24.511526] PM: power-controller@...8000 set performance state 1
> [ 24.521189] PM: mx_ao set performance state 4
> [ 24.521660] PM: power-controller@...8000 set performance state 2
> [ 24.533183] PM: mx_ao set performance state 6
> [ 24.533535] PM: power-controller@...8000 set performance state 3
>
> There is one remaining problem here: Consider e.g. the switch from CPR
> performance state 3 -> 1. Right now the parent genpd state is set *before*
> the child genpd state, no matter whether we scale up or down. When scaling
> down, the parent genpd state must be reduced *after* the child genpd.
> Otherwise, we can't guarantee that the parent genpd state is always >= the
> child state.
Good point!
>
> In the OPP core, the order of such operations is always chosen based on
> whether we are scaling up or down. When scaling up, power domain states
> are set before the frequency is changed, and the other way around for
> scaling down.
>
> Is this something you could imagine changing in the GENPD core, either
> unconditionally for everyone, or as an option?
This sounds like a generic problem that we need to fix for genpd. So
for everyone.
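(For reference, the ordering in the OPP core that Stephan describes above
looks roughly like the sketch below. This is simplified, illustrative code,
not the actual OPP core implementation; scale_domains() and set_rate() are
made-up stubs standing in for the real power-domain and clock calls.)

#include <linux/device.h>

/* Stubs standing in for the real power-domain and clock operations */
static int scale_domains(struct device *dev, unsigned long freq) { return 0; }
static int set_rate(struct device *dev, unsigned long freq) { return 0; }

static int set_opp_sketch(struct device *dev,
                          unsigned long old_freq, unsigned long new_freq)
{
        int ret;

        if (new_freq > old_freq) {
                /* Scaling up: raise the power domain states first ... */
                ret = scale_domains(dev, new_freq);
                if (ret)
                        return ret;
                /* ... then raise the clock rate. */
                return set_rate(dev, new_freq);
        }

        /* Scaling down: lower the clock rate first ... */
        ret = set_rate(dev, new_freq);
        if (ret)
                return ret;
        /* ... then drop the power domain states. */
        return scale_domains(dev, new_freq);
}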
>
> I tried to hack this in for a quick test and came up with the following
> (the diff is unreadable so I'll just post the entire changed
> _genpd_set_performance_state() function). Admittedly it's a bit ugly.
>
> With these changes the sequence from above looks more like:
>
> [ 22.374555] PM: mx_ao set performance state 6
> [ 22.375175] PM: power-controller@...8000 set performance state 3
> [ 22.424661] PM: power-controller@...8000 set performance state 1
> [ 22.425169] PM: mx_ao set performance state 3
> [ 22.434932] PM: mx_ao set performance state 4
> [ 22.435331] PM: power-controller@...8000 set performance state 2
> [ 22.461197] PM: mx_ao set performance state 6
> [ 22.461968] PM: power-controller@...8000 set performance state 3
>
> Which is correct now.
>
> Let me know if you have any thoughts about this. :-)
Makes sense! Please post the below as a formal patch so I can review
and test it!
Kind regards
Uffe
>
> Thanks for taking the time to discuss this!
> Stephan
>
> [1]: https://lore.kernel.org/linux-pm/CAPDyKFq+zsoeF-4h5TfT4Z+S46a501_pUq8y2c1x==Tt6EKBGA@mail.gmail.com/
>
> static int _genpd_set_performance_state(struct generic_pm_domain *genpd,
>                                         unsigned int state, int depth);
>
> static void _genpd_rollback_parent_state(struct gpd_link *link, int depth)
> {
>         struct generic_pm_domain *parent = link->parent;
>         int parent_state;
>
>         genpd_lock_nested(parent, depth + 1);
>
>         parent_state = link->prev_performance_state;
>         link->performance_state = parent_state;
>
>         parent_state = _genpd_reeval_performance_state(parent, parent_state);
>         if (_genpd_set_performance_state(parent, parent_state, depth + 1)) {
>                 pr_err("%s: Failed to roll back to %d performance state\n",
>                        parent->name, parent_state);
>         }
>
>         genpd_unlock(parent);
> }
>
> static int _genpd_set_parent_state(struct generic_pm_domain *genpd,
>                                    struct gpd_link *link,
>                                    unsigned int state, int depth)
> {
>         struct generic_pm_domain *parent = link->parent;
>         int parent_state, ret;
>
>         /* Find parent's performance state */
>         ret = genpd_xlate_performance_state(genpd, parent, state);
>         if (unlikely(ret < 0))
>                 return ret;
>
>         parent_state = ret;
>
>         genpd_lock_nested(parent, depth + 1);
>
>         link->prev_performance_state = link->performance_state;
>         link->performance_state = parent_state;
>         parent_state = _genpd_reeval_performance_state(parent,
>                                                        parent_state);
>         ret = _genpd_set_performance_state(parent, parent_state, depth + 1);
>         if (ret)
>                 link->performance_state = link->prev_performance_state;
>
>         genpd_unlock(parent);
>
>         return ret;
> }
>
> static int _genpd_set_performance_state(struct generic_pm_domain *genpd,
>                                         unsigned int state, int depth)
> {
>         struct gpd_link *link = NULL;
>         int ret;
>
>         if (state == genpd->performance_state)
>                 return 0;
>
>         /* When scaling up, propagate to parents first in normal order */
>         if (state > genpd->performance_state) {
>                 list_for_each_entry(link, &genpd->child_links, child_node) {
>                         ret = _genpd_set_parent_state(genpd, link, state, depth);
>                         if (ret)
>                                 goto rollback_parents_up;
>                 }
>         }
>
>         if (genpd->set_performance_state) {
>                 pr_err("%s set performance state %d\n", genpd->name, state);
>                 ret = genpd->set_performance_state(genpd, state);
>                 if (ret) {
>                         if (link)
>                                 goto rollback_parents_up;
>                         return ret;
>                 }
>         }
>
>         /* When scaling down, propagate to parents after in reverse order */
>         if (state < genpd->performance_state) {
>                 list_for_each_entry_reverse(link, &genpd->child_links, child_node) {
>                         ret = _genpd_set_parent_state(genpd, link, state, depth);
>                         if (ret)
>                                 goto rollback_parents_down;
>                 }
>         }
>
>         genpd->performance_state = state;
>         return 0;
>
> rollback_parents_up:
>         list_for_each_entry_continue_reverse(link, &genpd->child_links, child_node)
>                 _genpd_rollback_parent_state(link, depth);
>         return ret;
> rollback_parents_down:
>         list_for_each_entry_continue(link, &genpd->child_links, child_node)
>                 _genpd_rollback_parent_state(link, depth);
>         return ret;
> }
>