Message-ID: <CALHNRZ894WcNaAuLFoDLwJ8mXDRM8PzdqRFzcyYUMPy+0q0nMw@mail.gmail.com>
Date: Wed, 3 Sep 2025 03:01:14 -0500
From: Aaron Kling <webgeek1234@...il.com>
To: Mikko Perttunen <mperttunen@...dia.com>
Cc: Michael Turquette <mturquette@...libre.com>, Stephen Boyd <sboyd@...nel.org>,
Rob Herring <robh@...nel.org>, Krzysztof Kozlowski <krzk+dt@...nel.org>, Conor Dooley <conor+dt@...nel.org>,
Thierry Reding <thierry.reding@...il.com>, Jonathan Hunter <jonathanh@...dia.com>,
Joseph Lo <josephl@...dia.com>, Peter De Schrijver <pdeschrijver@...dia.com>,
Prashant Gaikwad <pgaikwad@...dia.com>, linux-clk@...r.kernel.org, devicetree@...r.kernel.org,
linux-tegra@...r.kernel.org, linux-kernel@...r.kernel.org,
Thierry Reding <treding@...dia.com>
Subject: Re: [PATCH 5/5] arm64: tegra: Limit max cpu frequency on P3450
On Wed, Sep 3, 2025 at 2:29 AM Mikko Perttunen <mperttunen@...dia.com> wrote:
>
> On Wednesday, September 3, 2025 3:28 PM Aaron Kling wrote:
> > On Wed, Sep 3, 2025 at 12:50 AM Mikko Perttunen <mperttunen@...dia.com> wrote:
> > >
> > > On Saturday, August 16, 2025 2:53 PM Aaron Kling via B4 Relay wrote:
> > > > From: Aaron Kling <webgeek1234@...il.com>
> > > >
> > > > P3450's cpu is only rated for 1.4 GHz while the CVB table it uses tries
> > > > to scale to 1.5 GHz. Set an appropriate limit on the maximum scaling
> > > > frequency.
> > >
> > > Looking at downstream, from what I can tell, the CPU's maximum frequency is indeed 1.55GHz under normal conditions. However, at temperatures over 90C, its voltage is limited to 1090mV. Reference:
> > >
> > > static struct dvfs_therm_limits
> > > tegra210_core_therm_caps_ucm2[MAX_THERMAL_LIMITS] = {
> > > {86, 1090},
> > > {0, 0},
> > > };
> > > (rel-32 kernel-4.9/drivers/soc/tegra/tegra210-dvfs.c)
> > >
> > > Here the throttling is set at 86C, I suppose to give some margin.
> > >
> > > 1090mV perfectly matches the 1.479GHz operating point defined in the upstream kernel. So it seems to me that rather than setting a maximum frequency, we would need temperature dependent DVFS. Or, at least as a first step, we could have the driver just always limit the maximum frequency so it fits under the thermal cap voltage -- the temperature limit is rather high, after all.
> > >
> > > If you have other information, please do tell.
> >
> > I am basing on this line in the downstream porg dt repo:
> >
> > nvidia,dfll-max-freq-khz = <1479000>;
> > (tegra-l4t-r32.7.6_good kernel-dts/tegra210-porg-p3448-common.dtsi)
> >
> > Which in the downstream dfll driver limits the max frequency it will use:
> >
> > max_freq = fcpu_data->cpu_max_freq_table[speedo_id];
> > if (!of_property_read_u32(pdev->dev.of_node, "nvidia,dfll-max-freq-khz",
> > &f))
> > max_freq = min(max_freq, f * 1000UL);
> > (tegra-l4t-r32.7.6_good drivers/clk/tegra/clk-tegra124-dfll-fcpu.c)
> >
> > If I read the commit history correctly, it does appear that this limit
> > was set because the always-on use case was failing thermal tests. I
> > couldn't say if it was intentional that this throttling was applied to
> > all use cases or not, but that is what appears to have happened. Hence
> > trying to replicate here in an effort to squash stability issues.
>
> I can't see any reference to failing thermal tests. Can you point to the commit?
In the porg dt repo, the message body of commit d1326f08, which adds
the nvidia,dfll-max-freq-khz property, states: "Set CPU/GPU Fmax limit
for 24x7 105C UCM." I read that to mean that the 24x7 always-on use
case model was failing to stay under 105C unless the cpu and gpu
frequencies were limited. Is that an incorrect reading? 105C is kind
of a crazy number anyway, since it is beyond the soctherm critical
shutdown temperature.
> I looked into why this was added for porg -- it does not seem to be related to reliability, but more so consistency of performance. I don't think that's a huge concern for upstream -- though in any case we should be capping the frequency in the DFLL driver for now since we don't support dynamic thermal capping.
So the whole conversation comes around to: the change is valid, but
the commit message needs better justification?
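For reference, capping the frequency so it fits under the thermal cap
voltage could be sketched roughly as below. This is only an
illustration, not the upstream DFLL driver code: struct opp,
example_opps and max_freq_under_cap() are hypothetical names, and the
only table entry taken from this thread is the 1479000 kHz / 1090 mV
pair. The other two entries are placeholders, not real CVB data.

```c
#include <stddef.h>

/* One DVFS operating point: frequency in kHz, required voltage in mV. */
struct opp {
	unsigned long freq_khz;
	unsigned int volt_mv;
};

/*
 * Hypothetical table for illustration only. The 1479000 kHz / 1090 mV
 * pair is the one mentioned in this thread; the other entries are
 * placeholders, not real CVB data.
 */
static const struct opp example_opps[] = {
	{ 1224000, 1000 },	/* placeholder */
	{ 1479000, 1090 },	/* from this thread */
	{ 1555500, 1120 },	/* placeholder */
};

/*
 * Return the highest frequency whose voltage fits under cap_mv, or 0 if
 * no entry fits. The table does not need to be sorted.
 */
static unsigned long max_freq_under_cap(const struct opp *tbl, size_t n,
					unsigned int cap_mv)
{
	unsigned long best = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (tbl[i].volt_mv <= cap_mv && tbl[i].freq_khz > best)
			best = tbl[i].freq_khz;
	}

	return best;
}
```

With the 1090 mV cap discussed above, max_freq_under_cap(example_opps,
3, 1090) picks the 1479000 kHz entry, which is the same value the
downstream nvidia,dfll-max-freq-khz property hardcodes.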
As a side note: I'm still chasing multiple stability issues on various
t210 devices, though the only one I've seen on p3450/p3541 is that
nouveau intermittently fails to init the gpu. It hangs on probe until
something eventually times out, dumps stack traces, and triggers a
panic reboot. For me it's about a 50/50 chance, and everything works
fine if probe succeeds. For another dev, probe only succeeds once in a
blue moon, and even then it dies shortly thereafter. I thought it
might be related to the cpu/gpu getting 'overclocked', but the problem
persists even after this series. So maybe calling this underclock a
stability fix is inaccurate, but stability issues still exist.
Aaron