Message-ID: <8194755.G0QQBjFxQf@senjougahara>
Date: Thu, 04 Sep 2025 09:55:58 +0900
From: Mikko Perttunen <mperttunen@...dia.com>
To: Aaron Kling <webgeek1234@...il.com>
Cc: Michael Turquette <mturquette@...libre.com>,
Stephen Boyd <sboyd@...nel.org>, Rob Herring <robh@...nel.org>,
Krzysztof Kozlowski <krzk+dt@...nel.org>, Conor Dooley <conor+dt@...nel.org>,
Thierry Reding <thierry.reding@...il.com>,
Jonathan Hunter <jonathanh@...dia.com>, Joseph Lo <josephl@...dia.com>,
Peter De Schrijver <pdeschrijver@...dia.com>,
Prashant Gaikwad <pgaikwad@...dia.com>, linux-clk@...r.kernel.org,
devicetree@...r.kernel.org, linux-tegra@...r.kernel.org,
linux-kernel@...r.kernel.org, Thierry Reding <treding@...dia.com>
Subject: Re: [PATCH 5/5] arm64: tegra: Limit max cpu frequency on P3450
On Wednesday, September 3, 2025 5:01 PM Aaron Kling wrote:
> On Wed, Sep 3, 2025 at 2:29 AM Mikko Perttunen <mperttunen@...dia.com> wrote:
> >
> > On Wednesday, September 3, 2025 3:28 PM Aaron Kling wrote:
> > > On Wed, Sep 3, 2025 at 12:50 AM Mikko Perttunen <mperttunen@...dia.com> wrote:
> > > >
> > > > On Saturday, August 16, 2025 2:53 PM Aaron Kling via B4 Relay wrote:
> > > > > From: Aaron Kling <webgeek1234@...il.com>
> > > > >
> > > > > P3450's cpu is only rated for 1.4 GHz while the CVB table it uses tries
> > > > > to scale to 1.5 GHz. Set an appropriate limit on the maximum scaling
> > > > > frequency.
> > > >
> > > > Looking at downstream, from what I can tell, the CPU's maximum frequency is indeed 1.55GHz under normal conditions. However, at temperatures over 90C, its voltage is limited to 1090mV. Reference:
> > > >
> > > > static struct dvfs_therm_limits
> > > > tegra210_core_therm_caps_ucm2[MAX_THERMAL_LIMITS] = {
> > > >         {86, 1090},
> > > >         {0, 0},
> > > > };
> > > > (rel-32 kernel-4.9/drivers/soc/tegra/tegra210-dvfs.c)
> > > >
> > > > Here the throttling is set at 86C, I suppose to give some margin.
> > > >
> > > > 1090mV perfectly matches the 1.479GHz operating point defined in the upstream kernel. So it seems to me that rather than setting a maximum frequency, we would need temperature dependent DVFS. Or, at least as a first step, we could have the driver just always limit the maximum frequency so it fits under the thermal cap voltage -- the temperature limit is rather high, after all.
> > > >
> > > > If you have other information, please do tell.
> > >
> > > I am basing this on the following line in the downstream porg dt repo:
> > >
> > > nvidia,dfll-max-freq-khz = <1479000>;
> > > (tegra-l4t-r32.7.6_good kernel-dts/tegra210-porg-p3448-common.dtsi)
> > >
> > > Which in the downstream dfll driver limits the max frequency it will use:
> > >
> > > max_freq = fcpu_data->cpu_max_freq_table[speedo_id];
> > > if (!of_property_read_u32(pdev->dev.of_node, "nvidia,dfll-max-freq-khz",
> > >                           &f))
> > >         max_freq = min(max_freq, f * 1000UL);
> > > (tegra-l4t-r32.7.6_good drivers/clk/tegra/clk-tegra124-dfll-fcpu.c)
> > >
> > > If I read the commit history correctly, it does appear that this limit
> > > was set because the always-on use case was failing thermal tests. I
> > > couldn't say if it was intentional that this throttling was applied to
> > > all use cases or not, but that is what appears to have happened. Hence
> > > trying to replicate here in an effort to squash stability issues.
> >
> > I can't see any reference to failing thermal tests. Can you point to the commit?
>
> In the porg dt repo, commit hash d1326f08, which adds the
> nvidia,dfll-max-freq-khz property, the message body states: "Set
> CPU/GPU Fmax limit for 24x7 105C UCM." I read that to mean that the
> 24x7 always-on use case model was failing to stay under 105C unless
> the cpu and gpu frequencies were limited. Is that an incorrect
> reading? 105C is kind of a crazy number anyways, beyond the soctherm
> critical shutdown temperature.
What that is (trying) to say is that it sets the CPU's Fmax to the limit specified by the 24x7 105C UCM profile, which is the 1090mV (i.e. roughly 1.4GHz) limit. The profile is called that because it's normally used for the 90C-105C temperature range.
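
As a rough illustration of how such a cap table can be read (a sketch only: struct therm_cap and therm_cap_mv() are hypothetical names, not the downstream code, and the only values taken from this thread are the {86, 1090} entry and the {0, 0} terminator):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of a thermal cap entry: trip temperature and cap. */
struct therm_cap {
	int temp_c;		/* cap applies at or above this temperature */
	unsigned int cap_mv;	/* maximum rail voltage while the cap applies */
};

/*
 * Walk a {0, 0}-terminated table and return the lowest cap whose trip
 * temperature has been reached, or 0 if no cap applies.
 */
static unsigned int therm_cap_mv(const struct therm_cap *caps, int temp_c)
{
	unsigned int cap = 0;

	assert(caps != NULL);

	for (; caps->cap_mv != 0; caps++) {
		if (temp_c >= caps->temp_c && (cap == 0 || caps->cap_mv < cap))
			cap = caps->cap_mv;
	}
	return cap;
}
```

With the single {86, 1090} entry, this returns 0 (no cap) below 86C and 1090 from 86C upwards.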
>
> > I looked into why this was added for porg -- it does not seem to be related to reliability, but more so consistency of performance. I don't think that's a huge concern for upstream -- though in any case we should be capping the frequency in the DFLL driver for now since we don't support dynamic thermal capping.
>
> So the whole conversation winds around to: The change is valid, but
> the commit message needs better justification?
In my opinion, there is no need to add the device tree property upstream. The CPU is designed to work at 1.5GHz below 90C, and at 1.4GHz between 90C and 105C. The property is a bit of a downstream-ism. If the user wants to underclock, that should be done through the cpufreq governor or a similar mechanism.
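
To illustrate the driver-side alternative discussed earlier in the thread, i.e. always limiting the maximum frequency to what fits under the thermal cap voltage, here is a minimal sketch. struct cpu_opp and max_freq_under_cap() are hypothetical names, and this is not the DFLL driver's actual code:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operating point: a frequency and the minimum voltage it needs. */
struct cpu_opp {
	unsigned long freq_khz;
	unsigned int min_mv;
};

/*
 * Return the highest frequency whose required voltage does not exceed
 * cap_mv, or 0 if no entry fits. The table does not need to be sorted.
 */
static unsigned long max_freq_under_cap(const struct cpu_opp *table, size_t n,
					unsigned int cap_mv)
{
	unsigned long best = 0;
	size_t i;

	assert(table != NULL || n == 0);

	for (i = 0; i < n; i++) {
		if (table[i].min_mv <= cap_mv && table[i].freq_khz > best)
			best = table[i].freq_khz;
	}
	return best;
}
```

With a table in which the 1479000 kHz entry needs 1090 mV, a 1090 mV cap would select 1479000 kHz without any device tree property involved.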
>
> As a side note: I'm still chasing multiple stability issues on various
> t210 devices. Though, the only one I've seen on p3450/p3541 is that
> nouveau intermittently fails to init the gpu. Just hangs on probe and
> eventually something times out, stack traces, and causes a panic
> reboot. Seems to be about a 50/50 chance for me, but works fine if
> probe succeeds. For another dev, it only works once in a blue moon,
> but still dies shortly thereafter even if probe works. I thought it
> might be related to the cpu/gpu getting 'overclocked'. But even after
> this series, the problem persists. So maybe me calling this underclock
> a stability fix is inaccurate. But stability issues still exist.
Good to know. It doesn't strike me as a CPU issue -- the first place I'd look is nouveau's init code itself, to see what is failing. There are a lot of potential software issues that can cause intermittent failures during GPU boot. If it is power related, I'd look at the GPU or SoC rail.
Thanks,
Mikko
>
> Aaron