linux-kernel - Re: [PATCH v4 3/5] memory: tegra186-emc: Support non-bpmp icc scaling

Message-ID: <CALHNRZ9y0n6JNfeDUQgZoECkxo+We0_G8TP0H4advcSqrX86kg@mail.gmail.com>
Date: Tue, 9 Dec 2025 23:06:35 -0600
From: Aaron Kling <webgeek1234@...il.com>
To: Jon Hunter <jonathanh@...dia.com>
Cc: Krzysztof Kozlowski <krzk@...nel.org>, Rob Herring <robh@...nel.org>, Conor Dooley <conor+dt@...nel.org>, 
	Thierry Reding <thierry.reding@...il.com>, linux-kernel@...r.kernel.org, 
	devicetree@...r.kernel.org, linux-tegra@...r.kernel.org
Subject: Re: [PATCH v4 3/5] memory: tegra186-emc: Support non-bpmp icc scaling

On Tue, Dec 9, 2025 at 10:08 PM Jon Hunter <jonathanh@...dia.com> wrote:
>
>
> On 21/11/2025 18:17, Aaron Kling wrote:
> > On Fri, Nov 21, 2025 at 5:21 AM Jon Hunter <jonathanh@...dia.com> wrote:
> >>
> >>
> >> On 12/11/2025 07:21, Aaron Kling wrote:
> >>> On Wed, Nov 12, 2025 at 12:18 AM Jon Hunter <jonathanh@...dia.com> wrote:
> >>>>
> >>>>
> >>>> On 11/11/2025 23:17, Aaron Kling wrote:
> >>>>
> >>>> ...
> >>>>
> >>>>> Alright, I think I've got the picture of what's going on now. The
> >>>>> standard arm64 defconfig enables the t194 pcie driver as a module. And
> >>>>> my simple busybox ramdisk that I use for mainline regression testing
> >>>>> isn't loading any modules. If I set the pcie driver to built-in, I
> >>>>> replicate the issue. And I don't see the issue on my normal use case,
> >>>>> because I have the dt changes as well.
> >>>>>
> >>>>> So it appears that the pcie driver submits icc bandwidth. And without
> >>>>> cpufreq submitting bandwidth as well, the emc driver gets a very low
> >>>>> number and thus sets a very low emc freq. The question becomes... what
> >>>>> to do about it? If the related dt changes were submitted to
> >>>>> linux-next, everything should fall into place. And I'm not sure where
> >>>>> this falls on the severity scale since it doesn't full out break boot
> >>>>> or prevent operation.
> >>>>
> >>>> Where are the related DT changes? If we can get these into -next and
> >>>> lined up to be merged for v6.19, then that is fine. However, we should
> >>>> not merge this for v6.19 without the DT changes.
> >>>
> >>> The dt changes are here [0].
> >>
> >> To confirm, applying the DT changes do not fix this for me. Thierry is
> >> having a look at this to see if there is a way to fix this.
> >>
> >> BTW, I have also noticed that Thierry's memory frequency test [0] is
> >> also failing on Tegra186. The test simply tries to set the frequency via
> >> the sysfs and this is now failing. I am seeing ...
> >>
> >> memory: emc: - available rates: (* = current)
> >> memory: emc:   -   40800000
> >> memory: emc:   -   68000000
> >> memory: emc:   -  102000000
> >> memory: emc:   -  204000000
> >> memory: emc:   -  408000000
> >> memory: emc:   -  665600000
> >> memory: emc:   -  800000000
> >> memory: emc:   - 1062400000
> >> memory: emc:   - 1331200000
> >> memory: emc:   - 1600000000
> >> memory: emc:   - 1866000000 *
> >> memory: emc: - testing:
> >> memory: emc:   -   40800000...OSError: [Errno 34] Numerical result out
> >> of range
> >
> > Question. Does this test run and pass on jetson-tk1? I based the
> > tegra210 and tegra186 [0] code on tegra124 [1]. And I don't see a
> > difference in the flow now. What appears to be happening is that icc
> > is reporting a high bandwidth, setting the emc min_freq to something
> > like 1600MHz. Then debugfs is having max_freq set to something low
> > like 40.8MHz. Then the linked code block fails because the higher of
> > the min_freqs is greater than the lower of the max_freqs. But if this
> > same test is run on jetson-tk1, I don't see how it passes. Unless
> > maybe the t124 actmon is consistently setting min freqs during the
> > tests.
>
> So we don't currently run this test on Tegra124. We could certainly try.
> I don't recall if there was an issue that prevented us from doing so now.
>
> > An argument could be made that any attempt to set debugfs should win a
> > conflict with icc. That could be done. But if that needs done here,
> > I'd argue that it needs replicated across all other applicable emc
> > drivers too.
>
> The bottom line is that we cannot regress anything that was working before.

Let me try to enumerate the potential issues I've seen stated here. If
I'm missing anything, please fill in the blanks.

1) If this change is applied without the related dt change and the
pcie driver is loaded, the emc clock can become stuck at the lowest
rate. This happens because the pcie driver provides icc data while
nothing else does, so the very low requested bandwidth results in the
emc clock being set very low. I'm not sure there is a 'fix' for this,
beyond making sure the dt change is merged so that the cpufreq
driver provides bandwidth info, causing the emc driver to select a
more reasonable emc clock rate. This is a similar situation to what's
currently blocking the tegra210 actmon series. I don't think there is
a way for the drivers to know if icc data is missing/wrong. The
scaling is doing exactly what it's told based on the icc routing given
in the dt.

2) Jon, you report that even with both this change and the related dt
change, the issue is still not fixed. But you then posted a log
showing that the emc rate is set to max. If the issue is that the emc
rate is too low, then how can debugfs report that the rate is max? For
reference, everything scales as expected for me given this change plus
the dt change on both p2771 and p3636+p3509.

3) If icc is requesting enough bandwidth to set the emc clock to a
high value, and a user then tries to set debugfs max_freq to a lower
value, this code will reject the change. I do not believe this is an
issue unique to this code. tegra20-emc, tegra30-emc, and tegra124-emc
all have this same flow. And so does my proposed change to
tegra210-emc-core in the actmon series. This is why I asked if
tegra124 ran this test, to see if the failure was unique. If this is
not a unique failure, then I'd argue that all instances need to be
changed, not just this one, which would cause diverging results
depending on the soc being utilized. A lot of the work I'm doing is to
try to bring unity and feature parity to all the tegra socs I'm
working on. I don't want to cause even more divergence.
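
For illustration only, points 1 and 3 can be modeled with a toy sketch.
This is not the driver code. The class name, the two requester names
("icc" and "debugfs"), and the 10 MHz and 1600 MHz icc floors are
assumptions made for the example; the rate table is copied from the
test log quoted above. The model assumes each requester holds a
(min, max) window, the effective floor is the highest min, the
effective ceiling is the lowest max, and a request is rejected with
ERANGE when the floor ends up above the ceiling.

```python
import errno

# Rates in Hz, copied from the test log quoted earlier in this thread.
AVAILABLE_RATES_HZ = [
    40800000, 68000000, 102000000, 204000000, 408000000, 665600000,
    800000000, 1062400000, 1331200000, 1600000000, 1866000000,
]


class EmcRateModel:
    """Toy model of two requesters arbitrating one clock rate (hypothetical)."""

    def __init__(self, rates_hz):
        self.rates = sorted(rates_hz)
        lo, hi = self.rates[0], self.rates[-1]
        # Every requester starts with the widest possible window.
        self.requests = {"icc": (lo, hi), "debugfs": (lo, hi)}
        self.current = hi

    def request(self, who, min_hz, max_hz):
        """Return 0 on success, or -ERANGE if the windows cannot overlap."""
        trial = dict(self.requests)
        trial[who] = (min_hz, max_hz)
        floor = max(lo for lo, _ in trial.values())
        ceiling = min(hi for _, hi in trial.values())
        if floor > ceiling:
            # Highest min is above lowest max: reject, keep the old state.
            return -errno.ERANGE
        self.requests = trial
        # Run at the lowest table rate that still satisfies the floor.
        self.current = next((r for r in self.rates if r >= floor),
                            self.rates[-1])
        return 0


emc = EmcRateModel(AVAILABLE_RATES_HZ)

# Point 1: only a small icc request exists, so the lowest rate is chosen.
emc.request("icc", 10_000_000, 1_866_000_000)
print(emc.current)  # 40800000

# Point 3: icc raises the floor, then debugfs tries to cap below it.
emc.request("icc", 1_600_000_000, 1_866_000_000)
ret = emc.request("debugfs", 40_800_000, 40_800_000)
print(ret == -errno.ERANGE, emc.current)  # True 1600000000
```

Under these assumptions the rejected debugfs write matches the
"Numerical result out of range" error in the test log, and the model
shows why letting debugfs win would mean changing how the floor and
ceiling are combined rather than changing any single requester.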

What actions need to be taken for which issue?

Aaron