linux-kernel - Re: [PATCH v4 3/5] memory: tegra186-emc: Support non-bpmp icc scaling


Message-ID: <f906f85f-b110-4328-b177-02fcdf7ffe53@nvidia.com>
Date: Wed, 10 Dec 2025 15:03:50 +0000
From: Jon Hunter <jonathanh@...dia.com>
To: Aaron Kling <webgeek1234@...il.com>
Cc: Krzysztof Kozlowski <krzk@...nel.org>, Rob Herring <robh@...nel.org>,
 Conor Dooley <conor+dt@...nel.org>, Thierry Reding
 <thierry.reding@...il.com>, linux-kernel@...r.kernel.org,
 devicetree@...r.kernel.org, linux-tegra@...r.kernel.org
Subject: Re: [PATCH v4 3/5] memory: tegra186-emc: Support non-bpmp icc scaling


On 10/12/2025 05:06, Aaron Kling wrote:

...

> Let me try to iterate the potential issues I've seen stated here. If
> I'm missing anything, please fill in the blanks.
> 
> 1) If this change is applied without the related dt change and the
> pcie driver is loaded, the emc clock can become stuck at the lowest
> rate. This is caused by the pcie driver providing icc data, but
> nothing else is. So the very low requested bandwidth results in the
> emc clock being set very low. I'm not sure there is a 'fix' for this,
> beyond making sure the dt change is merged to ensure that the cpufreq
> driver provides bandwidth info, causing the emc driver to select a
> more reasonable emc clock rate. This is a similar situation to what's
> currently blocking the tegra210 actmon series. I don't think there is
> a way for the drivers to know if icc data is missing/wrong. The
> scaling is doing exactly what it's told based on the icc routing given
> in the dt.

So this is the fundamental issue that must be fixed. We can't allow the
PCIe driver to slow the system down. I think that Krzysztof suggested we
need some way to determine whether the necessary ICC clients are
present/registered before ICC scaling takes effect. Admittedly, I have no
idea if there is a simple way to do this, but we need something like that.
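
As a rough illustration of the failure mode in #1 and of the kind of guard
being asked for, here is a minimal standalone model. It is not kernel code
and does not reflect the ICC framework API; the rate table, the
bytes-per-cycle factor, the client names and both function names are
hypothetical and chosen only to make the arithmetic visible.

```python
# Standalone model, not kernel code. All names and numbers are hypothetical.
EMC_RATES_HZ = [40_800_000, 408_000_000, 1_866_000_000]  # example table only


def pick_rate(requests_bps, bytes_per_cycle=16):
    """Return the lowest table rate that covers the summed bandwidth votes."""
    needed_hz = sum(requests_bps.values()) / bytes_per_cycle
    for rate in EMC_RATES_HZ:
        if rate >= needed_hz:
            return rate
    return EMC_RATES_HZ[-1]


def pick_rate_guarded(requests_bps, required_clients):
    """Hold the clock at the maximum until every required client has voted."""
    if not set(required_clients).issubset(requests_bps):
        # A required client (e.g. cpufreq) is missing, so the sum is not
        # trustworthy yet; do not scale down based on partial data.
        return EMC_RATES_HZ[-1]
    return pick_rate(requests_bps)
```

With only a small PCIe vote present, pick_rate() lands on the lowest table
entry, which is the "stuck at the lowest rate" case. The guarded variant
keeps the maximum rate until the assumed set of required clients is
complete. Whether such a set can be derived cleanly from the dt is exactly
the open question.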

> 2) Jon, you report that even with both this change and the related dt
> change, the issue is still not fixed. But then you posted a log
> showing that the emc rate is set to max. If the issue is that emc rate
> is too low, then how can debugfs report that the rate is max? For
> reference, everything scales as expected for me given this change plus
> the dt change on both p2771 and p3636+p3509.

To clarify, this broke the boot test on Tegra194 because the boot was 
too slow. However, this also broke the EMC test on Tegra186 because 
setting the frequency from the debugfs failed. So two different failures 
on two different devices. I am guessing the EMC test would also fail on 
Tegra194, but given that it does not boot, we did not get that far.

> 3) If icc is requesting enough bandwidth to set the emc clock to a
> high value, then a user tries to set debugfs max_freq to a lower
> value, this code will reject the change. I do not believe this is an
> issue unique to this code. tegra20-emc, tegra30-emc, and tegra124-emc
> all have this same flow. And so does my proposed change to
> tegra210-emc-core in the actmon series. This is why I asked if
> tegra124 ran this test, to see if the failure was unique. If this is
> not a unique failure, then I'd argue that all instances need to be
> changed, not just this one, which would cause diverging results
> depending on the soc being
> feature parity to all the tegra socs I'm working on. I don't want to
> cause even more divergence.

Yes, that is a fair point. However, we need to detect this in
tegra-tests so that we know that this will not work. It would be nice if
we could disable ICC from userspace and then run the test.
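
To make the interaction in #3 concrete, here is a minimal standalone model
of two rate sources (ICC and debugfs) that each hold a min/max limit. It is
only a sketch of the behaviour described above, not code from any of the
tegra EMC drivers or from tegra-tests; the class, its methods, the source
names and the -ERANGE return value are all assumptions made for
illustration.

```python
import errno


class RateRequests:
    """Hypothetical model: each source holds a (min, max) rate limit in Hz."""

    def __init__(self, hw_min, hw_max):
        self.limits = {"icc": (hw_min, hw_max), "debug": (hw_min, hw_max)}

    def set_limits(self, source, min_rate, max_rate):
        """Apply new limits for one source, rejecting unsatisfiable sets."""
        trial = dict(self.limits)
        trial[source] = (min_rate, max_rate)
        floor = max(lo for lo, _ in trial.values())
        ceiling = min(hi for _, hi in trial.values())
        if floor > ceiling:
            # e.g. ICC demands a high minimum while debugfs asks for a
            # lower maximum; the existing limits are left untouched.
            return -errno.ERANGE
        self.limits = trial
        return 0

    def can_cap_at(self, max_rate):
        """What a test could check first: is the ICC floor below the cap?"""
        return self.limits["icc"][0] <= max_rate
```

In this model, once the ICC source pins the minimum at the top rate, a
debugfs request for a lower maximum is rejected. A test could call
something like can_cap_at() up front and skip or report the case
explicitly instead of failing on the write.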

Bottom line here is that #1 is the problem that needs to be fixed.

Jon

-- 
nvpublic