netdev - Re: [PATCH RFC net-next v2 0/3] net: stmmac: approach 2 to solve EEE LPI reset issues

Message-ID: <4fe02d97-2c38-4d40-b17d-5f8174d2f7cc@nvidia.com>
Date: Tue, 11 Mar 2025 13:25:58 +0000
From: Jon Hunter <jonathanh@...dia.com>
To: "Russell King (Oracle)" <linux@...linux.org.uk>
Cc: Thierry Reding <treding@...dia.com>,
 "Lad, Prabhakar" <prabhakar.csengg@...il.com>,
 Alexandre Torgue <alexandre.torgue@...s.st.com>, Andrew Lunn
 <andrew@...n.ch>, Andrew Lunn <andrew+netdev@...n.ch>,
 "David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Heiner Kallweit <hkallweit1@...il.com>, Jakub Kicinski <kuba@...nel.org>,
 linux-arm-kernel@...ts.infradead.org,
 linux-stm32@...md-mailman.stormreply.com,
 Maxime Coquelin <mcoquelin.stm32@...il.com>, netdev@...r.kernel.org,
 Paolo Abeni <pabeni@...hat.com>,
 "linux-tegra@...r.kernel.org" <linux-tegra@...r.kernel.org>
Subject: Re: [PATCH RFC net-next v2 0/3] net: stmmac: approach 2 to solve EEE
 LPI reset issues


On 10/03/2025 14:20, Jon Hunter wrote:
> 
> On 07/03/2025 17:07, Russell King (Oracle) wrote:
>> On Fri, Mar 07, 2025 at 04:11:19PM +0000, Jon Hunter wrote:
>>> Hi Russell,
>>>
>>> On 06/03/2025 15:23, Russell King (Oracle) wrote:
>>>> Hi,
>>>>
>>>> This is a second approach to solving the STMMAC reset issues caused by
>>>> the lack of receive clock from the PHY where the media is in low power
>>>> mode with a PHY that supports receive clock-stop.
>>>>
>>>> The first approach centred around only addressing the issue in the
>>>> resume path, but it seems to also happen when the platform glue module
>>>> is removed and re-inserted (Jon - can you check whether that's also
>>>> the case for you please?)
>>>>
>>>> As this is more targeted, I've dropped the patches from this series
>>>> which move the call to phylink_resume(), so the link may still come
>>>> up too early on resume - but that's something I also intend to fix.
>>>>
>>>> This is experimental - so I value test reports for this change.
>>>
>>>
>>> The subject indicates 3 patches, but I only see 2 patches? Can you confirm
>>> if there are 2 or 3?
>>
>> Yes, 2 patches is correct.
>>
>>> So far I have only tested the resume case with the 2 patches to make sure
>>> that is working, but on Tegra186, which has been the most problematic, it
>>> is not working reliably on top of next-20250305.
>>
>> To confirm, you're seeing stmmac_reset() sporadically timing out on
>> resume even with these patches applied? That's rather disappointing.
> 
> So I am no longer seeing the reset fail, from what I can see, but now
> NFS is not responding after resume ...
> 
> [   49.825094] Enabling non-boot CPUs ...
> [   49.829760] Detected PIPT I-cache on CPU1
> [   49.832694] CPU features: SANITY CHECK: Unexpected variation in SYS_CTR_EL0. Boot CPU: 0x0000008444c004, CPU1: 0x0000009444c004
> [   49.844120] CPU features: SANITY CHECK: Unexpected variation in SYS_ID_AA64DFR0_EL1. Boot CPU: 0x00000010305106, CPU1: 0x00000010305116
> [   49.856231] CPU features: SANITY CHECK: Unexpected variation in SYS_ID_DFR0_EL1. Boot CPU: 0x00000003010066, CPU1: 0x00000003001066
> [   49.868081] CPU1: Booted secondary processor 0x0000000000 [0x4e0f0030]
> [   49.875389] CPU1 is up
> [   49.877187] Detected PIPT I-cache on CPU2
> [   49.880824] CPU features: SANITY CHECK: Unexpected variation in SYS_CTR_EL0. Boot CPU: 0x0000008444c004, CPU2: 0x0000009444c004
> [   49.892266] CPU features: SANITY CHECK: Unexpected variation in SYS_ID_AA64DFR0_EL1. Boot CPU: 0x00000010305106, CPU2: 0x00000010305116
> [   49.904467] CPU features: SANITY CHECK: Unexpected variation in SYS_ID_DFR0_EL1. Boot CPU: 0x00000003010066, CPU2: 0x00000003001066
> [   49.916257] CPU2: Booted secondary processor 0x0000000001 [0x4e0f0030]
> [   49.923610] CPU2 is up
> [   49.925194] Detected PIPT I-cache on CPU3
> [   49.929010] CPU3: Booted secondary processor 0x0000000101 [0x411fd073]
> [   49.935866] CPU3 is up
> [   49.937983] Detected PIPT I-cache on CPU4
> [   49.941824] CPU4: Booted secondary processor 0x0000000102 [0x411fd073]
> [   49.948593] CPU4 is up
> [   49.950810] Detected PIPT I-cache on CPU5
> [   49.954651] CPU5: Booted secondary processor 0x0000000103 [0x411fd073]
> [   49.961431] CPU5 is up
> [   50.069784] dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/rgmii link mode
> [   50.077634] dwmac4: Master AXI performs any burst length
> [   50.080718] dwc-eth-dwmac 2490000.ethernet eth0: No Safety Features support found
> [   50.088172] dwc-eth-dwmac 2490000.ethernet eth0: IEEE 1588-2008 Advanced Timestamp supported
> [   50.096851] dwc-eth-dwmac 2490000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
> [   50.110897] usb-conn-gpio 3520000.padctl:ports:usb2-0:connector: repeated role: device
> [   50.113922] tegra-xusb 3530000.usb: Firmware timestamp: 2020-07-06 13:39:28 UTC
> [   50.147552] OOM killer enabled.
> [   50.148441] Restarting tasks ... done.
> [   50.152552] VDDIO_SDMMC3_AP: voltage operation not allowed
> [   50.154761] random: crng reseeded on system resumption
> [   50.162912] PM: suspend exit
> [   50.212215] VDDIO_SDMMC3_AP: voltage operation not allowed
> [   50.271578] VDDIO_SDMMC3_AP: voltage operation not allowed
> [   50.338597] VDDIO_SDMMC3_AP: voltage operation not allowed
> [  234.474848] nfs: server 10.26.51.252 not responding, still trying
> [  234.538769] nfs: server 10.26.51.252 not responding, still trying
> [  237.546922] nfs: server 10.26.51.252 not responding, still trying
> [  254.762753] nfs: server 10.26.51.252 not responding, timed out
> [  254.762771] nfs: server 10.26.51.252 not responding, timed out
> [  254.766376] nfs: server 10.26.51.252 not responding, timed out
> [  254.766392] nfs: server 10.26.51.252 not responding, timed out
> [  254.783778] nfs: server 10.26.51.252 not responding, timed out
> [  254.789582] nfs: server 10.26.51.252 not responding, timed out
> [  254.795421] nfs: server 10.26.51.252 not responding, timed out
> [  254.801193] nfs: server 10.26.51.252 not responding, timed out
> 
>> Do either of the two attached diffs make any difference?
> 
> I will try these next.


I tried both diffs, but both show the same problem as above: I still
see these NFS timeouts after resuming. What works best is the original
change you proposed (the diff below is on top of the latest two
patches) ...

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index e2146d3aee74..48a646b76a29 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -3109,10 +3109,7 @@ static int stmmac_init_dma_engine(struct stmmac_priv *priv)
         if (priv->extend_desc && (priv->mode == STMMAC_RING_MODE))
                 priv->plat->dma_cfg->atds = 1;
  
-       /* Note that the PHY clock must be running for reset to complete. */
-       phylink_rx_clk_stop_block(priv->phylink);
         ret = stmmac_reset(priv, priv->ioaddr);
-       phylink_rx_clk_stop_unblock(priv->phylink);
         if (ret) {
                 netdev_err(priv->dev, "Failed to reset the dma\n");
                 return ret;
@@ -7951,6 +7948,8 @@ int stmmac_resume(struct device *dev)
         rtnl_lock();
         mutex_lock(&priv->lock);
  
+       /* Note that the PHY clock must be running for reset to complete. */
+       phylink_rx_clk_stop_block(priv->phylink);
         stmmac_reset_queues_param(priv);
  
         stmmac_free_tx_skbufs(priv);
@@ -7961,6 +7960,7 @@ int stmmac_resume(struct device *dev)
         stmmac_set_rx_mode(ndev);
  
         stmmac_restore_hw_vlan_rx_fltr(priv, ndev, priv->hw);
+       phylink_rx_clk_stop_unblock(priv->phylink);
  
         stmmac_enable_all_queues(priv);
         stmmac_enable_all_dma_irq(priv);
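
For readers following the thread, the difference between the two placements can be sketched with a toy user-space model. This is not kernel code: `struct mock_link`, the `mock_*` helpers and the step names are hypothetical, the block/unblock pair is assumed to behave like a nesting counter, and the reset is assumed to be reached somewhere in the lines elided between the two stmmac_resume() hunks. The model only records which re-initialisation steps run while receive clock-stop is blocked; it does not claim which of those steps actually need the clock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the link state; NOT the kernel's struct phylink. */
struct mock_link {
	int rx_clk_stop_blocked;	/* nesting count of active blocks */
};

/* Hypothetical step names loosely mirroring the resume sequence. */
enum mock_step {
	STEP_RESET_QUEUES,
	STEP_DMA_RESET,
	STEP_SET_RX_MODE,
	STEP_RESTORE_VLAN,
	STEP_ENABLE_QUEUES,
	STEP_COUNT
};

/* Assumed analogue of phylink_rx_clk_stop_block(): forbid clock-stop. */
void mock_rx_clk_stop_block(struct mock_link *l)
{
	l->rx_clk_stop_blocked++;
}

/* Assumed analogue of phylink_rx_clk_stop_unblock(): allow it again. */
void mock_rx_clk_stop_unblock(struct mock_link *l)
{
	assert(l->rx_clk_stop_blocked > 0);	/* unbalanced unblock is a bug */
	l->rx_clk_stop_blocked--;
}

/* Record whether this step ran while clock-stop was blocked. */
void mock_run_step(const struct mock_link *l, bool covered[STEP_COUNT],
		   enum mock_step step)
{
	covered[step] = l->rx_clk_stop_blocked > 0;
}

/* Narrow scope: only the reset is bracketed (the lines the diff removes). */
void mock_resume_narrow(struct mock_link *l, bool covered[STEP_COUNT])
{
	mock_run_step(l, covered, STEP_RESET_QUEUES);
	mock_rx_clk_stop_block(l);
	mock_run_step(l, covered, STEP_DMA_RESET);
	mock_rx_clk_stop_unblock(l);
	mock_run_step(l, covered, STEP_SET_RX_MODE);
	mock_run_step(l, covered, STEP_RESTORE_VLAN);
	mock_run_step(l, covered, STEP_ENABLE_QUEUES);
}

/* Wide scope: the whole re-init sequence is bracketed (the lines added). */
void mock_resume_wide(struct mock_link *l, bool covered[STEP_COUNT])
{
	mock_rx_clk_stop_block(l);
	mock_run_step(l, covered, STEP_RESET_QUEUES);
	mock_run_step(l, covered, STEP_DMA_RESET);
	mock_run_step(l, covered, STEP_SET_RX_MODE);
	mock_run_step(l, covered, STEP_RESTORE_VLAN);
	mock_rx_clk_stop_unblock(l);
	mock_run_step(l, covered, STEP_ENABLE_QUEUES);
}
```

Calling `mock_resume_narrow()` and `mock_resume_wide()` on a zero-initialised `struct mock_link` and comparing the two `covered` arrays shows that the wide variant also covers the queue-parameter reset, RX mode and VLAN filter steps, while both variants leave the block count balanced at zero afterwards.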

-- 
nvpublic