netdev - Re: [PATCH net-next 9/9] net: stmmac: convert to phylink managed EEE support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <203871c2-c673-4a98-a0a3-299d1cf71cf0@nvidia.com>
Date: Wed, 19 Feb 2025 17:52:34 +0000
From: Jon Hunter <jonathanh@...dia.com>
To: "Russell King (Oracle)" <linux@...linux.org.uk>
Cc: Andrew Lunn <andrew@...n.ch>, Heiner Kallweit <hkallweit1@...il.com>,
 Alexandre Torgue <alexandre.torgue@...s.st.com>,
 Andrew Lunn <andrew+netdev@...n.ch>,
 Bryan Whitehead <bryan.whitehead@...rochip.com>,
 "David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
 Jakub Kicinski <kuba@...nel.org>, linux-arm-kernel@...ts.infradead.org,
 linux-stm32@...md-mailman.stormreply.com,
 Marcin Wojtas <marcin.s.wojtas@...il.com>,
 Maxime Coquelin <mcoquelin.stm32@...il.com>, netdev@...r.kernel.org,
 Paolo Abeni <pabeni@...hat.com>, UNGLinuxDriver@...rochip.com,
 "linux-tegra@...r.kernel.org" <linux-tegra@...r.kernel.org>
Subject: Re: [PATCH net-next 9/9] net: stmmac: convert to phylink managed EEE
 support


On 19/02/2025 15:36, Russell King (Oracle) wrote:
> On Wed, Feb 19, 2025 at 02:01:34PM +0000, Jon Hunter wrote:
>>
>> On 14/02/2025 11:21, Russell King (Oracle) wrote:
>>> On Fri, Feb 14, 2025 at 10:58:55AM +0000, Jon Hunter wrote:
>>>> Thanks for the feedback. So ...
>>>>
>>>> 1. I can confirm that suspend works if I disable EEE via ethtool
>>>> 2. Prior to this change I do see phy_eee_rx_clock_stop being called
>>>>      to enable the clock resuming from suspend, but after this change
>>>>      it is not.
>>>>
>>>> Prior to this change I see (note the prints around 389-392 are when
>>>> we resume from suspend) ...
>>>>
>>>> [    4.654454] Broadcom BCM89610 stmmac-0:00: phy_eee_rx_clock_stop: clk_stop_enable 0
>>>
>>> This is a bug in phylink - it shouldn't have been calling
>>> phy_eee_rx_clock_stop() where a MAC doesn't support phylink managed EEE.
>>>
>>>> [    4.723123] dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/rgmii link mode
>>>> [    7.629652] Broadcom BCM89610 stmmac-0:00: phy_eee_rx_clock_stop: clk_stop_enable 1
>>>
>>> Presumably, this is when the link comes up before suspend, so the PHY
>>> has been configured to allow the RX clock to be stopped prior to suspend
>>>
>>>> [  389.086185] dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/rgmii link mode
>>>> [  392.863744] Broadcom BCM89610 stmmac-0:00: phy_eee_rx_clock_stop: clk_stop_enable 1
>>>
>>> Presumably, as this is after resume, this is again when the link comes
>>> up (that's the only time that stmmac calls phy_eee_rx_clock_stop().)
>>>
>>>> After this change I see ...
>>>>
>>>> [    4.644614] Broadcom BCM89610 stmmac-0:00: phy_eee_rx_clock_stop: clk_stop_enable 1
>>>> [    4.679224] dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/rgmii link mode
>>>> [  191.219828] dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/rgmii link mode
>>>
>>> To me, this looks no different - the PHY was configured for clock stop
>>> before suspending in both cases.
>>>
>>> However, something else to verify with the old code - after boot and the
>>> link comes up (so you get the second phy_eee_rx_clock_stop() at 7s),
>>> try unplugging the link and re-plugging it. Then try suspending.
>>
>> I still need to try this but I am still not back to the office to get to this.
>>   > The point of this test is to verify whether the PHY ignores changes to
>>> the RX clock stop configuration while the link is up.
>>>
>>>
>>>
>>> The next stage is to instrument dwmac4_set_eee_mode(),
>>> dwmac4_reset_eee_mode() and dwmac4_set_eee_lpi_entry_timer() to print
>>> the final register values in each function vs dwmac4_set_lpi_mode() in
>>> the new code. Also, I think instrumenting stmmac_common_interrupt() to
>>> print a message when we get either CORE_IRQ_TX_PATH_IN_LPI_MODE or
>>> CORE_IRQ_TX_PATH_EXIT_LPI_MODE indicating a change in LPI state would
>>> be a good idea.
>>>
>>> I'd like to see how this all ties up with suspend, resume, link up
>>> and down events, so please don't trim the log so much.
>>
>> I have been testing on top of v6.14-rc2 which does not have
>> dwmac4_set_lpi_mode(). However, instrumenting the other functions,
>> for a bad case I see ...
>>
>> [  477.494226] PM: suspend entry (deep)
>> [  477.501869] Filesystems sync: 0.006 seconds
>> [  477.504518] Freezing user space processes
>> [  477.509067] Freezing user space processes completed (elapsed 0.001 seconds)
>> [  477.514770] OOM killer disabled.
>> [  477.517940] Freezing remaining freezable tasks
>> [  477.523449] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
>> [  477.566870] tegra-xusb 3530000.usb: Firmware timestamp: 2020-07-06 13:39:28 UTC
>> [  477.586423] dwc-eth-dwmac 2490000.ethernet eth0: disable EEE
>> [  477.592052] dwmac4_set_eee_lpi_entry_timer: entered
>> [  477.596997] dwmac4_set_eee_lpi_entry_timer: GMAC4_LPI_CTRL_STATUS 0x0
>> [  477.680193] dwc-eth-dwmac 2490000.ethernet eth0: Link is Down
> 
> This tells me WoL is not enabled, and thus phylink_suspend() did a
> phylink_stop() which took the link administratively down and disabled
> LPI at the MAC. The actual physical link on the media may still be up
> at this point, and the remote end may still signal LPI to the local
> PHY.
> 
> ...system suspends and resumes...
>> [  477.876778] CPU5: Booted secondary processor 0x0000000103 [0x411fd073]
>> [  477.883556] CPU5 is up
> 
> stmmac_resume() gets called here, which will call into phylink_resume()
> and, because WoL wasn't used at suspend time, will call phylink_start()
> which immediately prints:
> 
>> [  477.985628] dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/rgmii link mode
> 
> and then it allows the phylink resolver to run in a separate workqueue.
> The output from the phylink resolver thread, I'll label as "^WQ".
> Messages from the thread that called stmmac_resume() I'll labell with
> "^RES".
> 
>> [  477.993771] dwc-eth-dwmac 2490000.ethernet eth0: stmmac_mac_enable_tx_lpi: tx_lpi_timer 1000000
> 	^WQ
> 
> At this point, the workqueue has called mac_link_up() and this indicates
> that that method has completed and it's now calling mac_enable_tx_lpi().
> In other words, the transmitter and receiver have been enabled here!
> This is key...
> 
>> [  478.171396] dwmac4: Master AXI performs any burst length
> 	^RES
> 
> dwmac4_dma_axi(), which is called from stmmac_init_dma_engine() which
> then goes on to call stmmac_reset(). As noted above, however, the
> MAC has had its transmitter and receiver enabled at this point, so
> hitting the hardware with a reset probably undoes all that.
> stmmac_init_dma_engine() is called from stmmac_hw_setup() and
> stmmac_resume() _after_ calling phylink_resume().
> 
>> [  478.174480] dwc-eth-dwmac 2490000.ethernet eth0: No Safety Features support found
> 	^RES
> 
> Printed by stmmac_safety_feat_configuration() which is called from
> stmmac_hw_setup().
> 
>> [  478.181934] dwc-eth-dwmac 2490000.ethernet eth0: IEEE 1588-2008 Advanced Timestamp supported
> 	^RES
> 
> Printed by stmmac_init_ptp() called from stmmac_hw_setup().
> 
>> [  478.202977] dwmac4_set_eee_lpi_entry_timer: entered
> 	^WQ
>> [  478.207918] dwmac4_set_eee_lpi_entry_timer: GMAC4_LPI_CTRL_STATUS 0xf4240
> 	^WQ
>> [  478.287646] dwc-eth-dwmac 2490000.ethernet eth0: Energy-Efficient Ethernet initialized
> 	^WQ
>> [  478.295538] dwc-eth-dwmac 2490000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
> 	^WQ
> 
> So clearly the phylink resolver is racing with the rest of the stmmac
> resume path - which doesn't surprise me in the least. I believe I raised
> the fact that calling phylink_resume() before the hardware was ready to
> handle link-up is a bad idea precisely because of races like this.
> 
> The reason stmmac does this is because of it's quirk that it needs the
> receive clock from the PHY in order for stmmac_reset() to work.


I do see the reset fail infrequently on previous kernels with this 
device and when it does I see these messages ...

  dwc-eth-dwmac 2490000.ethernet: Failed to reset the dma
  dwc-eth-dwmac 2490000.ethernet eth0: stmmac_hw_setup: DMA engine
   initialization failed

>> For a good case I see ...
>>
>> [   28.548472] PM: suspend entry (deep)
>> [   28.560503] Filesystems sync: 0.010 seconds
>> [   28.563622] Freezing user space processes
>> [   28.567838] Freezing user space processes completed (elapsed 0.001 seconds)
>> [   28.573380] OOM killer disabled.
>> [   28.576563] Freezing remaining freezable tasks
>> [   28.582100] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
>> [   28.627180] tegra-xusb 3530000.usb: Firmware timestamp: 2020-07-06 13:39:28 UTC
>> [   28.646770] dwc-eth-dwmac 2490000.ethernet eth0: Link is Down
> 
> Same as above...
> 
> ...system suspends and resumes...
>> [   29.099556] CPU5: Booted secondary processor 0x0000000103 [0x411fd073]
>> [   29.106351] CPU5 is up
>> [   29.218549] dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/rgmii link mode
> 	^RES
>> [   29.234190] dwmac4: Master AXI performs any burst length
> 	^RES
>> [   29.237263] dwc-eth-dwmac 2490000.ethernet eth0: No Safety Features support found
> 	^RES
>> [   29.244732] dwc-eth-dwmac 2490000.ethernet eth0: IEEE 1588-2008 Advanced Timestamp supported
> 	^RES
>> [   29.306981] Restarting tasks ... done.
>> [   29.311423] VDDIO_SDMMC3_AP: voltage operation not allowed
>> [   29.314095] random: crng reseeded on system resumption
>> [   29.321404] PM: suspend exit
>> [   29.370286] VDDIO_SDMMC3_AP: voltage operation not allowed
>> [   29.429655] VDDIO_SDMMC3_AP: voltage operation not allowed
>> [   29.496567] VDDIO_SDMMC3_AP: voltage operation not allowed
>> [   32.968855] Broadcom BCM89610 stmmac-0:00: phy_eee_rx_clock_stop: clk_stop_enable 1
> 	^WQ
>> [   32.974779] dwc-eth-dwmac 2490000.ethernet eth0: stmmac_mac_link_up: tx_lpi_timer 1000000
> 	^WQ
>> [   32.988755] dwc-eth-dwmac 2490000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
> 	^WQ
> 
> So here, phylink_resolve() runs later.
> 
> I think if you run this same test with an earlier kernel, you'll get
> much the same random behaviour, maybe with different weightings on
> "success" and "failure" because of course the code has changed - but
> only because that's caused a change in timings of the already present
> race.
> 
>> The more I have been testing, the more I feel that this is timing
>> related. In good cases, I see the MAC link coming up well after the
>> PHY. Even with your change I did see suspend work on occassion and
>> when it does I see ...
>>
>> [   79.775977] dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/rgmii link mode
>> [   79.784196] dwmac4: Master AXI performs any burst length
>> [   79.787280] dwc-eth-dwmac 2490000.ethernet eth0: No Safety Features support found
>> [   79.794736] dwc-eth-dwmac 2490000.ethernet eth0: IEEE 1588-2008 Advanced Timestamp supported
>> [   79.816642] usb-conn-gpio 3520000.padctl:ports:usb2-0:connector: repeated role: device
>> [   79.820437] tegra-xusb 3530000.usb: Firmware timestamp: 2020-07-06 13:39:28 UTC
>> [   79.854481] OOM killer enabled.
>> [   79.855372] Restarting tasks ... done.
>> [   79.859460] VDDIO_SDMMC3_AP: voltage operation not allowed
>> [   79.861297] random: crng reseeded on system resumption
>> [   79.869773] PM: suspend exit
>> [   79.914909] VDDIO_SDMMC3_AP: voltage operation not allowed
>> [   79.974322] VDDIO_SDMMC3_AP: voltage operation not allowed
>> [   80.041236] VDDIO_SDMMC3_AP: voltage operation not allowed
>> [   83.547730] dwc-eth-dwmac 2490000.ethernet eth0: stmmac_mac_enable_tx_lpi: tx_lpi_timer 1000000
>> [   83.566859] dwmac4_set_eee_lpi_entry_timer: entered
>> [   83.571782] dwmac4_set_eee_lpi_entry_timer: GMAC4_LPI_CTRL_STATUS 0xf4240
>> [   83.651520] dwc-eth-dwmac 2490000.ethernet eth0: Energy-Efficient Ethernet initialized
>> [   83.659425] dwc-eth-dwmac 2490000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
>>
>> On a good case, the stmmac_mac_enable_tx_lpi call always happens
>> much later. It seems that after this change it is more often
>> that the link is coming up sooner and I guess probably too soon.
>> May be we were getting lucky before?
> 
> I think this is pure lottery.

Yes it does appear to be.

>> Anyway, I made the following change for testing and this is
>> working ...
>>
>> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> index b34ebb916b89..44187e230a1e 100644
>> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
>> @@ -7906,16 +7906,6 @@ int stmmac_resume(struct device *dev)
>>                          return ret;
>>          }
>> -       rtnl_lock();
>> -       if (device_may_wakeup(priv->device) && priv->plat->pmt) {
>> -               phylink_resume(priv->phylink);
>> -       } else {
>> -               phylink_resume(priv->phylink);
>> -               if (device_may_wakeup(priv->device))
>> -                       phylink_speed_up(priv->phylink);
>> -       }
>> -       rtnl_unlock();
>> -
>>          rtnl_lock();
>>          mutex_lock(&priv->lock);
>> @@ -7930,6 +7920,13 @@ int stmmac_resume(struct device *dev)
>>          stmmac_restore_hw_vlan_rx_fltr(priv, ndev, priv->hw);
>> +       if (device_may_wakeup(priv->device) && priv->plat->pmt) {
>> +               phylink_resume(priv->phylink);
>> +       } else {
>> +               phylink_resume(priv->phylink);
>> +               if (device_may_wakeup(priv->device))
>> +                       phylink_speed_up(priv->phylink);
>> +       }
>>          stmmac_enable_all_queues(priv);
>>          stmmac_enable_all_dma_irq(priv);
>>
>> I noticed that in __stmmac_open() the phylink_start() is
>> called after stmmac_hw_setup and stmmac_init_coalesce, where
>> as in stmmac_resume, phylink_resume() is called before these.
>> I am not saying that this is correct in any way, but seems
>> to indicate that the PHY is coming up too soon (at least for
>> this device). I have ran 100 suspend iterations with the above
>> and I have not seen any failures.
>>
>> Let me know if you have any thoughts on this.
> 
> With my phylink-maintainer hat on, this is definitely the correct
> solution - maybe even moving the phylink_resume() call later.
> phylink_resume() should only be called when the driver is prepared
> to handle and cope with an immediate call to the mac_link_up()
> method, and it's clear that its current placement is such that the
> driver isn't prepared for that.
> 
> However... see:
> 
> 36d18b5664ef ("net: stmmac: start phylink instance before stmmac_hw_setup()")
> 
> but I also questioned this in:
> 
> https://lore.kernel.org/netdev/20210903080147.GS22278@shell.armlinux.org.uk/
> 
> see the bottom of that email starting "While reading stmmac_resume(), I
> have to question the placement of this code block:". The response was:
> 
> "There is a story here, SNPS EQOS IP need PHY provides RXC clock for
> MAC's receive logic, so we need phylink_start() to bring PHY link up,
> that make PHY resume back, PHY could stop RXC clock when in suspended
> state. This is the reason why calling phylink_start() before re-config
> MAC."
> 
> However, in 21d9ba5bc551 ("net: phylink: add PHY_F_RXC_ALWAYS_ON to PHY
> dev flags") and associated patches, I added a way that phylib can be
> told that the MAC requires the RXC at all times.
> 
> Romain Gantois arranged for this flag to always be set for stmmac in
> commit 58329b03a595 ("net: stmmac: Signal to PHY/PCS drivers to keep RX
> clock on"), which will pass PHY_F_RXC_ALWAYS_ON to the PHY driver.
> Whether the PHY driver honours this flag or not depends on which
> driver is used.
> 
> So, my preference would be to move phylink_resume() later, removing
> the race condition. If there's any regressions, then we need to
> _properly_ solve them by ensuring that the PHY keeps the RX clock
> running by honouring PHY_F_RXC_ALWAYS_ON. That's going to need
> everyone to test their stmmac platforms to find all the cases that
> need fixing...


Thanks for the in-depth analysis and feedback. We have 3 SoCs that use 
this driver and so I will do some testing with this change on all of them.

Thanks again.
Jon

-- 
nvpublic