netdev - Re: [PATCH net-next 9/9] net: stmmac: convert to phylink managed EEE support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Z7X6Z8yLMsQ1wa2D@shell.armlinux.org.uk>
Date: Wed, 19 Feb 2025 15:36:07 +0000
From: "Russell King (Oracle)" <linux@...linux.org.uk>
To: Jon Hunter <jonathanh@...dia.com>
Cc: Andrew Lunn <andrew@...n.ch>, Heiner Kallweit <hkallweit1@...il.com>,
	Alexandre Torgue <alexandre.torgue@...s.st.com>,
	Andrew Lunn <andrew+netdev@...n.ch>,
	Bryan Whitehead <bryan.whitehead@...rochip.com>,
	"David S. Miller" <davem@...emloft.net>,
	Eric Dumazet <edumazet@...gle.com>,
	Jakub Kicinski <kuba@...nel.org>,
	linux-arm-kernel@...ts.infradead.org,
	linux-stm32@...md-mailman.stormreply.com,
	Marcin Wojtas <marcin.s.wojtas@...il.com>,
	Maxime Coquelin <mcoquelin.stm32@...il.com>, netdev@...r.kernel.org,
	Paolo Abeni <pabeni@...hat.com>, UNGLinuxDriver@...rochip.com,
	"linux-tegra@...r.kernel.org" <linux-tegra@...r.kernel.org>
Subject: Re: [PATCH net-next 9/9] net: stmmac: convert to phylink managed EEE
 support

On Wed, Feb 19, 2025 at 02:01:34PM +0000, Jon Hunter wrote:
> 
> On 14/02/2025 11:21, Russell King (Oracle) wrote:
> > On Fri, Feb 14, 2025 at 10:58:55AM +0000, Jon Hunter wrote:
> > > Thanks for the feedback. So ...
> > > 
> > > 1. I can confirm that suspend works if I disable EEE via ethtool
> > > 2. Prior to this change I do see phy_eee_rx_clock_stop being called
> > >     to enable the clock resuming from suspend, but after this change
> > >     it is not.
> > > 
> > > Prior to this change I see (note the prints around 389-392 are when
> > > we resume from suspend) ...
> > > 
> > > [    4.654454] Broadcom BCM89610 stmmac-0:00: phy_eee_rx_clock_stop: clk_stop_enable 0
> > 
> > This is a bug in phylink - it shouldn't have been calling
> > phy_eee_rx_clock_stop() where a MAC doesn't support phylink managed EEE.
> > 
> > > [    4.723123] dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/rgmii link mode
> > > [    7.629652] Broadcom BCM89610 stmmac-0:00: phy_eee_rx_clock_stop: clk_stop_enable 1
> > 
> > Presumably, this is when the link comes up before suspend, so the PHY
> > has been configured to allow the RX clock to be stopped prior to suspend
> > 
> > > [  389.086185] dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/rgmii link mode
> > > [  392.863744] Broadcom BCM89610 stmmac-0:00: phy_eee_rx_clock_stop: clk_stop_enable 1
> > 
> > Presumably, as this is after resume, this is again when the link comes
> > up (that's the only time that stmmac calls phy_eee_rx_clock_stop().)
> > 
> > > After this change I see ...
> > > 
> > > [    4.644614] Broadcom BCM89610 stmmac-0:00: phy_eee_rx_clock_stop: clk_stop_enable 1
> > > [    4.679224] dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/rgmii link mode
> > > [  191.219828] dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/rgmii link mode
> > 
> > To me, this looks no different - the PHY was configured for clock stop
> > before suspending in both cases.
> > 
> > However, something else to verify with the old code - after boot and the
> > link comes up (so you get the second phy_eee_rx_clock_stop() at 7s),
> > try unplugging the link and re-plugging it. Then try suspending.
> 
> I still need to try this but I am still not back to the office to get to this.
>  > The point of this test is to verify whether the PHY ignores changes to
> > the RX clock stop configuration while the link is up.
> > 
> > 
> > 
> > The next stage is to instrument dwmac4_set_eee_mode(),
> > dwmac4_reset_eee_mode() and dwmac4_set_eee_lpi_entry_timer() to print
> > the final register values in each function vs dwmac4_set_lpi_mode() in
> > the new code. Also, I think instrumenting stmmac_common_interrupt() to
> > print a message when we get either CORE_IRQ_TX_PATH_IN_LPI_MODE or
> > CORE_IRQ_TX_PATH_EXIT_LPI_MODE indicating a change in LPI state would
> > be a good idea.
> > 
> > I'd like to see how this all ties up with suspend, resume, link up
> > and down events, so please don't trim the log so much.
> 
> I have been testing on top of v6.14-rc2 which does not have
> dwmac4_set_lpi_mode(). However, instrumenting the other functions,
> for a bad case I see ...
> 
> [  477.494226] PM: suspend entry (deep)
> [  477.501869] Filesystems sync: 0.006 seconds
> [  477.504518] Freezing user space processes
> [  477.509067] Freezing user space processes completed (elapsed 0.001 seconds)
> [  477.514770] OOM killer disabled.
> [  477.517940] Freezing remaining freezable tasks
> [  477.523449] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
> [  477.566870] tegra-xusb 3530000.usb: Firmware timestamp: 2020-07-06 13:39:28 UTC
> [  477.586423] dwc-eth-dwmac 2490000.ethernet eth0: disable EEE
> [  477.592052] dwmac4_set_eee_lpi_entry_timer: entered
> [  477.596997] dwmac4_set_eee_lpi_entry_timer: GMAC4_LPI_CTRL_STATUS 0x0
> [  477.680193] dwc-eth-dwmac 2490000.ethernet eth0: Link is Down

This tells me WoL is not enabled, and thus phylink_suspend() did a
phylink_stop() which took the link administratively down and disabled
LPI at the MAC. The actual physical link on the media may still be up
at this point, and the remote end may still signal LPI to the local
PHY.

...system suspends and resumes...
> [  477.876778] CPU5: Booted secondary processor 0x0000000103 [0x411fd073]
> [  477.883556] CPU5 is up

stmmac_resume() gets called here, which will call into phylink_resume()
and, because WoL wasn't used at suspend time, will call phylink_start()
which immediately prints:

> [  477.985628] dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/rgmii link mode

and then it allows the phylink resolver to run in a separate workqueue.
The output from the phylink resolver thread, I'll label as "^WQ".
Messages from the thread that called stmmac_resume() I'll labell with
"^RES".

> [  477.993771] dwc-eth-dwmac 2490000.ethernet eth0: stmmac_mac_enable_tx_lpi: tx_lpi_timer 1000000
	^WQ

At this point, the workqueue has called mac_link_up() and this indicates
that that method has completed and it's now calling mac_enable_tx_lpi().
In other words, the transmitter and receiver have been enabled here!
This is key...

> [  478.171396] dwmac4: Master AXI performs any burst length
	^RES

dwmac4_dma_axi(), which is called from stmmac_init_dma_engine() which
then goes on to call stmmac_reset(). As noted above, however, the
MAC has had its transmitter and receiver enabled at this point, so
hitting the hardware with a reset probably undoes all that.
stmmac_init_dma_engine() is called from stmmac_hw_setup() and
stmmac_resume() _after_ calling phylink_resume().

> [  478.174480] dwc-eth-dwmac 2490000.ethernet eth0: No Safety Features support found
	^RES

Printed by stmmac_safety_feat_configuration() which is called from
stmmac_hw_setup().

> [  478.181934] dwc-eth-dwmac 2490000.ethernet eth0: IEEE 1588-2008 Advanced Timestamp supported
	^RES

Printed by stmmac_init_ptp() called from stmmac_hw_setup().

> [  478.202977] dwmac4_set_eee_lpi_entry_timer: entered
	^WQ
> [  478.207918] dwmac4_set_eee_lpi_entry_timer: GMAC4_LPI_CTRL_STATUS 0xf4240
	^WQ
> [  478.287646] dwc-eth-dwmac 2490000.ethernet eth0: Energy-Efficient Ethernet initialized
	^WQ
> [  478.295538] dwc-eth-dwmac 2490000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
	^WQ

So clearly the phylink resolver is racing with the rest of the stmmac
resume path - which doesn't surprise me in the least. I believe I raised
the fact that calling phylink_resume() before the hardware was ready to
handle link-up is a bad idea precisely because of races like this.

The reason stmmac does this is because of it's quirk that it needs the
receive clock from the PHY in order for stmmac_reset() to work.

> For a good case I see ...
> 
> [   28.548472] PM: suspend entry (deep)
> [   28.560503] Filesystems sync: 0.010 seconds
> [   28.563622] Freezing user space processes
> [   28.567838] Freezing user space processes completed (elapsed 0.001 seconds)
> [   28.573380] OOM killer disabled.
> [   28.576563] Freezing remaining freezable tasks
> [   28.582100] Freezing remaining freezable tasks completed (elapsed 0.001 seconds)
> [   28.627180] tegra-xusb 3530000.usb: Firmware timestamp: 2020-07-06 13:39:28 UTC
> [   28.646770] dwc-eth-dwmac 2490000.ethernet eth0: Link is Down

Same as above...

...system suspends and resumes...
> [   29.099556] CPU5: Booted secondary processor 0x0000000103 [0x411fd073]
> [   29.106351] CPU5 is up
> [   29.218549] dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/rgmii link mode
	^RES
> [   29.234190] dwmac4: Master AXI performs any burst length
	^RES
> [   29.237263] dwc-eth-dwmac 2490000.ethernet eth0: No Safety Features support found
	^RES
> [   29.244732] dwc-eth-dwmac 2490000.ethernet eth0: IEEE 1588-2008 Advanced Timestamp supported
	^RES
> [   29.306981] Restarting tasks ... done.
> [   29.311423] VDDIO_SDMMC3_AP: voltage operation not allowed
> [   29.314095] random: crng reseeded on system resumption
> [   29.321404] PM: suspend exit
> [   29.370286] VDDIO_SDMMC3_AP: voltage operation not allowed
> [   29.429655] VDDIO_SDMMC3_AP: voltage operation not allowed
> [   29.496567] VDDIO_SDMMC3_AP: voltage operation not allowed
> [   32.968855] Broadcom BCM89610 stmmac-0:00: phy_eee_rx_clock_stop: clk_stop_enable 1
	^WQ
> [   32.974779] dwc-eth-dwmac 2490000.ethernet eth0: stmmac_mac_link_up: tx_lpi_timer 1000000
	^WQ
> [   32.988755] dwc-eth-dwmac 2490000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
	^WQ

So here, phylink_resolve() runs later.

I think if you run this same test with an earlier kernel, you'll get
much the same random behaviour, maybe with different weightings on
"success" and "failure" because of course the code has changed - but
only because that's caused a change in timings of the already present
race.

> The more I have been testing, the more I feel that this is timing
> related. In good cases, I see the MAC link coming up well after the
> PHY. Even with your change I did see suspend work on occassion and
> when it does I see ...
> 
> [   79.775977] dwc-eth-dwmac 2490000.ethernet eth0: configuring for phy/rgmii link mode
> [   79.784196] dwmac4: Master AXI performs any burst length
> [   79.787280] dwc-eth-dwmac 2490000.ethernet eth0: No Safety Features support found
> [   79.794736] dwc-eth-dwmac 2490000.ethernet eth0: IEEE 1588-2008 Advanced Timestamp supported
> [   79.816642] usb-conn-gpio 3520000.padctl:ports:usb2-0:connector: repeated role: device
> [   79.820437] tegra-xusb 3530000.usb: Firmware timestamp: 2020-07-06 13:39:28 UTC
> [   79.854481] OOM killer enabled.
> [   79.855372] Restarting tasks ... done.
> [   79.859460] VDDIO_SDMMC3_AP: voltage operation not allowed
> [   79.861297] random: crng reseeded on system resumption
> [   79.869773] PM: suspend exit
> [   79.914909] VDDIO_SDMMC3_AP: voltage operation not allowed
> [   79.974322] VDDIO_SDMMC3_AP: voltage operation not allowed
> [   80.041236] VDDIO_SDMMC3_AP: voltage operation not allowed
> [   83.547730] dwc-eth-dwmac 2490000.ethernet eth0: stmmac_mac_enable_tx_lpi: tx_lpi_timer 1000000
> [   83.566859] dwmac4_set_eee_lpi_entry_timer: entered
> [   83.571782] dwmac4_set_eee_lpi_entry_timer: GMAC4_LPI_CTRL_STATUS 0xf4240
> [   83.651520] dwc-eth-dwmac 2490000.ethernet eth0: Energy-Efficient Ethernet initialized
> [   83.659425] dwc-eth-dwmac 2490000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
> 
> On a good case, the stmmac_mac_enable_tx_lpi call always happens
> much later. It seems that after this change it is more often
> that the link is coming up sooner and I guess probably too soon.
> May be we were getting lucky before?

I think this is pure lottery.

> Anyway, I made the following change for testing and this is
> working ...
> 
> diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> index b34ebb916b89..44187e230a1e 100644
> --- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> +++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
> @@ -7906,16 +7906,6 @@ int stmmac_resume(struct device *dev)
>                         return ret;
>         }
> -       rtnl_lock();
> -       if (device_may_wakeup(priv->device) && priv->plat->pmt) {
> -               phylink_resume(priv->phylink);
> -       } else {
> -               phylink_resume(priv->phylink);
> -               if (device_may_wakeup(priv->device))
> -                       phylink_speed_up(priv->phylink);
> -       }
> -       rtnl_unlock();
> -
>         rtnl_lock();
>         mutex_lock(&priv->lock);
> @@ -7930,6 +7920,13 @@ int stmmac_resume(struct device *dev)
>         stmmac_restore_hw_vlan_rx_fltr(priv, ndev, priv->hw);
> +       if (device_may_wakeup(priv->device) && priv->plat->pmt) {
> +               phylink_resume(priv->phylink);
> +       } else {
> +               phylink_resume(priv->phylink);
> +               if (device_may_wakeup(priv->device))
> +                       phylink_speed_up(priv->phylink);
> +       }
>         stmmac_enable_all_queues(priv);
>         stmmac_enable_all_dma_irq(priv);
> 
> I noticed that in __stmmac_open() the phylink_start() is
> called after stmmac_hw_setup and stmmac_init_coalesce, where
> as in stmmac_resume, phylink_resume() is called before these.
> I am not saying that this is correct in any way, but seems
> to indicate that the PHY is coming up too soon (at least for
> this device). I have ran 100 suspend iterations with the above
> and I have not seen any failures.
> 
> Let me know if you have any thoughts on this.

With my phylink-maintainer hat on, this is definitely the correct
solution - maybe even moving the phylink_resume() call later.
phylink_resume() should only be called when the driver is prepared
to handle and cope with an immediate call to the mac_link_up()
method, and it's clear that its current placement is such that the
driver isn't prepared for that.

However... see:

36d18b5664ef ("net: stmmac: start phylink instance before stmmac_hw_setup()")

but I also questioned this in:

https://lore.kernel.org/netdev/20210903080147.GS22278@shell.armlinux.org.uk/

see the bottom of that email starting "While reading stmmac_resume(), I
have to question the placement of this code block:". The response was:

"There is a story here, SNPS EQOS IP need PHY provides RXC clock for
MAC's receive logic, so we need phylink_start() to bring PHY link up,
that make PHY resume back, PHY could stop RXC clock when in suspended
state. This is the reason why calling phylink_start() before re-config
MAC."

However, in 21d9ba5bc551 ("net: phylink: add PHY_F_RXC_ALWAYS_ON to PHY
dev flags") and associated patches, I added a way that phylib can be
told that the MAC requires the RXC at all times.

Romain Gantois arranged for this flag to always be set for stmmac in
commit 58329b03a595 ("net: stmmac: Signal to PHY/PCS drivers to keep RX
clock on"), which will pass PHY_F_RXC_ALWAYS_ON to the PHY driver.
Whether the PHY driver honours this flag or not depends on which
driver is used.

So, my preference would be to move phylink_resume() later, removing
the race condition. If there's any regressions, then we need to
_properly_ solve them by ensuring that the PHY keeps the RX clock
running by honouring PHY_F_RXC_ALWAYS_ON. That's going to need
everyone to test their stmmac platforms to find all the cases that
need fixing...

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!