netdev - Re: Potential issue with f5e64032a799 "net: phy: fix resume handling"

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20180206110012.GJ9418@n2100.armlinux.org.uk>
Date:   Tue, 6 Feb 2018 11:00:13 +0000
From:   Russell King - ARM Linux <linux@...linux.org.uk>
To:     Heiner Kallweit <hkallweit1@...il.com>
Cc:     Florian Fainelli <f.fainelli@...il.com>,
        Andrew Lunn <andrew@...n.ch>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>
Subject: Re: Potential issue with f5e64032a799 "net: phy: fix resume handling"

On Mon, Feb 05, 2018 at 10:48:55PM +0100, Heiner Kallweit wrote:
> Am 04.02.2018 um 03:48 schrieb Florian Fainelli:
> > 
> > 
> > On 02/03/2018 03:58 PM, Heiner Kallweit wrote:
> >> Am 03.02.2018 um 21:17 schrieb Andrew Lunn:
> >>> On Sat, Feb 03, 2018 at 05:41:54PM +0100, Heiner Kallweit wrote:
> >>>> This commit forces callers of phy_resume() and phy_suspend() to hold
> >>>> mutex phydev->lock. This was done for calls to phy_resume() and
> >>>> phy_suspend() in phylib, however there are more callers in network
> >>>> drivers. I'd assume that these other calls issue a warning now
> >>>> because of the lock not being held.
> >>>> So is there something I miss or would this have to be fixed?
> >>>
> >>> Hi Heiner
> >>>
> >>> This is a good point.
> >>>
> >>> Yes, it looks like some fixes are needed. But what exactly?
> >>>
> >>> The phy state machine will suspend and resume the phy is you call
> >>> phy_stop() and phy_start() in the MAC suspend and resume functions.
> >>>
> >> AFAICS phy_stop() doesn't suspend the PHY. It just sets the state
> >> to PHY_HALTED and (at least if we're not in polling mode) doesn't
> >> call phy_suspend(). Maybe a call to phy_trigger_machine() is
> >> needed like in phy_start() ? Then the state machine would call
> >> phy_suspend(), provided the link is still up.
> > 
> > Right, phy_stop() merely just moves the state machine to PHY_HALTED and
> > this is actually a great source of problems which I tried to address here:
> > 
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg196061.html
> > 
> > because phy_stop() is not a synchronous call, so when it returns the
> > state machine might still be running (it can take up to a 1 HZ depending
> > on when you called phy_stop()) and so if you took that as a
> > synchronization point to e.g: turn of your Ethernet MAC/MDIO bus clocks,
> > you will likely see problems. phy_stop_machine() would provide that
> > synchronization point, but is not currently exported, despite being a
> > global symbol. This patch series above is all well and good, except that
> > Geert reported issues with suspend/resume interactions which I have not
> > been able to track down.
> > 
> > We should most definitively try to consolidate the different PHY
> > suspend/resume within the Ethernet MAC suspend/resume implementation and
> > document exactly what the appropriate behavior must be under the
> > following circumstances:
> > 
> > - when to call phy_stop() + phy_stop_machine()
> > - when to call phy_suspend() (if the network interface does do not WoL)
> > - when to call phy_resume() (if needed, actually, it usually is not)
> > - when to call phy_start()
> > 
> 
> I think phy_start() / phy_start_machine() / phy_start_interrupts()
> belong together and we may call the latter two functions from phy_start().
> Same for stop.
> 
> This would mean:
> - Remove call to phy_start_interrupts() from phy_connect_direct()
> - Call phy_start_machine() and phy_start_interrupts() from phy_start()
> - mdio_bus_phy_suspend() calls phy_stop()
> Same for stop, plus: phy_error() calls phy_stop().
> 
> In this setup a second call to phy_stop() wouldn't hurt because state
> is PHY_HALTED already and phy_stop() is a no-op.
> 
> A functional change would be that interrupts are disabled during system
> suspend (except WoL because we don't suspend the PHY is this case). 
> 
> These are first thoughts and therefore it's fine if you totally disagree ..
> I didn't test this yet, it's only a "Gedankenexperiment" so far.
> 
> When talking about suspend/resume I think we talked about system suspend /
> resume. However I think we need to consider also runtime pm.
> If a link is down the network driver may decide to runtime-suspend the PHY
> (power it down). In case of runtime pm I'd say we need to keep irq and
> workqueue active to be able to react if a cable is plugged in and the PHY
> wakes up automatically and establishes a link.

Maybe a better solution now would be to restore phy_resume()'s lock-
taking behaviour, and provide a lockless __phy_resume() which can be
used internally within phylib.  This means drivers using phy_resume()
would see no change.  Maybe something like (untested):

 drivers/net/phy/phy.c        |  2 +-
 drivers/net/phy/phy_device.c | 17 ++++++++++++++---
 include/linux/phy.h          |  1 +
 3 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index f3313a129531..4574d02dce93 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -819,7 +819,7 @@ void phy_start(struct phy_device *phydev)
 		break;
 	case PHY_HALTED:
 		/* if phy was suspended, bring the physical link up again */
-		phy_resume(phydev);
+		__phy_resume(phydev);
 
 		/* make sure interrupts are re-enabled for the PHY */
 		if (phydev->irq != PHY_POLL) {
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index b13eed21c87d..ae0f9306bbdc 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -136,7 +136,7 @@ static int mdio_bus_phy_resume(struct device *dev)
 		goto no_resume;
 
 	mutex_lock(&phydev->lock);
-	ret = phy_resume(phydev);
+	ret = __phy_resume(phydev);
 	mutex_unlock(&phydev->lock);
 	if (ret < 0)
 		return ret;
@@ -1042,7 +1042,7 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
 		goto error;
 
 	mutex_lock(&phydev->lock);
-	phy_resume(phydev);
+	__phy_resume(phydev);
 	mutex_unlock(&phydev->lock);
 	phy_led_triggers_register(phydev);
 
@@ -1172,7 +1172,7 @@ int phy_suspend(struct phy_device *phydev)
 }
 EXPORT_SYMBOL(phy_suspend);
 
-int phy_resume(struct phy_device *phydev)
+int __phy_resume(struct phy_device *phydev)
 {
 	struct phy_driver *phydrv = to_phy_driver(phydev->mdio.dev.driver);
 	int ret = 0;
@@ -1189,6 +1189,17 @@ int phy_resume(struct phy_device *phydev)
 
 	return ret;
 }
+
+int phy_resume(struct phy_device *phydev)
+{
+	int ret;
+
+	mutex_lock(&phydev->lock);
+	ret = __phy-resume(phydev);
+	mutex_unlock(&phydev->lock);
+
+	return ret;
+}
 EXPORT_SYMBOL(phy_resume);
 
 int phy_loopback(struct phy_device *phydev, bool enable)
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 5a0c3e53e7c2..8f82bd64f82d 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -923,6 +923,7 @@ static inline void phy_device_free(struct phy_device *phydev) { }
 void phy_device_remove(struct phy_device *phydev);
 int phy_init_hw(struct phy_device *phydev);
 int phy_suspend(struct phy_device *phydev);
+int __phy_resume(struct phy_device *phydev);
 int phy_resume(struct phy_device *phydev);
 int phy_loopback(struct phy_device *phydev, bool enable);
 struct phy_device *phy_attach(struct net_device *dev, const char *bus_id,


-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up