Message-ID: <Za6eMg0y2QxogfmD@shell.armlinux.org.uk>
Date: Mon, 22 Jan 2024 16:56:18 +0000
From: "Russell King (Oracle)" <linux@...linux.org.uk>
To: Andrew Lunn <andrew@...n.ch>
Cc: Rafał Miłecki <zajec5@...il.com>,
Network Development <netdev@...r.kernel.org>,
Heiner Kallweit <hkallweit1@...il.com>,
Robert Marko <robimarko@...il.com>,
Ansuel Smith <ansuelsmth@...il.com>,
Daniel Golle <daniel@...rotopia.org>
Subject: Re: Race in PHY subsystem? Attaching to PHY devices before they get probed

On Mon, Jan 22, 2024 at 03:12:42PM +0100, Andrew Lunn wrote:
> On Mon, Jan 22, 2024 at 08:09:58AM +0100, Rafał Miłecki wrote:
> > Hi!
> >
> > I have MT7988 SoC board with following problem:
> > [ 26.887979] Aquantia AQR113C mdio-bus:08: aqr107_wait_reset_complete failed: -110
> >
> > This issue is known to occur when PHY's firmware is not running. After
> > some debugging I discovered that .config_init() CB gets called while
> > .probe() CB is still being executed.
> >
> > It turns out mtk_soc_eth.c calls phylink_of_phy_connect() before my PHY
> > gets fully probed and it seems there is nothing in PHY subsystem
> > verifying that. Please note this PHY takes quite some time to probe as
> > it involves sending firmware to hardware.
> >
> > Is that a possible race in PHY subsystem?
>
> Seems like it.
>
> There is a patch "net: phylib: get rid of unnecessary locking" which
> removed locks from probe, which might of helped, but the patch also
> says:
>
> The locking in phy_probe() and phy_remove() does very little to prevent
> any races with e.g. phy_attach_direct(),
>
> suggesting it probably did not help.

The reason for that statement is that phy_attach_direct() doesn't
take phydev->lock _at all_, so taking the lock in phy_probe() has no
effect on any race with phy_attach_direct().

> I think the traditional way problems like this are avoided is that the
> device should not be visible to the rest of the system until probe has
> completed.

However, we have the problem of the generic driver fallback, which
phy_attach_direct() implements.

The probe vs phy_attach_direct() interaction has been racy for quite a
long time, and the poking about that's done in that function, such as
assigning struct device's driver member, calling device_bind_driver()
etc, is all hellishly racy if the phy_device _could_ be bound
simultaneously.

Also note this... we call device_bind_driver() from phy_attach_direct(),
and the documentation for this function states:

 * This function must be called with the device lock held.

which we don't do. So we're already violating the locking requirements
for the driver model.

So, I would suggest that the solution is to make use of device_lock(),
which will also only return once any probe in progress has completed.

However, that's still not ideal - because the fact we have a race here
means that what could happen is that phy_attach_direct() is called
a little earlier than the probe begins, and the phy device ends up
being bound to the generic PHY driver rather than its specific driver.
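
To make the ordering problem concrete, here is a minimal userspace
model in C. It is not kernel code. It assumes the two paths are already
serialised (which is what taking device_lock() in both would give us),
so the only thing left to vary is which one runs first. All model_*
names are invented for illustration, and "a later probe finds the
device already bound and does nothing" is a simplification of what the
driver core does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a PHY driver; only the name matters here. */
struct model_driver {
	const char *name;
};

/* Hypothetical stand-in for a phy_device: NULL drv means "unbound". */
struct model_phy {
	const struct model_driver *drv;
};

static const struct model_driver model_generic = { "generic PHY driver" };
static const struct model_driver model_specific = { "specific PHY driver" };

/*
 * Models the specific driver's probe. In this model a device that is
 * already bound is left alone and the probe reports failure (-1).
 */
static int model_probe(struct model_phy *phy)
{
	assert(phy != NULL);
	if (phy->drv)
		return -1;
	phy->drv = &model_specific;
	return 0;
}

/*
 * Models the fallback in the attach path: if nothing is bound yet,
 * bind the generic driver. Returns whichever driver ends up bound.
 */
static const struct model_driver *model_attach(struct model_phy *phy)
{
	assert(phy != NULL);
	if (!phy->drv)
		phy->drv = &model_generic;
	return phy->drv;
}
```

Calling model_probe() and then model_attach() leaves the specific
driver bound. Calling them in the opposite order leaves the generic
driver bound, and the later model_probe() has nothing left to do, even
though neither call ever overlapped the other.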

I think what this comes down to are the following points:

1) not using the required device model locking
2) the mere existence of the default driver makes for a race between
   the PHY being attached and its driver being probed.

No amount of locking saves us from (2) - the only solutions that I can
see to this are:

1) to put up with there being such a race
2) get rid of the default drivers altogether and insist that we have
   specific PHY drivers for _all_ PHYs
3) have some kind of retry mechanism

A further problem is... we can't simply return -EPROBE_DEFER from
phy_attach_direct() because this function is not necessarily called
from probe functions - it may be called from the .ndo_open method which
has no idea how to handle a probe deferral. Moreover, returning an
error to userspace will just cause it to fail (because all errors
from trying to bring a netdev up are considered to be fatal.)
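
As a rough illustration of what "some kind of retry mechanism" could
mean when the caller can't be asked to come back later, here is a
userspace C model of a bounded wait inside the attach path. This is a
hypothetical sketch only: nothing in this thread proposes this design,
all model_* names are invented, and the poll counter merely stands in
for elapsed time while a slow probe (e.g. a firmware upload) runs.

```c
#include <assert.h>
#include <stddef.h>

/* Distinct objects so callers can tell the outcomes apart by address. */
static const char model_generic[] = "generic";
static const char model_specific[] = "specific";

/* Hypothetical model of a PHY whose probe takes a while to finish. */
struct model_phy {
	const char *drv;	/* NULL until a driver is bound */
	int probe_done_after;	/* simulated: probe completes at this poll */
	int polls;		/* poll intervals elapsed so far */
};

/* Simulate one poll interval passing; the probe may complete in it. */
static void model_poll(struct model_phy *phy)
{
	phy->polls++;
	if (!phy->drv && phy->polls >= phy->probe_done_after)
		phy->drv = model_specific;
}

/*
 * Wait for at most max_polls intervals for the specific driver to
 * bind, then fall back to the generic driver rather than returning
 * an error the caller cannot handle.
 */
static const char *model_attach_retry(struct model_phy *phy, int max_polls)
{
	assert(phy != NULL);
	for (int i = 0; i < max_polls && !phy->drv; i++)
		model_poll(phy);
	if (!phy->drv)
		phy->drv = model_generic;
	return phy->drv;
}
```

With a budget of 5 polls, a probe that completes on the 3rd poll ends
up with the specific driver, while one that would need 10 polls still
falls back to the generic driver. So a bound on the wait only narrows
the race, it does not remove it.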
So, it's a really yucky problem, and I don't see any nice and simple
solution.
--
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!