[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220321100226.GA19177@wunner.de>
Date: Mon, 21 Mar 2022 11:02:26 +0100
From: Lukas Wunner <lukas@...ner.de>
To: Andrew Lunn <andrew@...n.ch>
Cc: Oleksij Rempel <o.rempel@...gutronix.de>,
Oliver Neukum <oneukum@...e.com>,
Oleksij Rempel <linux@...pel-privat.de>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
Heiner Kallweit <hkallweit1@...il.com>
Subject: Re: ordering of call to unbind() in usbnet_disconnect
On Tue, Mar 15, 2022 at 02:28:32PM +0100, Andrew Lunn wrote:
> > > > It was linked to unregistered/freed netdev. This is why my patch
> > > > changing the order to call phy_disconnect() first and then
> > > > unregister_netdev().
> > >
> > > Unregistered yes, but freed no. Here's the order before 2c9d6c2b871d:
> > >
> > > usbnet_disconnect()
> > > unregister_netdev()
> > > ax88772_unbind()
> > > phy_disconnect()
> > > free_netdev()
> > >
> > > Is it illegal to disconnect a PHY from an unregistered, but not yet freed
> > > net_device?
>
> There are drivers which unregistering and then calling
> phy_disconnect. In general that should be a valid pattern. But more
> MAC drivers actually connect the PHY on open and disconnect it on
> close. So it is less well used.
It turns out that unregistering a net_device and then calling
phy_disconnect() may indeed crash and is thus not permitted
right now:
Oleksij added the following code comment with commit e532a096be0e
("net: usb: asix: ax88772: add phylib support"):
/* On unplugged USB, we will get MDIO communication errors and the
* PHY will be set in to PHY_HALTED state.
So the USB adapter gets unplugged, access to MII registers fails with
-ENODEV, phy_error() is called, phy_state_machine() transitions to
PHY_HALTED and performs the following call:
phy_state_machine()
phy_link_down()
phy_link_change()
netif_carrier_off()
linkwatch_fire_event()
Asynchronously, usbnet_disconnect() calls phy_detach() and then
free_netdev().
A bit later, linkwatch_event() runs and tries to access the freed
net_device, leading to the crash that Oleksij posted upthread.
The fact that linkwatch_fire_event() increments the refcount doesn't
help because unregister_netdevice() has already run (it waits for
the refcount to drop to 1).
My suggestion would be to amend unregister_netdevice() to set
dev->phydev->attached_dev = NULL. It may also be a good idea
to WARN_ON() in free_netdev() if the refcount is not 1.
Andrew, please clarify whether you really think that the
"unregister netdev, then detach phy" order should be supported.
If you do think that it should be supported, we'll have to litter
phylib with NULL pointer checks for attached_dev. If you don't
want that, we should at least document that it's an illegal pattern.
Even if you decide that we should rather declare this pattern
illegal rather than littering phylib with NULL pointer checks,
I strongly recommend that at least unregister_netdevice() sets
dev->phydev->attached_dev = NULL. That will cause oopses which
are easier to debug than complex races like the one witnessed
by Oleksij.
Side note: Since e532a096be0e, ax88772_stop() directly accesses
phydev->state. I wonder whether that's legal. I'm under the
impression that the state is internal to phylib.
Another side note: The commit message of 2c9d6c2b871d is poor.
It should have contained a stack trace and a clear explanation.
It took me days of staring at code to reverse-engineer what
went wrong here.
Thanks,
Lukas
Powered by blists - more mailing lists