netdev - Re: ordering of call to unbind() in usbnet

Message-ID: <Yj8KnE5BeEK1SXDP@lunn.ch>
Date:   Sat, 26 Mar 2022 13:44:12 +0100
From:   Andrew Lunn <andrew@...n.ch>
To:     Lukas Wunner <lukas@...ner.de>
Cc:     Oleksij Rempel <o.rempel@...gutronix.de>,
        Oliver Neukum <oneukum@...e.com>,
        Oleksij Rempel <linux@...pel-privat.de>,
        "netdev@...r.kernel.org" <netdev@...r.kernel.org>,
        Heiner Kallweit <hkallweit1@...il.com>
Subject: Re: ordering of call to unbind() in usbnet_disconnect

On Sat, Mar 26, 2022 at 01:25:52PM +0100, Lukas Wunner wrote:
> On Tue, Mar 15, 2022 at 12:38:41PM +0100, Oleksij Rempel wrote:
> > On Tue, Mar 15, 2022 at 09:32:34AM +0100, Lukas Wunner wrote:
> > > > > > > On Thu, Mar 10, 2022 at 12:25:08PM +0100, Oliver Neukum wrote:
> > > > > > > > I got bug reports that 2c9d6c2b871d ("usbnet: run unbind() before
> > > > > > > > unregister_netdev()") is causing regressions.
> > > 
> > > Is it illegal to disconnect a PHY from an unregistered, but not yet freed
> > > net_device?
> > > 
> > > Oleksij, the commit message of 2c9d6c2b871d says that disconnecting the
> > > PHY "fails" in that situation.  Please elaborate what the failure looked
> > > like.  Did you get a stacktrace?
> 
> Oleksij, I cannot reproduce your stacktrace (included in full length below).
> I've tested with kernel 5.13 (since the stacktrace was with 5.13-rc3)
> with all your (and other people's) asix patches applied on top,
> except for 2c9d6c2b871d.  Tried unplugging an AX88772A multiple times,
> never got a stacktrace.
> 
> I've also walked down the code paths from usbnet_disconnect() and cannot
> see how the stacktrace could occur.
> 
> Normally an unregistering netdev is removed from the linkwatch event list
> (lweventlist) via this call stack:
> 
>           usbnet_disconnect()
>             unregister_netdev()
>               rtnl_unlock()
>                 netdev_run_todo()
>                   netdev_wait_allrefs()
>                     linkwatch_forget_dev()
>                       linkwatch_do_dev()
> 
> For the stacktrace to occur, the netdev would have to be subsequently
> re-added to the linkwatch event list via linkwatch_fire_event().

Hi Lukas

What you might be missing is a call to phy_error().
 
> That is called, among other places, from netif_carrier_off().  However,
> netif_carrier_off() is already called *before* linkwatch_forget_dev()
> when unregister_netdev() stops the netdev before unregistering it:
> 
>           usbnet_disconnect()
>             unregister_netdev()
>               unregister_netdevice()
>                 unregister_netdevice_queue(dev, NULL)
>                   unregister_netdevice_many()
>                     dev_close_many()
>                       __dev_close_many()
>                         usbnet_stop()
>                           ax88772_stop()
>                             phy_stop() # state = PHY_HALTED
>                               phy_state_machine()

I'm guessing the problem is somewhere around here.

If the state machine calls into the PHY driver, and the PHY driver
performs an MDIO bus transaction, and that returns an error, -ENODEV
or -EIO for example, because the USB device has gone away, then the
result is a call to phy_error().

void phy_error(struct phy_device *phydev)
{
        WARN_ON(1);

        mutex_lock(&phydev->lock);
        phydev->state = PHY_HALTED;
        mutex_unlock(&phydev->lock);

        phy_trigger_machine(phydev);
}

That will trigger the PHY state machine to run again, asynchronously.

The end of phy_stop() says:

        /* Cannot call flush_scheduled_work() here as desired because
         * of rtnl_lock(), but PHY_HALTED shall guarantee irq handler
         * will not reenable interrupts.
         */

so it looks like the state machine will run again, and potentially use
the netdev.

If the MDIO bus driver is no longer returning -ENODEV, then we should
avoid this.
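
To make the suspected sequence concrete, here is a small userspace toy
model. It is not kernel code and does not reproduce the reported
stacktrace. Every toy_* name is hypothetical, and the single boolean
work_pending only stands in for the queued state machine work. It
assumes, as guessed above, that the state machine pass made from
phy_stop() performs an MDIO access that fails once the USB device is
gone.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: all toy_* names are hypothetical, not kernel API. */
enum toy_phy_state { TOY_PHY_RUNNING, TOY_PHY_HALTED };

struct toy_phy {
        enum toy_phy_state state;
        bool usb_present;   /* false once the adapter has been unplugged */
        bool work_pending;  /* stands in for queued state machine work */
};

/* MDIO read tunnelled over USB: fails once the device is gone. */
int toy_mdio_read(const struct toy_phy *phy)
{
        assert(phy != NULL);
        return phy->usb_present ? 0 : -ENODEV;
}

/* Same shape as phy_error(): halt, then kick the state machine. */
void toy_phy_error(struct toy_phy *phy)
{
        phy->state = TOY_PHY_HALTED;
        phy->work_pending = true;   /* models phy_trigger_machine() */
}

/* One pass of the state machine; an MDIO failure ends in toy_phy_error(). */
void toy_phy_state_machine(struct toy_phy *phy)
{
        phy->work_pending = false;
        if (toy_mdio_read(phy) < 0)
                toy_phy_error(phy);
}

/* Like phy_stop(): runs the state machine once, then returns without
 * flushing whatever work that pass may have queued.
 */
void toy_phy_stop(struct toy_phy *phy)
{
        phy->state = TOY_PHY_HALTED;
        toy_phy_state_machine(phy);
}

/* Returns true if work is still queued after toy_phy_stop() returned. */
bool toy_stop_leaves_work_pending(bool usb_present)
{
        struct toy_phy phy = {
                .state = TOY_PHY_RUNNING,
                .usb_present = usb_present,
                .work_pending = false,
        };

        toy_phy_stop(&phy);
        return phy.work_pending;
}
```

In this model toy_stop_leaves_work_pending(false) reports work still
queued after the stop has returned, which is the window in which a
later run could touch the netdev. With the device still present,
nothing is left queued.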

      Andrew