netdev - Re: [PATCH v2] usbnet: fix kernel crash after disconnect


Date:   Mon, 20 May 2019 20:09:22 -0400 (EDT)
From:   David Miller <davem@...emloft.net>
To:     Jan.Kloetzke@...h.de
Cc:     oneukum@...e.com, jan@...etzke.net, netdev@...r.kernel.org,
        linux-usb@...r.kernel.org
Subject: Re: [PATCH v2] usbnet: fix kernel crash after disconnect

From: Kloetzke Jan <Jan.Kloetzke@...h.de>
Date: Thu, 16 May 2019 07:10:30 +0000

> Am Montag, den 06.05.2019, 10:17 +0200 schrieb Oliver Neukum:
>> On So, 2019-05-05 at 00:45 -0700, David Miller wrote:
>> > From: Kloetzke Jan <Jan.Kloetzke@...h.de>
>> > Date: Tue, 30 Apr 2019 14:15:07 +0000
>> > 
>> > > @@ -1431,6 +1432,11 @@ netdev_tx_t usbnet_start_xmit (struct sk_buff *skb,
>> > >               spin_unlock_irqrestore(&dev->txq.lock, flags);
>> > >               goto drop;
>> > >       }
>> > > +     if (WARN_ON(netif_queue_stopped(net))) {
>> > > +             usb_autopm_put_interface_async(dev->intf);
>> > > +             spin_unlock_irqrestore(&dev->txq.lock, flags);
>> > > +             goto drop;
>> > > +     }
>> > 
>> > If this is known to happen and is expected, then we should not warn.
>> > 
>> 
>> yes this is the point. Can ndo_start_xmit() and ndo_stop() race?
>> If not, why does the patch fix the observed issue and what
>> prevents the race? Something is not clear here.
> 
> Dave, could you shed some light on Oliver's question? If the race can
> happen then we can stick to v1 because the WARN_ON is indeed pointless.
> Otherwise it's not clear why it made the problem go away for us and v2
> may be the better option...

Yes, I think they can race. ->ndo_stop() executes and stops the queue,
then we get an RCU grace period so that all parallel executions of
->ndo_start_xmit() complete.

But I wonder whether this can cause problems, because some drivers have
"stop queue and re-check" logic. For example, in
drivers/net/ethernet/broadcom/tg3.c we have:

	if (unlikely(tg3_tx_avail(tnapi) <= (MAX_SKB_FRAGS + 1))) {
		netif_tx_stop_queue(txq);

		/* netif_tx_stop_queue() must be done before checking
		 * tx index in tg3_tx_avail() below, because in
		 * tg3_tx(), we update tx index before checking for
		 * netif_tx_queue_stopped().
		 */
		smp_mb();
		if (tg3_tx_avail(tnapi) > TG3_TX_WAKEUP_THRESH(tnapi))
			netif_tx_wake_queue(txq);
	}

which in the racy scenario would undo ->ndo_stop()'s work, which is
completely unexpected.
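
To make the interleaving concrete, here is a rough userspace sketch of it.
It is only a model under stated assumptions: struct model_txq, the two
thresholds and all model_* functions are hypothetical names invented for
illustration, not kernel APIs, and the steps run in one fixed order on a
single thread instead of racing for real.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a TX queue; not a kernel structure. */
struct model_txq {
	bool stopped;	/* models the "queue stopped" state */
	int avail;	/* free TX descriptors */
};

#define MODEL_STOP_THRESH	4	/* stop when avail <= this */
#define MODEL_WAKE_THRESH	8	/* wake when avail > this */

/* xmit, part 1: consume a descriptor and stop the queue if nearly full. */
static bool model_xmit_maybe_stop(struct model_txq *q)
{
	q->avail--;
	if (q->avail <= MODEL_STOP_THRESH) {
		q->stopped = true;
		return true;	/* caller must do the re-check */
	}
	return false;
}

/* xmit, part 2: the re-check. A real driver has smp_mb() before it. */
static void model_xmit_recheck(struct model_txq *q)
{
	if (q->avail > MODEL_WAKE_THRESH)
		q->stopped = false;	/* models netif_tx_wake_queue() */
}

/* TX completion path: descriptors are returned to the ring. */
static void model_tx_complete(struct model_txq *q, int freed)
{
	q->avail += freed;
}

/* Models the queue-stopping part of ->ndo_stop(). */
static void model_ndo_stop(struct model_txq *q)
{
	q->stopped = true;
}

/* Replays the problematic order; returns the final stopped state. */
bool model_queue_stopped_after_race(void)
{
	struct model_txq q = { .stopped = false, .avail = 5 };
	bool need_recheck;

	assert(!q.stopped);
	need_recheck = model_xmit_maybe_stop(&q);	/* xmit stops queue */
	model_tx_complete(&q, 8);			/* ring drains */
	model_ndo_stop(&q);				/* ->ndo_stop() runs */
	if (need_recheck)
		model_xmit_recheck(&q);			/* xmit wakes queue */

	return q.stopped;
}
```

In this model model_queue_stopped_after_race() returns false: the
in-flight xmit's re-check wakes the queue after the modelled ->ndo_stop()
has stopped it.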

Hmmm...