netdev - Re: [PATCH] net: usbnet: Avoid potential RCU stall on LINK

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fbd03180-cca0-4a0f-8fd9-4daf5ff28ff5@actia.se>
Date: Wed, 16 Jul 2025 14:54:46 +0000
From: John Ernberg <john.ernberg@...ia.se>
To: Jakub Kicinski <kuba@...nel.org>, Oliver Neukum <oneukum@...e.com>
CC: Andrew Lunn <andrew+netdev@...n.ch>, "David S . Miller"
	<davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, Paolo Abeni
	<pabeni@...hat.com>, Ming Lei <ming.lei@...onical.com>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"linux-usb@...r.kernel.org" <linux-usb@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"stable@...r.kernel.org" <stable@...r.kernel.org>
Subject: Re: [PATCH] net: usbnet: Avoid potential RCU stall on LINK_CHANGE
 event

Hi Jakub, Oliver,

On 7/15/25 3:54 PM, Jakub Kicinski wrote:
> On Tue, 15 Jul 2025 07:15:51 +0000 John Ernberg wrote:
>>> I'm worried that this is still racy.
>>> Since usbnet_bh checks if carrier is ok and __handle_link_change()
>>> checks the opposite something must be out of sync if both run.
>>> Most likely something restored the carrier while we're still handling
>>> the previous carrier loss.
>>
>> There could definitely be other factors, I'll try to dig some in
>> cdc_ether and see if something there could be causing problems for the
>> usbnet core.
>> I honestly kinda stopped digging when I found unlink_urbs() being
>> wrapped with a pause/unpause at another place tied to a commit seeing a
>> similar issue.
> 
> Looking at cdc_ether:
> 
> static void usbnet_cdc_zte_status(struct usbnet *dev, struct urb *urb)
> [...]
>          if (event->wValue &&
>              netif_carrier_ok(dev->net))
>                  netif_carrier_off(dev->net);
> 
> This looks sus. Is the Gemalto Cinterion PLS83-W ZTE based?
> 

The modem isn't ZTE-based, but I did spot that function as well when 
checking cdc_ether.

The modem is advertising USB High Speed with a bInterval of 5, which in 
usbnet gets bumped to 7 to avoid too aggressive polling, if I read that 
code right.

Through some rudimentary logging with the inlined patch on 6.12.20 (mail 
client mangled):

-----------------8<----------------------

diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
index a6469235d904..7b4827d2c797 100644
--- a/drivers/net/usb/cdc_ether.c
+++ b/drivers/net/usb/cdc_ether.c
@@ -418,8 +418,10 @@ void usbnet_cdc_status(struct usbnet *dev, struct 
urb *urb)
         case USB_CDC_NOTIFY_NETWORK_CONNECTION:
                 netif_dbg(dev, timer, dev->net, "CDC: carrier %s\n",
                           event->wValue ? "on" : "off");
-               if (netif_carrier_ok(dev->net) != !!event->wValue)
+               if (netif_carrier_ok(dev->net) != !!event->wValue) {
+                       netdev_err(dev->net, "network connection %d\n", 
(int)event->wValue);
                         usbnet_link_change(dev, !!event->wValue, 0);
+               }
                 break;
         case USB_CDC_NOTIFY_SPEED_CHANGE:       /* tx/rx rates */
                 netif_dbg(dev, timer, dev->net, "CDC: speed change (len 
%d)\n",
diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index 1e99a6ef4945..dc3e3d87b4f7 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -721,6 +721,7 @@ static int unlink_urbs (struct usbnet *dev, struct 
sk_buff_head *q)
         struct sk_buff          *skb;
         int                     count = 0;

+       netdev_err(dev->net, "unlink urb start: %d devflags=%lx\n", 
skb_queue_len_lockless(q), dev->flags);
         spin_lock_irqsave (&q->lock, flags);
         while (!skb_queue_empty(q)) {
                 struct skb_data         *entry;
@@ -757,6 +758,7 @@ static int unlink_urbs (struct usbnet *dev, struct 
sk_buff_head *q)
                 spin_lock_irqsave(&q->lock, flags);
         }
         spin_unlock_irqrestore (&q->lock, flags);
+       netdev_err(dev->net, "unlink urb counted %d\n", count);
         return count;
  }

-----------------8<----------------------

I ended up with the following log:

[   23.823289] cdc_ether 1-1.1:1.8 wwan0: network connection 0
[   23.830874] cdc_ether 1-1.1:1.8 wwan0: unlink urb start: 5 devflags=1880
[   23.840148] cdc_ether 1-1.1:1.8 wwan0: unlink urb counted 5
[   25.356741] cdc_ether 1-1.1:1.8 wwan0: network connection 1
[   25.364745] cdc_ether 1-1.1:1.8 wwan0: network connection 0
[   25.371106] cdc_ether 1-1.1:1.8 wwan0: unlink urb start: 5 devflags=880
[   25.378710] cdc_ether 1-1.1:1.8 wwan0: network connection 1
[   51.422757] rcu: INFO: rcu_sched self-detected stall on CPU
[   51.429081] rcu:     0-....: (6499 ticks this GP) 
idle=da7c/1/0x4000000000000000 softirq=2067/2067 fqs=2668
[   51.439717] rcu:              hardirqs   softirqs   csw/system
[   51.445897] rcu:      number:    62096      59017            0
[   51.452107] rcu:     cputime:        0      11397         1470   ==> 
12996(ms)
[   51.459852] rcu:     (t=6500 jiffies g=2397 q=663 ncpus=2)

 From a USB capture where the stall didn't happen I can see:
* A bunch of CDC_NETWORK_CONNECTION events with Disconnected state (0).
* Then a CDC_NETWORK_CONNECTION event with Connected state (1) once the 
WWAN interface is turned on by the modem.
* Followed by a Disconnected in the next USB INTR poll.
* Followed by a Connected in the next USB INTR poll.
(I'm not sure if I can achieve a different timing with enough captures 
or a faster system)

Which makes the off and on LINK_CHANGE events race on our system (ARM64 
based, iMX8QXP) as they cannot be handled fast enough. Nothing stops 
usbnet_link_change() from being called while the deferred work is running.

As Oliver points out usbnet_resume_rx() causes scheduling which seems 
unnecessary or maybe even inappropriate for all cases except when the 
carrier was turned on during the race.

I gave the ZTE modem quirk a go anyway, despite the comment explaining a 
different situation than what I am seeing, and it has no observable 
effect on this RCU stall.

Currently drawing a blank on what the correct fix would be.

Best regards // John Ernberg