[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <fbd03180-cca0-4a0f-8fd9-4daf5ff28ff5@actia.se>
Date: Wed, 16 Jul 2025 14:54:46 +0000
From: John Ernberg <john.ernberg@...ia.se>
To: Jakub Kicinski <kuba@...nel.org>, Oliver Neukum <oneukum@...e.com>
CC: Andrew Lunn <andrew+netdev@...n.ch>, "David S . Miller"
<davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, Paolo Abeni
<pabeni@...hat.com>, Ming Lei <ming.lei@...onical.com>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"linux-usb@...r.kernel.org" <linux-usb@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"stable@...r.kernel.org" <stable@...r.kernel.org>
Subject: Re: [PATCH] net: usbnet: Avoid potential RCU stall on LINK_CHANGE
event
Hi Jakub, Oliver,
On 7/15/25 3:54 PM, Jakub Kicinski wrote:
> On Tue, 15 Jul 2025 07:15:51 +0000 John Ernberg wrote:
>>> I'm worried that this is still racy.
>>> Since usbnet_bh checks if carrier is ok and __handle_link_change()
>>> checks the opposite something must be out of sync if both run.
>>> Most likely something restored the carrier while we're still handling
>>> the previous carrier loss.
>>
>> There could definitely be other factors, I'll try to dig some in
>> cdc_ether and see if something there could be causing problems for the
>> usbnet core.
>> I honestly kinda stopped digging when I found unlink_urbs() being
>> wrapped with a pause/unpause at another place tied to a commit seeing a
>> similar issue.
>
> Looking at cdc_ether:
>
> static void usbnet_cdc_zte_status(struct usbnet *dev, struct urb *urb)
> [...]
> if (event->wValue &&
> netif_carrier_ok(dev->net))
> netif_carrier_off(dev->net);
>
> This looks sus. Is the Gemalto Cinterion PLS83-W ZTE based?
>
The modem isn't ZTE-based, but I did spot that function as well when
checking cdc_ether.
The modem is advertising USB High Speed with a bInterval of 5, which in
usbnet gets bumped to 7 to avoid too aggressive polling, if I read that
code right.
Through some rudimentary logging with the inlined patch on 6.12.20 (mail
client mangled):
-----------------8<----------------------
diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
index a6469235d904..7b4827d2c797 100644
--- a/drivers/net/usb/cdc_ether.c
+++ b/drivers/net/usb/cdc_ether.c
@@ -418,8 +418,10 @@ void usbnet_cdc_status(struct usbnet *dev, struct
urb *urb)
case USB_CDC_NOTIFY_NETWORK_CONNECTION:
netif_dbg(dev, timer, dev->net, "CDC: carrier %s\n",
event->wValue ? "on" : "off");
- if (netif_carrier_ok(dev->net) != !!event->wValue)
+ if (netif_carrier_ok(dev->net) != !!event->wValue) {
+ netdev_err(dev->net, "network connection %d\n",
(int)event->wValue);
usbnet_link_change(dev, !!event->wValue, 0);
+ }
break;
case USB_CDC_NOTIFY_SPEED_CHANGE: /* tx/rx rates */
netif_dbg(dev, timer, dev->net, "CDC: speed change (len
%d)\n",
diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index 1e99a6ef4945..dc3e3d87b4f7 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -721,6 +721,7 @@ static int unlink_urbs (struct usbnet *dev, struct
sk_buff_head *q)
struct sk_buff *skb;
int count = 0;
+ netdev_err(dev->net, "unlink urb start: %d devflags=%lx\n",
skb_queue_len_lockless(q), dev->flags);
spin_lock_irqsave (&q->lock, flags);
while (!skb_queue_empty(q)) {
struct skb_data *entry;
@@ -757,6 +758,7 @@ static int unlink_urbs (struct usbnet *dev, struct
sk_buff_head *q)
spin_lock_irqsave(&q->lock, flags);
}
spin_unlock_irqrestore (&q->lock, flags);
+ netdev_err(dev->net, "unlink urb counted %d\n", count);
return count;
}
-----------------8<----------------------
I ended up with the following log:
[ 23.823289] cdc_ether 1-1.1:1.8 wwan0: network connection 0
[ 23.830874] cdc_ether 1-1.1:1.8 wwan0: unlink urb start: 5 devflags=1880
[ 23.840148] cdc_ether 1-1.1:1.8 wwan0: unlink urb counted 5
[ 25.356741] cdc_ether 1-1.1:1.8 wwan0: network connection 1
[ 25.364745] cdc_ether 1-1.1:1.8 wwan0: network connection 0
[ 25.371106] cdc_ether 1-1.1:1.8 wwan0: unlink urb start: 5 devflags=880
[ 25.378710] cdc_ether 1-1.1:1.8 wwan0: network connection 1
[ 51.422757] rcu: INFO: rcu_sched self-detected stall on CPU
[ 51.429081] rcu: 0-....: (6499 ticks this GP)
idle=da7c/1/0x4000000000000000 softirq=2067/2067 fqs=2668
[ 51.439717] rcu: hardirqs softirqs csw/system
[ 51.445897] rcu: number: 62096 59017 0
[ 51.452107] rcu: cputime: 0 11397 1470 ==>
12996(ms)
[ 51.459852] rcu: (t=6500 jiffies g=2397 q=663 ncpus=2)
From a USB capture where the stall didn't happen I can see:
* A bunch of CDC_NETWORK_CONNECTION events with Disconnected state (0).
* Then a CDC_NETWORK_CONNECTION event with Connected state (1) once the
WWAN interface is turned on by the modem.
* Followed by a Disconnected in the next USB INTR poll.
* Followed by a Connected in the next USB INTR poll.
(I'm not sure if I can achieve a different timing with enough captures
or a faster system)
Which makes the off and on LINK_CHANGE events race on our system (ARM64
based, iMX8QXP) as they cannot be handled fast enough. Nothing stops
usbnet_link_change() from being called while the deferred work is running.
As Oliver points out usbnet_resume_rx() causes scheduling which seems
unnecessary or maybe even inappropriate for all cases except when the
carrier was turned on during the race.
I gave the ZTE modem quirk a go anyway, despite the comment explaining a
different situation than what I am seeing, and it has no observable
effect on this RCU stall.
Currently drawing a blank on what the correct fix would be.
Best regards // John Ernberg
Powered by blists - more mailing lists