linux-kernel - [RFC PATCH] usb: gadget: u_ether: prevent deadlock under RT

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250219181556.1020029-1-ralph.siemsen@linaro.org>
Date: Wed, 19 Feb 2025 13:15:52 -0500
From: Ralph Siemsen <ralph.siemsen@...aro.org>
To: linux-rt-devel@...ts.linux.dev,
	linux-usb@...r.kernel.org,
	linux-kernel@...r.kernel.org,
	Pawel Laszczak <pawell@...ence.com>,
	Frank Li <Frank.Li@....com>,
	Ferry Toth <ftoth@...londelft.nl>
Cc: Sebastian Andrzej Siewior <bigeasy@...utronix.de>,
	Thomas Gleixner <tglx@...utronix.de>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ralph Siemsen <ralph.siemsen@...aro.org>
Subject: [RFC PATCH] usb: gadget: u_ether: prevent deadlock under RT

A deadlock can be readily triggered when using NCM gadget with the
Cadence cdns3 USB driver, under heavy traffic from "iperf3 --bidir".
It has been observed under 6.1, 6.6 and 6.12 kernel versions, but
only on PREEMPT_RT. Once deadlocked, even magicsysrq has no effect.
However, there is periodic output from the rcu stall detector:

[   71.105941] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[   71.105966] rcu:     Tasks blocked on level-0 rcu_node (CPUs 0-1): P125/4:b..l
[   71.105992] rcu:     (detected by 1, t=5252 jiffies, g=-515, q=7 ncpus=2)
[   71.106003] task:irq/507-s-f4000 state:D stack:0     pid:125   tgid:125   ppid:2      flags:0x00000008
[   71.106018] Call trace:
[   71.106022]  __switch_to+0xf4/0x158
[   71.106046]  __schedule+0x2b4/0x920
[   71.106055]  schedule_rtlock+0x24/0x50
[   71.106064]  rtlock_slowlock_locked+0x348/0xcb8
[   71.106077]  rt_spin_lock+0x88/0xb8
[   71.106086]  eth_start_xmit+0x30/0x1490 [u_ether]        /*****/
[   71.106112]  ncm_tx_timeout+0x2c/0x50 [usb_f_ncm]
[   71.106131]  __hrtimer_run_queues+0x180/0x378
[   71.106143]  hrtimer_run_softirq+0x90/0x100
[   71.106151]  handle_softirqs.isra.0+0x14c/0x360
[   71.106165]  __local_bh_enable_ip+0x104/0x118
[   71.106177]  __netdev_alloc_skb+0x1e0/0x210
[   71.106192]  ncm_unwrap_ntb+0x1ec/0x528 [usb_f_ncm]
[   71.106206]  rx_complete+0x120/0x288 [u_ether]           /*****/
[   71.106221]  usb_gadget_giveback_request+0x34/0xf8
[   71.106236]  cdns3_gadget_giveback+0xe4/0x2d0 [cdns3]
[   71.106286]  cdns3_transfer_completed+0x3b0/0x630 [cdns3]
[   71.106320]  cdns3_device_thread_irq_handler+0x8b8/0xd18 [cdns3]
[   71.106353]  irq_thread_fn+0x34/0xb8
[   71.106364]  irq_thread+0x180/0x2f0
[   71.106374]  kthread+0x104/0x118
[   71.106384]  ret_from_fork+0x10/0x20

The deadlock occurs because eth_start_xmit() and rx_complete() both
acquire the same spinlock in the same instance of struct eth_dev.
The nested call occurs because rx_complete() calls __netdev_alloc_skb()
which performs a brief local_bh_disable/enable() sequence.

Prevent the deadlock by disabling softirq in rx_complete(), with the
same scope as the spinlock. With this fix, no deadlocks are observed
over many hours of continuous testing at USB 2.0 speed (480 Mbit/s).

Signed-off-by: Ralph Siemsen <ralph.siemsen@...aro.org>
---
Discussion items:

0) This is somewhat similar to the recent discussion
https://lore.kernel.org/linux-rt-devel/20250212123302.0f620f23@gandalf.local.home/
and my solution is to "sprinkle local_bh_disable() around",
which is clearly not optimal. Hence the RFC on this patch.

1) There are several more places using the same spinlock, for example
the gether_suspend() and gether_resume() functions. I'm not using power
management, but I wonder if there might be more deadlocks lurking?

2) By keeping softirq disabled for a longer duration, this patch could
potentially increase the RT latency. I tried to quantify this using
cyclictest, but I cannot get a baseline for comparison, since it
deadlocks almost immediately when the USB is active. Other suggestions?

3) The fix is in generic u_ether.c code, to match the scope of the
spinlock. Alternatively, it could be done in cdns3 specific code,
such as in cdns3_device_thread_irq_handler(). I'm not sure if the
same problem exists in other USB drivers?
---
 drivers/usb/gadget/function/u_ether.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/usb/gadget/function/u_ether.c b/drivers/usb/gadget/function/u_ether.c
index 09e2838917e2..dc511c01b741 100644
--- a/drivers/usb/gadget/function/u_ether.c
+++ b/drivers/usb/gadget/function/u_ether.c
@@ -233,6 +233,7 @@ static void rx_complete(struct usb_ep *ep, struct usb_request *req)
 		if (dev->unwrap) {
 			unsigned long	flags;

+			local_bh_disable();
 			spin_lock_irqsave(&dev->lock, flags);
 			if (dev->port_usb) {
 				status = dev->unwrap(dev->port_usb,
@@ -243,6 +244,7 @@ static void rx_complete(struct usb_ep *ep, struct usb_request *req)
 				status = -ENOTCONN;
 			}
 			spin_unlock_irqrestore(&dev->lock, flags);
+			local_bh_enable();
 		} else {
 			skb_queue_tail(&dev->rx_frames, skb);
 		}
-- 
2.45.2.121.gc2b3f2b3cd