linux-kernel - Re: [RFC PATCH] usb: host: xhci: Release spinlock before xhci

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <5e55d9ba-6b4f-4d92-bb47-04b4a68328d2@linux.intel.com>
Date: Fri, 7 Nov 2025 15:07:03 +0200
From: Mathias Nyman <mathias.nyman@...ux.intel.com>
To: Uttkarsh Aggarwal <uttkarsh.aggarwal@....qualcomm.com>,
 Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Cc: linux-usb@...r.kernel.org, linux-kernel@...r.kernel.org,
 wesley.cheng@....qualcomm.com
Subject: Re: [RFC PATCH] usb: host: xhci: Release spinlock before
 xhci_handshake in command ring abort

On 11/6/25 11:57, Uttkarsh Aggarwal wrote:
> 
> 
> On 10/22/2025 6:19 PM, Mathias Nyman wrote:
>> On 10/22/25 13:00, Uttkarsh Aggarwal wrote:
>>> Currently xhci_handshake is a polling loop that waits for change of state.
>>> If this loop is executed while holding a spinlock with IRQs disabled, it
>>> can block interrupts for up to 5 seconds.
>>>
>>> To prevent prolonged IRQ disable durations that may lead to watchdog
>>> timeouts, release the spinlock before invoking xhci_handshake() in
>>> xhci_abort_cmd_ring().
>>>
>>> The spinlock is reacquired after the handshake to continue with controller
>>> halt and recovery if needed.
>>
>> Is this a real issue?
>>
>> It should be extremely rare that commands need to be aborted, and even
>> less likely that it takes five seconds to stop the command ring.
>>
>> Can you take logs and traces of this (if it's reproducible).
>> I suspect there is some other issue that needs to be fixed.
>>
>> I agree that keeping the spin lock for up to five seconds isn't a good idea, but
>> just unlocking before the command ring has stopped opens up new race risks.
>>
>> We need to ensure that queuing a new command mid ring abortion, before ring stopped,
>> doesn't cause new issues, or that handling command completion events, including the
>> "stop command ring" event, before polling for a stopped ring works fine.
>>
> 
> Hi Mathias,
> 
> Yes, actually when we do usb headset connected, no audio playing, resume the device and do some operations like volume +/-, there will be xhci_handle_command_timeout.
> 
> logs:
> 
> It’s wdog bark, we can see : xhci_handle_command_timeout kicked in since 68346, but not end till 68356 dog bark.
> 
> [68346.109071][T25562] [2025:09:04 23:19:02](2)[25562:kworker/2:3]lwz debug: xhci_handle_command_timeout start
> [68346.109081][T25562] [2025:09:04 23:19:02](2)[25562:kworker/2:3]lwz debug: xhci_handle_command_timeout xhci_abort_cmd_ring start
> [68346.109083][T25562] [2025:09:04 23:19:02](2)[25562:kworker/2:3]lwz debug: xhci_abort_cmd_ring start
> [68346.109084][T25562] [2025:09:04 23:19:02](2)[25562:kworker/2:3]lwz debug: xhci_abort_cmd_ring xhci_trb_virt_to_dma
> [68346.109087][T25562] [2025:09:04 23:19:02](2)[25562:kworker/2:3]lwz debug: xhci_abort_cmd_ring xhci_handshake_check_state start
> ….
> [68356.215787][    C1] [2025:09:04 23:19:10](1)msm_watchdog 17600000.qcom,wdt: Causing a QCOM Apps Watchdog bite!
> [68356.215790][    C1] [2025:09:04 23:19:10](1)[RB/E]rb_sreason_str_set: sreason_str set wdog_bite
> 
> Call Trace:
> xhci_configure_endpoint hang in below call stack.
> 
> Stack:
> __switch_to+0x1a8
> schedule+0xacc
> schedule_timeout+0x4c
> wait_for_common+0xc0
> wait_for_completion+0x18
> xhci_configure_endpoint+0xe4
> xhci_check_bandwidth+0x1a8
> usb_hcd_alloc_bandwidth+0x37c
> usb_disable_device_endpoints+0xb0
> usb_disable_device+0x30
> usb_disconnect+0xd0
> hub_event+0xc4c
> process_scheduled_works+0x208
> worker_thread+0x1c0
> kthread+0xfc
> ret_from_fork+0x10
> 
> Finally crash on dog bite.
> 
> sdei_wdg_bite_cb[qcom_sdei]+0x110
> sdei_event_handler+0x3c
> do_sdei_event+0x88
> __sdei_handler+0x48
> __sdei_asm_handler+0x110
> __const_udelay+0x148
> xhci_handshake+0x58
> xhci_handle_command_timeout+0x248
> process_scheduled_works+0x208
> worker_thread+0x1c0
> kthread+0xfc
> ret_from_fork+0x10
> 
> Based on the logs and crash stacks I have collected so far, including the watchdog bark and the prolonged execution of xhci_handle_command_timeout,
> it seems likely that there's an underlying issue causing the ring to hang during xhci_configure_endpoint.
> 
> Thanks,
> Uttkarsh,
> 

Thanks Uttkarsh

Best guess is that this is somehow related to xHC suspend/resume as xHC and command processing
are stopped during suspend. xhci_suspend() also wipes the command ring clean of any pending
command TRBs without any warning.

A suspended host could also return 0xffffffff on register read, which in turn would cause
xhci_handshake(reg, CMD_RING_RUNNING, 0, 5sec) to run for full five seconds unsuccessfully
waiting for the  "command ring running" bit to clear.

Could you add xhci dynamic debug and tracing and reproduce this?
It could show how we end up in this situation.

mount -t debugfs none /sys/kernel/debug
echo 'module xhci_hcd =p' >/sys/kernel/debug/dynamic_debug/control
echo 'module usbcore =p' >/sys/kernel/debug/dynamic_debug/control
echo 81920 > /sys/kernel/debug/tracing/buffer_size_kb
echo 1 > /sys/kernel/debug/tracing/events/xhci-hcd/enable
echo 1 > /sys/kernel/debug/tracing/tracing_on
< Reproduce issue >
Send output of dmesg
Send content of /sys/kernel/debug/tracing/trace

I'll also write a testpatch with extra debugging related to command ring

Thanks
Mathias