[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <TYZPR01MB47847CC913E498B31755C9A8D58AA@TYZPR01MB4784.apcprd01.prod.exchangelabs.com>
Date: Fri, 8 Dec 2023 17:28:20 +0800
From: Yaxiong Tian <iambestgod@...look.com>
To: Mathias Nyman <mathias.nyman@...ux.intel.com>,
mathias.nyman@...el.com
Cc: linux-usb@...r.kernel.org, linux-kernel@...r.kernel.org,
tianyaxiong@...inos.cn, gregkh@...uxfoundation.org
Subject: Re: [PATCH] usb:xhci: Avoid hub_event() stuck when xHC restore state
timeout
在 2023/12/5 22:58, Mathias Nyman 写道:
> On 4.12.2023 10.02, Yaxiong Tian wrote:
>> From: Yaxiong Tian <tianyaxiong@...inos.cn>
>>
>> when xHc restore state timeout,the xhci_reusme() return -ETIMEDOUT
>
> Out of curiosity, have you tried if it still is possible to revive your
> xHC controller here?
>
> Instead of returning -ETIMEDOUT, try setting " reinit_xhc = true", and
> jump to "if (reinit_xhc) {" where we reinitialize xHC in
> xhci_resume() due
> to other resume issues.
Yes it is works.
>
>> instantly. After usb_hc_died() called ,they kick hub_wq to running
>> hub_event() but the wq is freezd. When suspend ends,hub_evnet realy
>> running and sticking.
>> Such as:
>> [ 968.794016][ 2] [ T37] "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [ 968.802969][ 2] [ T37] kworker/2:3 D 0 999 2
>> 0x00000028
>> [ 968.809579][ 2] [ T37] Workqueue: usb_hub_wq hub_event
>> [ 968.814885][ 2] [ T37] Call trace:
>> [ 968.818455][ 2] [ T37] __switch_to+0xd4/0x138
>> [ 968.823067][ 2] [ T37] __schedule+0x2dc/0x6a0
>> [ 968.827680][ 2] [ T37] schedule+0x34/0xb0
>> [ 968.831947][ 2] [ T37] schedule_timeout+0x1e0/0x298
>> [ 968.837079][ 2] [ T37] __wait_for_common+0xf0/0x208
>> [ 968.842212][ 2] [ T37] wait_for_completion+0x1c/0x28
>> [ 968.847432][ 2] [ T37] xhci_configure_endpoint+0x104/0x640
>> [ 968.853173][ 2] [ T37] xhci_check_bandwidth+0x140/0x2e0
>> [ 968.858652][ 2] [ T37] usb_hcd_alloc_bandwidth+0x1c8/0x348
>> [ 968.864393][ 2] [ T37] usb_disable_device+0x198/0x260
>> [ 968.869698][ 2] [ T37] usb_disconnect+0xdc/0x3a0
>> [ 968.874571][ 2] [ T37] usb_disconnect+0xbc/0x3a0
>> [ 968.879441][ 2] [ T37] hub_quiesce+0xa0/0x108
>> [ 968.884053][ 2] [ T37] hub_event+0x4d4/0x1558
>> [ 968.888664][ 2] [ T37] kretprobe_trampoline+0x0/0xc4
>> [ 968.893884][ 2] [ T37] worker_thread+0x4c/0x488
>> [ 968.898668][ 2] [ T37] kthread+0xf8/0x128
>> [ 968.902933][ 2] [ T37] ret_from_fork+0x10/0x18
>>
>> The result is that you cannot suspend again.because the wq can't
>> be freezed.Also hard to reboot,when some application visited this
>> piece.
>>
>> The reason of stuck is that some access related to xhci hardware
>> is being called.But xhci has problem,at least not running.(
>> when xhci_restore_registers(),the xhci will load op_regs.The
>> CMD_RUN will clear in xhci_suspend().)
>>
>
> Nice catch and debugging work btw.
>
Thanks, I further debug found that the command is Infinite
retransmission and xHc is
halted. So the __wait_for_common() never complete,because xHc is halted
and others
don't kown.
Note:For fast reproduction I force return -ETIMEDOUT by setting
if (1 || xhci_handshake(&xhci->op_regs->status,STS_RESTORE, 0, 100
* 1000)) )
[ 131.024013][ 3] [ T135] xhci_hcd 0000:06:00.0: Command timeout,
USBSTS: 0x00000001 HCHalted
[ 131.032447][ 3] [ T135] xhci_hcd 0000:06:00.0: Command timeout on
stopped ring
[ 131.039745][ 3] [ T135] xhci_hcd 0000:06:00.0: Turn aborted command
00000000620a076c to no-op
[ 131.048346][ 3] [ T135] xhci_hcd 0000:06:00.0: // Ding dong
>> So using XHCI_STATE_DYING flag,to avoid any code to touching
>> hardware immediately.hub_event() will complete.The usb_hc_died
>> tasks will be completed and some sys interfaces will be removed.
>
> The XHCI_STATE_DYING flag is currently only set in xhci_hc_died().
> So when this flag is set we could assume that the command ring and
> pending URBs are, or will be cleaned up. This would change with your
> patch.
>
I agree it.
> We might need some other solution,
> Maybe set the set_bit(HCD_FLAG_HW_ACCESSIBLE, &hcd->flags) after
> CNR (Contorller Not Ready) is successfully cleared, and controller is
> actually accessible.
>
> We then would need to add checks to see if controller is accessible
> before queuing any commands to xHC hardware.
>
I found another fix based on the debugging results。In xhci_suspend() set
xhci->xhc_state |= XHCI_STATE_HALTED and in xhci_resume() clear it.
like it.
------>8-----
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 08fbabff23a0..f364958c237a 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -928,6 +928,8 @@ int xhci_suspend(struct xhci_hcd *xhci, bool do_wakeup)
spin_unlock_irq(&xhci->lock);
return -ETIMEDOUT;
}
+ xhci->xhc_state |= XHCI_STATE_HALTED;
+
xhci_clear_command_ring(xhci);
/* step 3: save registers */
@@ -1119,8 +1121,14 @@ int xhci_resume(struct xhci_hcd *xhci,
pm_message_t msg)
command = readl(&xhci->op_regs->command);
command |= CMD_RUN;
writel(command, &xhci->op_regs->command);
- xhci_handshake(&xhci->op_regs->status, STS_HALT,
- 0, 250 * 1000);
+ if(xhci_handshake(&xhci->op_regs->status, STS_HALT,
+ 0, 250 * 1000)){
+ xhci_warn(xhci, "WARN: xHC CMD_RUN timeout\n");
+ spin_unlock_irq(&xhci->lock);
+ return -ETIMEDOUT;
+ };
+
+ xhci->xhc_state &= ~XHCI_STATE_HALTED;
/* step 5: walk topology and initialize portsc,
* portpmsc and portli
----8<-------------
> Thanks
> Mathias
Thanks
Powered by blists - more mailing lists