linux-kernel - Re: [PATCH] usb:xhci: Avoid hub_event() stuck when xHC restore state timeout

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <TYZPR01MB47847CC913E498B31755C9A8D58AA@TYZPR01MB4784.apcprd01.prod.exchangelabs.com>
Date:   Fri, 8 Dec 2023 17:28:20 +0800
From:   Yaxiong Tian <iambestgod@...look.com>
To:     Mathias Nyman <mathias.nyman@...ux.intel.com>,
        mathias.nyman@...el.com
Cc:     linux-usb@...r.kernel.org, linux-kernel@...r.kernel.org,
        tianyaxiong@...inos.cn, gregkh@...uxfoundation.org
Subject: Re: [PATCH] usb:xhci: Avoid hub_event() stuck when xHC restore state
 timeout


在 2023/12/5 22:58, Mathias Nyman 写道:
> On 4.12.2023 10.02, Yaxiong Tian wrote:
>> From: Yaxiong Tian <tianyaxiong@...inos.cn>
>>
>> when xHc restore state timeout,the xhci_reusme() return -ETIMEDOUT
>
> Out of curiosity, have you tried if it still is possible to revive your
> xHC controller here?
>
> Instead of returning -ETIMEDOUT, try setting " reinit_xhc = true", and
> jump to  "if (reinit_xhc) {"  where we reinitialize xHC in 
> xhci_resume() due
> to other resume issues.
Yes it is works.
>
>> instantly. After usb_hc_died() called ,they kick hub_wq to running
>> hub_event() but the wq is freezd. When suspend ends,hub_evnet realy
>> running and sticking.
>> Such as:
>> [  968.794016][ 2] [   T37] "echo 0 > 
>> /proc/sys/kernel/hung_task_timeout_secs"
>> disables this message.
>> [  968.802969][ 2] [   T37] kworker/2:3     D    0   999      2 
>> 0x00000028
>> [  968.809579][ 2] [   T37] Workqueue: usb_hub_wq hub_event
>> [  968.814885][ 2] [   T37] Call trace:
>> [  968.818455][ 2] [   T37]  __switch_to+0xd4/0x138
>> [  968.823067][ 2] [   T37]  __schedule+0x2dc/0x6a0
>> [  968.827680][ 2] [   T37]  schedule+0x34/0xb0
>> [  968.831947][ 2] [   T37]  schedule_timeout+0x1e0/0x298
>> [  968.837079][ 2] [   T37]  __wait_for_common+0xf0/0x208
>> [  968.842212][ 2] [   T37]  wait_for_completion+0x1c/0x28
>> [  968.847432][ 2] [   T37]  xhci_configure_endpoint+0x104/0x640
>> [  968.853173][ 2] [   T37]  xhci_check_bandwidth+0x140/0x2e0
>> [  968.858652][ 2] [   T37]  usb_hcd_alloc_bandwidth+0x1c8/0x348
>> [  968.864393][ 2] [   T37]  usb_disable_device+0x198/0x260
>> [  968.869698][ 2] [   T37]  usb_disconnect+0xdc/0x3a0
>> [  968.874571][ 2] [   T37]  usb_disconnect+0xbc/0x3a0
>> [  968.879441][ 2] [   T37]  hub_quiesce+0xa0/0x108
>> [  968.884053][ 2] [   T37]  hub_event+0x4d4/0x1558
>> [  968.888664][ 2] [   T37]  kretprobe_trampoline+0x0/0xc4
>> [  968.893884][ 2] [   T37]  worker_thread+0x4c/0x488
>> [  968.898668][ 2] [   T37]  kthread+0xf8/0x128
>> [  968.902933][ 2] [   T37]  ret_from_fork+0x10/0x18
>>
>> The result is that you cannot suspend again.because the wq can't
>> be freezed.Also hard to reboot,when some application visited this
>> piece.
>>
>> The reason of stuck is that some access related to xhci hardware
>> is being called.But xhci has problem,at least not running.(
>> when xhci_restore_registers(),the xhci will load op_regs.The
>> CMD_RUN will clear in xhci_suspend().)
>>
>
> Nice catch and debugging work btw.
>
Thanks, I further debug found that the command is Infinite 
retransmission and xHc is
halted. So the __wait_for_common() never complete,because xHc is halted 
and others
don't kown.
Note:For fast reproduction  I force  return -ETIMEDOUT  by setting
      if (1 || xhci_handshake(&xhci->op_regs->status,STS_RESTORE, 0, 100 
* 1000)) ）

[  131.024013][ 3] [  T135] xhci_hcd 0000:06:00.0: Command timeout, 
USBSTS: 0x00000001 HCHalted
[  131.032447][ 3] [  T135] xhci_hcd 0000:06:00.0: Command timeout on 
stopped ring
[  131.039745][ 3] [  T135] xhci_hcd 0000:06:00.0: Turn aborted command 
00000000620a076c to no-op
[  131.048346][ 3] [  T135] xhci_hcd 0000:06:00.0: // Ding dong

>> So using XHCI_STATE_DYING flag,to avoid any code to touching
>> hardware immediately.hub_event() will complete.The usb_hc_died
>> tasks will be completed and some sys interfaces will be removed.
>
> The XHCI_STATE_DYING flag is currently only set in xhci_hc_died().
> So when this flag is set we could assume that the command ring and
> pending URBs are, or will be cleaned up. This would change with your 
> patch.
>
I agree it.
> We might need some other solution,
> Maybe set the set_bit(HCD_FLAG_HW_ACCESSIBLE, &hcd->flags) after
> CNR (Contorller Not Ready) is successfully cleared, and controller is
> actually accessible.
>
> We then would need to add checks to see if controller is accessible
> before queuing any commands to xHC hardware.
>
I found another fix based on the debugging results。In xhci_suspend() set
xhci->xhc_state |= XHCI_STATE_HALTED and in xhci_resume() clear it.
like it.

------>8-----
diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c
index 08fbabff23a0..f364958c237a 100644
--- a/drivers/usb/host/xhci.c
+++ b/drivers/usb/host/xhci.c
@@ -928,6 +928,8 @@ int xhci_suspend(struct xhci_hcd *xhci, bool do_wakeup)
          spin_unlock_irq(&xhci->lock);
          return -ETIMEDOUT;
      }
+    xhci->xhc_state |= XHCI_STATE_HALTED;
+
      xhci_clear_command_ring(xhci);

      /* step 3: save registers */
@@ -1119,8 +1121,14 @@ int xhci_resume(struct xhci_hcd *xhci, 
pm_message_t msg)
      command = readl(&xhci->op_regs->command);
      command |= CMD_RUN;
      writel(command, &xhci->op_regs->command);
-    xhci_handshake(&xhci->op_regs->status, STS_HALT,
-          0, 250 * 1000);
+    if(xhci_handshake(&xhci->op_regs->status, STS_HALT,
+          0, 250 * 1000)){
+        xhci_warn(xhci, "WARN: xHC CMD_RUN timeout\n");
+        spin_unlock_irq(&xhci->lock);
+        return -ETIMEDOUT;
+    };
+
+    xhci->xhc_state &= ~XHCI_STATE_HALTED;

      /* step 5: walk topology and initialize portsc,
       * portpmsc and portli
----8<-------------

> Thanks
> Mathias
Thanks