Message-ID: <855ed817-03da-4e6f-9c4a-898edfcafedd@linux.intel.com>
Date: Tue, 4 Mar 2025 13:20:55 +0200
From: Mathias Nyman <mathias.nyman@...ux.intel.com>
To: Michal Pecio <michal.pecio@...il.com>,
 Mathias Nyman <mathias.nyman@...el.com>,
 Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Cc: linux-usb@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH] usb: xhci: Fix host controllers "dying" after suspend and
 resume

On 4.3.2025 9.51, Michal Pecio wrote:
> A recent cleanup went a bit too far and dropped clearing the cycle bit
> of link TRBs, so it stays different from the rest of the ring half of
> the time. Then a race occurs: if the xHC reaches such link TRB before
> more commands are queued, the link's cycle bit unintentionally matches
> the xHC's cycle so it follows the link and waits for further commands.
> If more commands are queued before the xHC gets there, inc_enq() flips
> the bit so the xHC later sees a mismatch and stops executing commands.
> 
> This function is called before suspend, and 50% of the time the xHC is
> doomed to get stuck sooner or later after resume. Then some Stop Endpoint
> command fails to complete in 5 seconds and this shows up:
> 
> xhci_hcd 0000:00:10.0: xHCI host not responding to stop endpoint command
> xhci_hcd 0000:00:10.0: xHCI host controller not responding, assume dead
> xhci_hcd 0000:00:10.0: HC died; cleaning up
> 
> followed by loss of all USB devices on the affected bus. That's if you
> are lucky, because if Set Deq gets stuck instead, the failure is silent.
> 
> Likely responsible for kernel bug 219824. I found this while searching
> for possible causes of that regression and reproduced it locally before
> hearing back from the reporter. To repro, simply wait for link cycle to
> become set (debugfs), then suspend, resume and wait. To accelerate the
> failure I used a script which repeatedly starts and stops a UVC camera.
> 
> Some HCs get fully reinitialized on resume and they are not affected.
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=219824
> Fixes: 36b972d4b7ce ("usb: xhci: improve xhci_clear_command_ring()")
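
The race described in the quoted text can be illustrated with a toy model. This is only a sketch under stated assumptions, not the xhci driver code: a single-segment ring of four TRBs (three command slots plus one link TRB that toggles the cycle), a driver that flips the link TRB's cycle bit when it wraps (as the message describes for inc_enq()), and an xHC that only executes a TRB whose cycle bit equals its own cycle state. All names below (Ring, cleared_ring, queue_command, run_xhc, scenario) are hypothetical.

```python
from dataclasses import dataclass

RING_SIZE = 4          # hypothetical: 3 command slots + 1 link TRB
LINK = RING_SIZE - 1   # index of the link TRB


@dataclass
class Ring:
    cycle: list        # cycle bit of each TRB; the last entry is the link TRB
    enq: int = 0       # driver's enqueue index
    pcs: int = 1       # driver's (producer) cycle state
    deq: int = 0       # xHC's dequeue index
    ccs: int = 1       # xHC's (consumer) cycle state
    done: int = 0      # commands the xHC has executed


def cleared_ring(link_cycle):
    # Ring state after a clear: command TRBs zeroed, indexes reset,
    # both cycle states back to 1. link_cycle models whether the link
    # TRB's cycle bit was cleared (0) or left stale (1).
    return Ring(cycle=[0] * LINK + [link_cycle])


def queue_command(r):
    # Driver side: when the enqueue index sits on the link TRB, flip
    # (not set) its cycle bit, toggle the producer cycle state and wrap.
    if r.enq == LINK:
        r.cycle[LINK] ^= 1
        r.pcs ^= 1
        r.enq = 0
    r.cycle[r.enq] = r.pcs
    r.enq += 1


def run_xhc(r):
    # xHC side: consume TRBs while their cycle bit matches the consumer
    # cycle state; following the link TRB toggles that state.
    while r.cycle[r.deq] == r.ccs:
        if r.deq == LINK:
            r.ccs ^= 1
            r.deq = 0
        else:
            r.done += 1
            r.deq += 1


def scenario(link_cycle, batches):
    # Queue commands in batches, letting the xHC run after each batch.
    # Returns (commands queued, commands executed).
    r = cleared_ring(link_cycle)
    queued = 0
    for n in batches:
        for _ in range(n):
            queue_command(r)
            queued += 1
        run_xhc(r)
    return queued, r.done


# Stale link bit, xHC reaches the link before more commands are queued:
print(scenario(1, [3, 1]))
# Stale link bit, driver wraps before the xHC reaches the link:
print(scenario(1, [2, 2]))
# Link bit cleared together with the rest of the ring, same orderings:
print(scenario(0, [3, 1]), scenario(0, [2, 2]))
```

In this model the second scenario is the only one in which the xHC executes fewer commands than were queued: it stops at the link TRB, whose flipped cycle bit no longer matches its cycle state. The first scenario works only because the stale bit happens to match. With the link bit cleared, both orderings complete.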

Very nice debugging, did not suspect or consider that.

Thanks
Mathias

