Message-ID: <20250520003201.57f12dff@foxbook>
Date: Tue, 20 May 2025 00:32:01 +0200
From: Michał Pecio <michal.pecio@...il.com>
To: Udipto Goswami <udipto.goswami@....qualcomm.com>
Cc: Mathias Nyman <mathias.nyman@...ux.intel.com>, Roy Luo
<royluo@...gle.com>, mathias.nyman@...el.com, quic_ugoswami@...cinc.com,
Thinh.Nguyen@...opsys.com, gregkh@...uxfoundation.org,
linux-usb@...r.kernel.org, linux-kernel@...r.kernel.org,
stable@...r.kernel.org
Subject: Re: [PATCH v1] Revert "usb: xhci: Implement
xhci_handshake_check_state() helper"
On Mon, 19 May 2025 23:43:21 +0530, Udipto Goswami wrote:
> Hi Mathias,
>
> From what I recall, we saw this issue on our QCOM mobile platforms,
> but it was not consistent; I believe it was only reported in long
> runs. The most recent instance, when I pushed this patch, was on the
> SM8650 platform: a watchdog timeout where the xhci_handshake()
> polling read in xhci_reset() timed out upon xhci removal.
Was it some system-wide watchdog, i.e. unrelated tasks were locking up?
It looks similar to that command abort freeze: xhci_resume() calls
xhci_reset() under xhci->lock, and xhci_handshake() spins for a few
seconds with the spinlock held. Anything else (workers, IRQs) trying
to grab the lock will also spin and delay unrelated things.
Not sure why your commit message says "Use this helper in places where
xhci_handshake is called unlocked and has a long timeout", because you
end up calling it from two places where the lock is (incorrectly) held.
That's why adding the early bailout helped, I guess.
> Unfortunately I was not able to simulate the scenario for more
> granular testing and had validated it with long hours stress testing.
Looking at xhci_resume(), it will call xhci_reset() if the controller
has known bugs (e.g. the RESET_ON_RESUME quirk) or it fails resuming.
I guess you could simulate this case by forcing the quirk with a module
parameter and adding some extra delay to xhci_handshake(), so you are
not dependent on the hardware actually failing in any manner.
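For the quirk part, something along these lines should do (the
xhci-hcd module has a "quirks" bit-flags parameter; the bit value
below is a placeholder, look up XHCI_RESET_ON_RESUME in
drivers/usb/host/xhci.h for your tree):

```shell
# Force RESET_ON_RESUME so xhci_resume() always takes the reset path,
# independent of the hardware actually misbehaving.
modprobe -r xhci-plat-hcd xhci-hcd
modprobe xhci-hcd quirks=0x80   # placeholder bit, check xhci.h
```

Then add an mdelay() or similar inside the xhci_handshake() loop to
stretch the window.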
> Full call stack on core6:
> -000|readl([X19] addr = 0xFFFFFFC03CC08020)
> -001|xhci_handshake(inline)
> -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000)
> -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?)
> -003|xhci_plat_runtime_resume([locdesc] dev = ?)
> -004|pm_generic_runtime_resume([locdesc] dev = ?)
> -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev =
> 0xFFFFFF890F619C10)
> -006|rpm_callback(inline)
> -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10,
> [NSD:0xFFFFFFC041453AD4] rpmflags = 4)
> -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4)
> -008|pm_runtime_get_sync(inline)
> -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00)
> -009|platform_remove([X19] _dev = 0xFFFFFF890F619C10)
> -010|device_remove(inline)