Message-ID: <20250520003201.57f12dff@foxbook>
Date: Tue, 20 May 2025 00:32:01 +0200
From: Michał Pecio <michal.pecio@...il.com>
To: Udipto Goswami <udipto.goswami@....qualcomm.com>
Cc: Mathias Nyman <mathias.nyman@...ux.intel.com>, Roy Luo
<royluo@...gle.com>, mathias.nyman@...el.com, quic_ugoswami@...cinc.com,
Thinh.Nguyen@...opsys.com, gregkh@...uxfoundation.org,
linux-usb@...r.kernel.org, linux-kernel@...r.kernel.org,
stable@...r.kernel.org
Subject: Re: [PATCH v1] Revert "usb: xhci: Implement
xhci_handshake_check_state() helper"
On Mon, 19 May 2025 23:43:21 +0530, Udipto Goswami wrote:
> Hi Mathias,
>
> From what I recall, we saw this issue on our QCOM mobile platforms,
> but it was not consistent; I believe it was only reported in long
> runs. The most recent instance, when I pushed this patch, was on the
> SM8650 platform: a watchdog timeout where the xhci_handshake()
> polling read in xhci_reset() timed out upon xhci removal.
Was it some system-wide watchdog, i.e. unrelated tasks were locking up?
It looks similar to that command abort freeze: xhci_resume() calls
xhci_reset() under xhci->lock, and xhci_handshake() spins for a few
seconds with the spinlock held. Anything else (workers, IRQs) trying
to grab the lock will also spin and delay unrelated things.
Not sure why your commit message says "Use this helper in places where
xhci_handshake is called unlocked and has a long timeout", because you
end up calling it from two places where the lock is (incorrectly) held.
That's why adding the early bailout helped, I guess.
> Unfortunately I was not able to simulate the scenario for more
> granular testing and had validated it with long hours stress testing.
Looking at xhci_resume(), it will call xhci_reset() if the controller
has known bugs (e.g. the RESET_ON_RESUME quirk) or it fails resuming.
I guess you could simulate this case by forcing the quirk with a module
parameter and adding some extra delay to xhci_handshake(), so you are
not dependent on the hardware actually failing in any manner.
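For the quirk part, something along these lines should do (the
xhci-hcd module has a "quirks" bit-flags parameter; the bit value
below is a placeholder, look up XHCI_RESET_ON_RESUME in
drivers/usb/host/xhci.h for your tree):

```shell
# Force RESET_ON_RESUME so xhci_resume() always takes the reset path,
# independent of the hardware actually misbehaving.
modprobe -r xhci-plat-hcd xhci-hcd
modprobe xhci-hcd quirks=0x80   # placeholder bit, check xhci.h
```

Then add an mdelay() or similar inside the xhci_handshake() loop to
stretch the window.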
> Full call stack on core6:
> -000|readl([X19] addr = 0xFFFFFFC03CC08020)
> -001|xhci_handshake(inline)
> -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000)
> -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?)
> -003|xhci_plat_runtime_resume([locdesc] dev = ?)
> -004|pm_generic_runtime_resume([locdesc] dev = ?)
> -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev =
> 0xFFFFFF890F619C10)
> -006|rpm_callback(inline)
> -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10,
> [NSD:0xFFFFFFC041453AD4] rpmflags = 4)
> -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4)
> -008|pm_runtime_get_sync(inline)
> -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00)
> -009|platform_remove([X19] _dev = 0xFFFFFF890F619C10)
> -010|device_remove(inline)