Message-ID: <CAMTwNXBkAVjwaERAu-UHEHmH-BNe7T3iRfntLw+076g1OWgrPA@mail.gmail.com>
Date: Tue, 20 May 2025 18:00:44 +0530
From: Udipto Goswami <udipto.goswami@....qualcomm.com>
To: Michał Pecio <michal.pecio@...il.com>
Cc: Mathias Nyman <mathias.nyman@...ux.intel.com>, Roy Luo <royluo@...gle.com>,
mathias.nyman@...el.com, quic_ugoswami@...cinc.com,
Thinh.Nguyen@...opsys.com, gregkh@...uxfoundation.org,
linux-usb@...r.kernel.org, linux-kernel@...r.kernel.org,
stable@...r.kernel.org
Subject: Re: [PATCH v1] Revert "usb: xhci: Implement xhci_handshake_check_state()
helper"
On Tue, May 20, 2025 at 4:02 AM Michał Pecio <michal.pecio@...il.com> wrote:
>
> On Mon, 19 May 2025 23:43:21 +0530, Udipto Goswami wrote:
> > Hi Mathias,
> >
> > From what I recall, we saw this issue coming up on our QCOM mobile
> > platforms but it was not consistent. It was only reported in long runs
> > i believe. The most recent instance when I pushed this patch was with
> > platform SM8650, it was a watchdog timeout issue where xhci_reset() ->
> > xhci_handshake() polling read timeout upon xhci remove.
>
> Was it some system-wide watchdog, i.e. unrelated tasks were locking up?
Hi Michal,
Not exactly, the other tasks were not stuck; it was only the readl we
do as part of xhci_handshake, with its 10 sec timeout. Our watchdog
barks at 10 sec, and within that window it never made progress, i.e.
readl_poll_timeout_atomic was still polling. Since the handshake
timeout is exactly aligned with the watchdog timer here, the watchdog
fired and it crashed.
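For reference, this is roughly what the handshake boils down to
(paraphrased from drivers/usb/host/xhci.c, details may differ between
kernel versions):

int xhci_handshake(void __iomem *ptr, u32 mask, u32 done, u64 timeout_us)
{
	u32 result;
	int ret;

	/* Spin (in atomic context) until the masked bits match the
	 * expected value, the register reads all-ones (card removed),
	 * or timeout_us expires.
	 */
	ret = readl_poll_timeout_atomic(ptr, result,
					(result & mask) == done ||
					result == U32_MAX,
					1, timeout_us);
	if (result == U32_MAX)		/* card removed */
		return -ENODEV;

	return ret;
}

With xhci_reset() passing a 10 second timeout, a register that never
changes keeps us in this loop right up to the point where our watchdog
fires.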
> It looks similar to that command abort freeze: xhci_resume() calls
> xhci_reset() under xhci->lock, and xhci_handshake() spins for a few
> seconds with the spinlock held. Anything else (workers, IRQs) trying
> to grab the lock will also spin and delay unrelated things.
>
> Not sure why your commit message says "Use this helper in places where
> xhci_handshake is called unlocked and has a long timeout", because you
> end up calling it from two places where the lock is (incorrectly) held.
> That's why adding the early bailout helped, I guess.
>
I think we had re-worded the patch a little; this was my original
commit message:

"In some situations where xhci removal happens in parallel to
xhci_handshake, we encounter a scenario where the xhci_handshake
fails because the status does not change for the entire duration of
polling. This causes the xhci_handshake to time out, resulting in a
long wait which might lead to a watchdog timeout."

The API handles the command timeout which may happen upon xHCI stack
removal: check the xhci state and exit the handshake if the xhci is
being removed.
https://lore.kernel.org/all/20230919085847.8210-1-quic_ugoswami@quicinc.com/
But yeah, the main motive was to bail out of the handshake to get
around this. So you could say my main problem was that CMD_RESET was
stuck for a long time. In other cases the reset completes in a very
short amount of time; it was unusual in this case.
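For anyone following along, the helper being reverted was essentially
the same poll with an early exit on xhci->xhc_state, roughly along
these lines (quoting from memory, the exact form in the tree may
differ):

static int xhci_handshake_check_state(struct xhci_hcd *xhci,
		void __iomem *ptr, u32 mask, u32 done, int usec)
{
	u32 result;
	int ret;

	/* Same polling as xhci_handshake, but also give up once the
	 * xHC is flagged as being removed, instead of spinning until
	 * the full timeout expires.
	 */
	ret = readl_poll_timeout_atomic(ptr, result,
				(result & mask) == done ||
				result == U32_MAX ||
				xhci->xhc_state & XHCI_STATE_REMOVING,
				1, usec);

	if (result == U32_MAX || xhci->xhc_state & XHCI_STATE_REMOVING)
		return -ENODEV;

	return ret;
}

i.e. instead of waiting out the full 10 sec, it gives up as soon as
XHCI_STATE_REMOVING is set.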
> > Unfortunately I was not able to simulate the scenario for more
> > granular testing and had validated it with long hours stress testing.
>
> Looking at xhci_resume(), it will call xhci_reset() if the controller
> has known bugs (e.g. the RESET_ON_RESUME quirk) or it fails resuming.
>
> I guess you could simulate this case by forcing the quirk with a module
> parameter and adding some extra delay to xhci_handshake(), so you are
> not dependent on the hardware actually failing in any manner.
This would definitely make the watchdog bark, but like I mentioned, in
my case CMD_RESET didn't clear within the 10 sec poll.
I'm not sure if that points to the controller being stuck processing
something else?
Please correct me if I'm wrong here.
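If I understand your suggestion, something like the debug-only hack
below would approximate my case (the debug_force_timeout knob is made
up for illustration; IIRC xhci-hcd already has a 'quirks' module
parameter that could be used to force XHCI_RESET_ON_RESUME):

/* Debug-only sketch, not for merging: make the handshake behave as if
 * the controller never completes CMD_RESET, so the poll runs for the
 * full timeout and the watchdog window is exercised on every resume.
 */
static bool debug_force_timeout;
module_param(debug_force_timeout, bool, 0644);

int xhci_handshake(void __iomem *ptr, u32 mask, u32 done, u64 timeout_us)
{
	u32 result;
	int ret;

	ret = readl_poll_timeout_atomic(ptr, result,
					!debug_force_timeout &&
					(result & mask) == done,
					1, timeout_us);
	return ret;
}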
Thanks,
-Udipto