[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CA+zupgxkvm9HxG4Aj1avPA-ZgjVxmg3T3GtbfnV=rXk9P7-pFQ@mail.gmail.com>
Date: Thu, 22 May 2025 12:19:40 -0700
From: Roy Luo <royluo@...gle.com>
To: Mathias Nyman <mathias.nyman@...el.com>
Cc: Mathias Nyman <mathias.nyman@...ux.intel.com>,
Udipto Goswami <udipto.goswami@....qualcomm.com>, quic_ugoswami@...cinc.com,
Thinh.Nguyen@...opsys.com, gregkh@...uxfoundation.org, michal.pecio@...il.com,
linux-usb@...r.kernel.org, linux-kernel@...r.kernel.org,
stable@...r.kernel.org
Subject: Re: [PATCH v1] Revert "usb: xhci: Implement xhci_handshake_check_state()
helper"
On Thu, May 22, 2025 at 5:24 AM Mathias Nyman <mathias.nyman@...el.com> wrote:
>
> On 22.5.2025 5.21, Roy Luo wrote:
> >>>> Udipto Goswami, can you recall the platforms that needed this workaroud?
> >>>> and do we have an easy way to detect those?
> >>>
> >>> Hi Mathias,
> >>>
> >>> From what I recall, we saw this issue coming up on our QCOM mobile
> >>> platforms but it was not consistent. It was only reported in long runs
> >>> i believe. The most recent instance when I pushed this patch was with
> >>> platform SM8650, it was a watchdog timeout issue where xhci_reset() ->
> >>> xhci_handshake() polling read timeout upon xhci remove. Unfortunately
> >>> I was not able to simulate the scenario for more granular testing and
> >>> had validated it with long hours stress testing.
> >>> The callstack was like so:
> >>>
> >>> Full call stack on core6:
> >>> -000|readl([X19] addr = 0xFFFFFFC03CC08020)
> >>> -001|xhci_handshake(inline)
> >>> -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000)
> >>> -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?)
> >>> -003|xhci_plat_runtime_resume([locdesc] dev = ?)
> >>> -004|pm_generic_runtime_resume([locdesc] dev = ?)
> >>> -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev =
> >>> 0xFFFFFF890F619C10)
> >>> -006|rpm_callback(inline)
> >>> -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10,
> >>> [NSD:0xFFFFFFC041453AD4] rpmflags = 4)
> >>> -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4)
> >>> -008|pm_runtime_get_sync(inline)
> >>> -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00)
> >>
> >> Thank you for clarifying this.
> >>
> >> So patch avoids the long timeout by always cutting xhci reinit path short in
> >> xhci_resume() if resume was caused by pm_runtime_get_sync() call in
> >> xhci_plat_remove()
> >>
> >> void xhci_plat_remove(struct platform_device *dev)
> >> {
> >> xhci->xhc_state |= XHCI_STATE_REMOVING;
> >> pm_runtime_get_sync(&dev->dev);
> >> ...
> >> }
> >>
> >> I think we can revert this patch, and just make sure that we don't reset the
> >> host in the reinit path of xhci_resume() if XHCI_STATE_REMOVING is set.
> >> Just return immediately instead.
> >>
> >
> > Just to be sure, are you proposing that we skip xhci_reset() within
> > the reinit path
> > of xhci_resume()? If we do that, could that lead to issues with
> > subsequent operations
> > in the reinit sequence, such as xhci_init() or xhci_run()?
>
> I suggest to only skip xhci_reset in xhci_resume() if XHCI_STATE_REMOVING is set.
>
> This should be similar to what is going on already.
>
> xhci_reset() currently returns -ENODEV if XHCI_STATE_REMOVING is set, unless reset
> completes extremely fast. xhci_resume() bails out if xhci_reset() returns error:
>
> xhci_resume()
> ...
> if (power_lost) {
> ...
> retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC);
> spin_unlock_irq(&xhci->lock);
> if (retval)
> return retval;
> >
> > Do you prefer to group the change to skip xhci_reset() within the
> > reinit path together
> > with this revert? or do you want it to be sent and reviewed separately?
>
> First a patch that bails out from xhci_resume() if XHCI_STATE_REMOVING is set
> and we are in the reinit (power_lost) path about to call xhci_reset();
>
> Then a second patch that reverts 6ccb83d6c497 ("usb: xhci: Implement
> xhci_handshake_check_state()
>
> Does this sound reasonable?
>
> should avoid the QCOM 10sec watchdog issue as next xhci_rest() called
> in xhci_remove() path has a short 250ms timeout, and ensure the
> SNPS DWC3 USB regression won't trigger.
>
> Thanks
> Mathias
>
Thanks for the clarification! SGTM.
I've sent out a new patchset accordingly
https://lore.kernel.org/linux-usb/20250522190912.457583-1-royluo@google.com/
Thanks,
Roy
Powered by blists - more mailing lists