Message-ID: <CA+zupgxkvm9HxG4Aj1avPA-ZgjVxmg3T3GtbfnV=rXk9P7-pFQ@mail.gmail.com>
Date: Thu, 22 May 2025 12:19:40 -0700
From: Roy Luo <royluo@...gle.com>
To: Mathias Nyman <mathias.nyman@...el.com>
Cc: Mathias Nyman <mathias.nyman@...ux.intel.com>, 
	Udipto Goswami <udipto.goswami@....qualcomm.com>, quic_ugoswami@...cinc.com, 
	Thinh.Nguyen@...opsys.com, gregkh@...uxfoundation.org, michal.pecio@...il.com, 
	linux-usb@...r.kernel.org, linux-kernel@...r.kernel.org, 
	stable@...r.kernel.org
Subject: Re: [PATCH v1] Revert "usb: xhci: Implement xhci_handshake_check_state()
 helper"

On Thu, May 22, 2025 at 5:24 AM Mathias Nyman <mathias.nyman@...el.com> wrote:
>
> On 22.5.2025 5.21, Roy Luo wrote:
> >>>> Udipto Goswami, can you recall the platforms that needed this workaround?
> >>>> and do we have an easy way to detect those?
> >>>
> >>> Hi Mathias,
> >>>
> >>> From what I recall, we saw this issue coming up on our QCOM mobile
> >>> platforms but it was not consistent. It was only reported in long runs,
> >>> I believe. The most recent instance when I pushed this patch was with
> >>> platform SM8650, it was a watchdog timeout issue where xhci_reset() ->
> >>> xhci_handshake() polling read timeout upon xhci remove. Unfortunately
> >>> I was not able to simulate the scenario for more granular testing and
> >>> had validated it with long hours stress testing.
> >>> The callstack was like so:
> >>>
> >>> Full call stack on core6:
> >>> -000|readl([X19] addr = 0xFFFFFFC03CC08020)
> >>> -001|xhci_handshake(inline)
> >>> -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000)
> >>> -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?)
> >>> -003|xhci_plat_runtime_resume([locdesc] dev = ?)
> >>> -004|pm_generic_runtime_resume([locdesc] dev = ?)
> >>> -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev = 0xFFFFFF890F619C10)
> >>> -006|rpm_callback(inline)
> >>> -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10, [NSD:0xFFFFFFC041453AD4] rpmflags = 4)
> >>> -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4)
> >>> -008|pm_runtime_get_sync(inline)
> >>> -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00)
> >>
> >> Thank you for clarifying this.
> >>
> >> So the patch avoids the long timeout by always cutting the xhci reinit
> >> path short in xhci_resume() if the resume was caused by the
> >> pm_runtime_get_sync() call in xhci_plat_remove():
> >>
> >> void xhci_plat_remove(struct platform_device *dev)
> >> {
> >>          xhci->xhc_state |= XHCI_STATE_REMOVING;
> >>          pm_runtime_get_sync(&dev->dev);
> >>          ...
> >> }
> >>
> >> I think we can revert this patch, and just make sure that we don't reset the
> >> host in the reinit path of xhci_resume() if XHCI_STATE_REMOVING is set.
> >> Just return immediately instead.
> >>
> >
> > Just to be sure, are you proposing that we skip xhci_reset() within the
> > reinit path of xhci_resume()? If we do that, could that lead to issues
> > with subsequent operations in the reinit sequence, such as xhci_init() or
> > xhci_run()?
>
> I suggest only skipping xhci_reset() in xhci_resume() if XHCI_STATE_REMOVING is set.
>
> This should be similar to what is going on already.
>
> xhci_reset() currently returns -ENODEV if XHCI_STATE_REMOVING is set, unless reset
> completes extremely fast. xhci_resume() bails out if xhci_reset() returns error:
>
> xhci_resume()
>    ...
>    if (power_lost) {
>      ...
>      retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC);
>      spin_unlock_irq(&xhci->lock);
>      if (retval)
>        return retval;
> >
> > Do you prefer to group the change to skip xhci_reset() within the reinit
> > path together with this revert? Or do you want it to be sent and reviewed
> > separately?
>
> First a patch that bails out from xhci_resume() if XHCI_STATE_REMOVING is set
> and we are in the reinit (power_lost) path about to call xhci_reset().
>
> Then a second patch that reverts 6ccb83d6c497 ("usb: xhci: Implement
> xhci_handshake_check_state() helper").
>
> Does this sound reasonable?
>
> This should avoid the QCOM 10sec watchdog issue, as the next xhci_reset()
> called in the xhci_remove() path has a short 250ms timeout, and ensure the
> SNPS DWC3 USB regression won't trigger.
>
> Thanks
> Mathias
>

Thanks for the clarification! SGTM.
I've sent out a new patchset accordingly:
https://lore.kernel.org/linux-usb/20250522190912.457583-1-royluo@google.com/

Thanks,
Roy
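
For readers following the thread, here is a minimal userspace sketch of the
first step discussed above: bailing out of the reinit (power_lost) path of
resume before the long reset when the controller is being removed. It is a
model of the control flow only and is not taken from the linked patchset.
All model_* names, the struct layout, and the -ENODEV return value are
assumptions made for illustration; the 10 s timeout mirrors timeout_us in
the call stack above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model constants; names echo the thread, values are
 * illustrative and not copied from the kernel headers. */
#define MODEL_STATE_REMOVING   (1u << 2)
#define MODEL_RESET_LONG_USEC  (10 * 1000 * 1000)  /* 10 s, as in the call stack */

/* Hypothetical stand-in for the host controller state. */
struct model_xhci {
	unsigned int xhc_state;             /* state flags */
	int reset_calls;                    /* how many times reset was attempted */
	unsigned long long last_timeout_us; /* timeout passed to the last reset */
};

/* Model of a reset: only records that it was called and with what timeout.
 * A real reset would poll a register for up to timeout_us. */
static int model_reset(struct model_xhci *xhci, unsigned long long timeout_us)
{
	xhci->reset_calls++;
	xhci->last_timeout_us = timeout_us;
	return 0;
}

/* Model of the resume path. In the reinit branch, return early when removal
 * is in progress, so the long reset is never started. The choice of -ENODEV
 * is an assumption, picked to match what the thread says reset returns today. */
static int model_resume(struct model_xhci *xhci, bool power_lost)
{
	int retval;

	if (power_lost) {
		if (xhci->xhc_state & MODEL_STATE_REMOVING)
			return -ENODEV;

		retval = model_reset(xhci, MODEL_RESET_LONG_USEC);
		if (retval)
			return retval;
		/* remaining reinit steps elided in this model */
	}
	return 0;
}
```

With the REMOVING flag set and power_lost true, model_resume() returns
without calling model_reset(), which is the behavior the first patch is
meant to guarantee regardless of how quickly the hardware would respond.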
