linux-kernel - Re: [PATCH v1] Revert "usb: xhci: Implement xhci_handshake_check

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CA+zupgyU2czaczPcqavYBi=NrPqKqgp7SbrUocy0qbJ0m9np6g@mail.gmail.com>
Date: Wed, 21 May 2025 19:21:56 -0700
From: Roy Luo <royluo@...gle.com>
To: Mathias Nyman <mathias.nyman@...ux.intel.com>
Cc: Udipto Goswami <udipto.goswami@....qualcomm.com>, mathias.nyman@...el.com, 
	quic_ugoswami@...cinc.com, Thinh.Nguyen@...opsys.com, 
	gregkh@...uxfoundation.org, michal.pecio@...il.com, linux-usb@...r.kernel.org, 
	linux-kernel@...r.kernel.org, stable@...r.kernel.org
Subject: Re: [PATCH v1] Revert "usb: xhci: Implement xhci_handshake_check_state()
 helper"

On Tue, May 20, 2025 at 9:18 AM Mathias Nyman
<mathias.nyman@...ux.intel.com> wrote:
>
> On 19.5.2025 21.13, Udipto Goswami wrote:
> > On Mon, May 19, 2025 at 6:23 PM Mathias Nyman
> > <mathias.nyman@...ux.intel.com> wrote:
> >>
> >> On 17.5.2025 7.39, Roy Luo wrote:
> >>> This reverts commit 6ccb83d6c4972ebe6ae49de5eba051de3638362c.
> >>>
> >>> Commit 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state()
> >>> helper") was introduced to workaround watchdog timeout issues on some
> >>> platforms, allowing xhci_reset() to bail out early without waiting
> >>> for the reset to complete.
> >>>
> >>> Skipping the xhci handshake during a reset is a dangerous move. The
> >>> xhci specification explicitly states that certain registers cannot
> >>> be accessed during reset in section 5.4.1 USB Command Register (USBCMD),
> >>> Host Controller Reset (HCRST) field:
> >>> "This bit is cleared to '0' by the Host Controller when the reset
> >>> process is complete. Software cannot terminate the reset process
> >>> early by writinga '0' to this bit and shall not write any xHC
> >>> Operational or Runtime registers until while HCRST is '1'."
> >>>
> >>> This behavior causes a regression on SNPS DWC3 USB controller with
> >>> dual-role capability. When the DWC3 controller exits host mode and
> >>> removes xhci while a reset is still in progress, and then tries to
> >>> configure its hardware for device mode, the ongoing reset leads to
> >>> register access issues; specifically, all register reads returns 0.
> >>> These issues extend beyond the xhci register space (which is expected
> >>> during a reset) and affect the entire DWC3 IP block, causing the DWC3
> >>> device mode to malfunction.
> >>
> >> I agree with you and Thinh that waiting for the HCRST bit to clear during
> >> reset is the right thing to do, especially now when we know skipping it
> >> causes issues for SNPS DWC3, even if it's only during remove phase.
> >>
> >> But reverting this patch will re-introduce the issue originally worked
> >> around by Udipto Goswami, causing regression.
> >>
> >> Best thing to do would be to wait for HCRST to clear for all other platforms
> >> except the one with the issue.
> >>
> >> Udipto Goswami, can you recall the platforms that needed this workaroud?
> >> and do we have an easy way to detect those?
> >
> > Hi Mathias,
> >
> >  From what I recall, we saw this issue coming up on our QCOM mobile
> > platforms but it was not consistent. It was only reported in long runs
> > i believe. The most recent instance when I pushed this patch was with
> > platform SM8650, it was a watchdog timeout issue where xhci_reset() ->
> > xhci_handshake() polling read timeout upon xhci remove. Unfortunately
> > I was not able to simulate the scenario for more granular testing and
> > had validated it with long hours stress testing.
> > The callstack was like so:
> >
> > Full call stack on core6:
> > -000|readl([X19] addr = 0xFFFFFFC03CC08020)
> > -001|xhci_handshake(inline)
> > -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000)
> > -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?)
> > -003|xhci_plat_runtime_resume([locdesc] dev = ?)
> > -004|pm_generic_runtime_resume([locdesc] dev = ?)
> > -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev =
> > 0xFFFFFF890F619C10)
> > -006|rpm_callback(inline)
> > -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10,
> > [NSD:0xFFFFFFC041453AD4] rpmflags = 4)
> > -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4)
> > -008|pm_runtime_get_sync(inline)
> > -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00)
>
> Thank you for clarifying this.
>
> So patch avoids the long timeout by always cutting xhci reinit path short in
> xhci_resume() if resume was caused by pm_runtime_get_sync() call in
> xhci_plat_remove()
>
> void xhci_plat_remove(struct platform_device *dev)
> {
>         xhci->xhc_state |= XHCI_STATE_REMOVING;
>         pm_runtime_get_sync(&dev->dev);
>         ...
> }
>
> I think we can revert this patch, and just make sure that we don't reset the
> host in the reinit path of xhci_resume() if XHCI_STATE_REMOVING is set.
> Just return immediately instead.
>

Just to be sure, are you proposing that we skip xhci_reset() within
the reinit path
of xhci_resume()? If we do that, could that lead to issues with
subsequent operations
in the reinit sequence, such as xhci_init() or xhci_run()?

Do you prefer to group the change to skip xhci_reset() within the
reinit path together
with this revert? or do you want it to be sent and reviewed separately?

Thanks,
Roy