linux-kernel - Re: [PATCH v1] Revert "usb: xhci: Implement xhci_handshake_check

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <fbf92981-6601-4ee9-a494-718e322ac1b9@linux.intel.com>
Date: Tue, 20 May 2025 19:18:21 +0300
From: Mathias Nyman <mathias.nyman@...ux.intel.com>
To: Udipto Goswami <udipto.goswami@....qualcomm.com>
Cc: Roy Luo <royluo@...gle.com>, mathias.nyman@...el.com,
 quic_ugoswami@...cinc.com, Thinh.Nguyen@...opsys.com,
 gregkh@...uxfoundation.org, michal.pecio@...il.com,
 linux-usb@...r.kernel.org, linux-kernel@...r.kernel.org,
 stable@...r.kernel.org
Subject: Re: [PATCH v1] Revert "usb: xhci: Implement
 xhci_handshake_check_state() helper"

On 19.5.2025 21.13, Udipto Goswami wrote:
> On Mon, May 19, 2025 at 6:23 PM Mathias Nyman
> <mathias.nyman@...ux.intel.com> wrote:
>>
>> On 17.5.2025 7.39, Roy Luo wrote:
>>> This reverts commit 6ccb83d6c4972ebe6ae49de5eba051de3638362c.
>>>
>>> Commit 6ccb83d6c497 ("usb: xhci: Implement xhci_handshake_check_state()
>>> helper") was introduced to workaround watchdog timeout issues on some
>>> platforms, allowing xhci_reset() to bail out early without waiting
>>> for the reset to complete.
>>>
>>> Skipping the xhci handshake during a reset is a dangerous move. The
>>> xhci specification explicitly states that certain registers cannot
>>> be accessed during reset in section 5.4.1 USB Command Register (USBCMD),
>>> Host Controller Reset (HCRST) field:
>>> "This bit is cleared to '0' by the Host Controller when the reset
>>> process is complete. Software cannot terminate the reset process
>>> early by writinga '0' to this bit and shall not write any xHC
>>> Operational or Runtime registers until while HCRST is '1'."
>>>
>>> This behavior causes a regression on SNPS DWC3 USB controller with
>>> dual-role capability. When the DWC3 controller exits host mode and
>>> removes xhci while a reset is still in progress, and then tries to
>>> configure its hardware for device mode, the ongoing reset leads to
>>> register access issues; specifically, all register reads returns 0.
>>> These issues extend beyond the xhci register space (which is expected
>>> during a reset) and affect the entire DWC3 IP block, causing the DWC3
>>> device mode to malfunction.
>>
>> I agree with you and Thinh that waiting for the HCRST bit to clear during
>> reset is the right thing to do, especially now when we know skipping it
>> causes issues for SNPS DWC3, even if it's only during remove phase.
>>
>> But reverting this patch will re-introduce the issue originally worked
>> around by Udipto Goswami, causing regression.
>>
>> Best thing to do would be to wait for HCRST to clear for all other platforms
>> except the one with the issue.
>>
>> Udipto Goswami, can you recall the platforms that needed this workaroud?
>> and do we have an easy way to detect those?
> 
> Hi Mathias,
> 
>  From what I recall, we saw this issue coming up on our QCOM mobile
> platforms but it was not consistent. It was only reported in long runs
> i believe. The most recent instance when I pushed this patch was with
> platform SM8650, it was a watchdog timeout issue where xhci_reset() ->
> xhci_handshake() polling read timeout upon xhci remove. Unfortunately
> I was not able to simulate the scenario for more granular testing and
> had validated it with long hours stress testing.
> The callstack was like so:
> 
> Full call stack on core6:
> -000|readl([X19] addr = 0xFFFFFFC03CC08020)
> -001|xhci_handshake(inline)
> -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000)
> -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?)
> -003|xhci_plat_runtime_resume([locdesc] dev = ?)
> -004|pm_generic_runtime_resume([locdesc] dev = ?)
> -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev =
> 0xFFFFFF890F619C10)
> -006|rpm_callback(inline)
> -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10,
> [NSD:0xFFFFFFC041453AD4] rpmflags = 4)
> -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4)
> -008|pm_runtime_get_sync(inline)
> -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00)

Thank you for clarifying this.

So patch avoids the long timeout by always cutting xhci reinit path short in
xhci_resume() if resume was caused by pm_runtime_get_sync() call in
xhci_plat_remove()

void xhci_plat_remove(struct platform_device *dev)
{
	xhci->xhc_state |= XHCI_STATE_REMOVING;
	pm_runtime_get_sync(&dev->dev);
	...
}

I think we can revert this patch, and just make sure that we don't reset the
host in the reinit path of xhci_resume() if XHCI_STATE_REMOVING is set.
Just return immediately instead.

xhci_reset() will be called with a shorter timeout later in the remove path

Not entirely sure remove path needs to call pm_runtime_get_sync().
I think it just tries to prevent runtime suspend/resume from racing with remove.
PCI code seems to call pm_runtime_get_noresume() in remove path instead.

Thanks
Mathias