linux-kernel - Re: [PATCH v1] Revert "usb: xhci: Implement xhci_handshake_check

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <6bfee225-7519-41ab-8ae9-99267c5ce06e@intel.com>
Date: Thu, 22 May 2025 15:24:04 +0300
From: Mathias Nyman <mathias.nyman@...el.com>
To: Roy Luo <royluo@...gle.com>, Mathias Nyman <mathias.nyman@...ux.intel.com>
Cc: Udipto Goswami <udipto.goswami@....qualcomm.com>,
 quic_ugoswami@...cinc.com, Thinh.Nguyen@...opsys.com,
 gregkh@...uxfoundation.org, michal.pecio@...il.com,
 linux-usb@...r.kernel.org, linux-kernel@...r.kernel.org,
 stable@...r.kernel.org
Subject: Re: [PATCH v1] Revert "usb: xhci: Implement
 xhci_handshake_check_state() helper"

On 22.5.2025 5.21, Roy Luo wrote:
>>>> Udipto Goswami, can you recall the platforms that needed this workaroud?
>>>> and do we have an easy way to detect those?
>>>
>>> Hi Mathias,
>>>
>>>   From what I recall, we saw this issue coming up on our QCOM mobile
>>> platforms but it was not consistent. It was only reported in long runs
>>> i believe. The most recent instance when I pushed this patch was with
>>> platform SM8650, it was a watchdog timeout issue where xhci_reset() ->
>>> xhci_handshake() polling read timeout upon xhci remove. Unfortunately
>>> I was not able to simulate the scenario for more granular testing and
>>> had validated it with long hours stress testing.
>>> The callstack was like so:
>>>
>>> Full call stack on core6:
>>> -000|readl([X19] addr = 0xFFFFFFC03CC08020)
>>> -001|xhci_handshake(inline)
>>> -001|xhci_reset([X19] xhci = 0xFFFFFF8942052250, [X20] timeout_us = 10000000)
>>> -002|xhci_resume([X20] xhci = 0xFFFFFF8942052250, [?] hibernated = ?)
>>> -003|xhci_plat_runtime_resume([locdesc] dev = ?)
>>> -004|pm_generic_runtime_resume([locdesc] dev = ?)
>>> -005|__rpm_callback([X23] cb = 0xFFFFFFE3F09307D8, [X22] dev =
>>> 0xFFFFFF890F619C10)
>>> -006|rpm_callback(inline)
>>> -006|rpm_resume([X19] dev = 0xFFFFFF890F619C10,
>>> [NSD:0xFFFFFFC041453AD4] rpmflags = 4)
>>> -007|__pm_runtime_resume([X20] dev = 0xFFFFFF890F619C10, [X19] rpmflags = 4)
>>> -008|pm_runtime_get_sync(inline)
>>> -008|xhci_plat_remove([X20] dev = 0xFFFFFF890F619C00)
>>
>> Thank you for clarifying this.
>>
>> So patch avoids the long timeout by always cutting xhci reinit path short in
>> xhci_resume() if resume was caused by pm_runtime_get_sync() call in
>> xhci_plat_remove()
>>
>> void xhci_plat_remove(struct platform_device *dev)
>> {
>>          xhci->xhc_state |= XHCI_STATE_REMOVING;
>>          pm_runtime_get_sync(&dev->dev);
>>          ...
>> }
>>
>> I think we can revert this patch, and just make sure that we don't reset the
>> host in the reinit path of xhci_resume() if XHCI_STATE_REMOVING is set.
>> Just return immediately instead.
>>
> 
> Just to be sure, are you proposing that we skip xhci_reset() within
> the reinit path
> of xhci_resume()? If we do that, could that lead to issues with
> subsequent operations
> in the reinit sequence, such as xhci_init() or xhci_run()?

I suggest to only skip xhci_reset in xhci_resume() if XHCI_STATE_REMOVING is set.

This should be similar to what is going on already.

xhci_reset() currently returns -ENODEV if XHCI_STATE_REMOVING is set, unless reset
completes extremely fast. xhci_resume() bails out if xhci_reset() returns error:

xhci_resume()
   ...
   if (power_lost) {
     ...
     retval = xhci_reset(xhci, XHCI_RESET_LONG_USEC);
     spin_unlock_irq(&xhci->lock);
     if (retval)
       return retval;
> 
> Do you prefer to group the change to skip xhci_reset() within the
> reinit path together
> with this revert? or do you want it to be sent and reviewed separately?

First a patch that bails out from xhci_resume() if XHCI_STATE_REMOVING is set
and we are in the reinit (power_lost) path about to call xhci_reset();

Then a second patch that reverts 6ccb83d6c497 ("usb: xhci: Implement
xhci_handshake_check_state()

Does this sound reasonable?

should avoid the QCOM 10sec watchdog issue as next xhci_rest() called
in xhci_remove() path has a short 250ms timeout, and ensure the
SNPS DWC3 USB regression won't trigger.

Thanks
Mathias