Message-ID: <60227688-4435-516d-2525-b31fabbef273@linux.intel.com>
Date:   Wed, 9 Mar 2022 11:29:29 +0200
From:   Mathias Nyman <mathias.nyman@...ux.intel.com>
To:     Martin Kepplinger <martin.kepplinger@...i.sm>,
        Greg KH <gregkh@...uxfoundation.org>
Cc:     mathias.nyman@...el.com, kernel@...i.sm, linux-usb@...r.kernel.org,
        linux-kernel@...r.kernel.org
Subject: Re: [PATCH] usb: xhci: make XHCI_STOP_EP_CMD_TIMEOUT a module
 parameter

On 9.3.2022 9.56, Martin Kepplinger wrote:
> On Tuesday, 08.03.2022 at 17:17 +0100, Martin Kepplinger wrote:
>> On Monday, 07.03.2022 at 10:49 +0200, Mathias Nyman wrote:
>>> On 4.3.2022 16.17, Greg KH wrote:
>>>> On Fri, Mar 04, 2022 at 12:30:57PM +0100, Martin Kepplinger wrote:
>>>>> On the Librem 5 imx8mq system we've seen the stop endpoint command
>>>>> time out regularly, which results in the hub dying.
>>>>>
>>>>> While on the one hand we see "Port resume timed out, port 1-1: 0xfe3"
>>>>> before this and on the other hand driver comments suggest that the
>>>>> driver might be able to recover instead of dying here, Sarah seemed to
>>>>> have a workaround for this particular problem in mind already:
>>>>>
>>>>> Make it a module parameter. So while it might not be the root cause
>>>>> of the problem, do this to give users a workaround.
>>>>
>>>> This is not the 1990's, sorry, please do not add new module parameters.
>>>> They modify code, when you want to modify an individual device.
>>>>
>>>
>>> Agree, I think we really need to find the root cause here.
>>>
>>> There's a known problem with this stop endpoint timeout timer.
>>>
>>> For all other commands we start the timer when the controller starts
>>> processing the command, but the stop endpoint timer is started
>>> immediately when the command is queued.
>>> So it might time out if some other command before it failed.
>>>
>>> I have a patch series for this. It's still work in progress but should
>>> be testable.
>>> Pushed to a branch named stop_endpoint_fixes
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git stop_endpoint_fixes
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mnyman/xhci.git/log/?h=stop_endpoint_fixes
>>>
>>> Can you try it out and see if it helps?
>>>
>>
>> thanks a lot Mathias, I'm running these now. The timeout has not been
>> easy to reproduce (or I'm just lazy) but in a few days I should be able
>> to tell whether that helps.
>>
>> So this thread has been about
>>
>> [14145.960512] xhci-hcd xhci-hcd.4.auto: Port resume timed out, port 1-1: 0xfe3
>> [14156.308511] xhci-hcd xhci-hcd.4.auto: xHCI host not responding to stop endpoint command.
>>
>> that I previously tried to work around by increasing
>> XHCI_MAX_REXIT_TIMEOUT_MS and XHCI_STOP_EP_CMD_TIMEOUT.
>>
>>
>> These patches can't help with the following, right?
>> readl_poll_timeout_atomic() with a fixed timeout is called in this case:
>>
>> xhci-hcd xhci-hcd.4.auto: Abort failed to stop command ring: -110
>>
>> I see that too from time to time. It results in the HC dying as well.
>>
>> thanks,
>>                               martin
>>
> 
> hi Mathias,
> 
> I have already seen "Port resume timed out" and the HC dying while
> running your patches. I've appended the logs.
> 
> So for now I have had more success with increasing
> XHCI_MAX_REXIT_TIMEOUT_MS.
> 

XHCI_MAX_REXIT_TIMEOUT_MS is only 20ms; we can probably change that.

Is 40ms enough?

"Port resume timed out, port 1-1: 0xfe3" means the port link state is still
in Resume even though we asked the link to go to U0 20ms ago.
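
For anyone reading the log without the register layout at hand, here is a
rough decode of the 0xfe3 value, assuming it is the raw PORTSC register
contents. The bit positions follow the xHCI PORTSC layout (Port Link State in
bits 8:5, where 15 means Resume and 0 means U0). The Python helper and its
names are only an illustration, not driver code.

```python
# Illustrative decode of an xHCI PORTSC value such as the 0xfe3 in the log.
# Bit positions follow the xHCI PORTSC register layout; the constant and
# function names here are made up for this sketch.
PORT_CONNECT = 1 << 0          # Current Connect Status
PORT_PE = 1 << 1               # Port Enabled
PORT_PLS_SHIFT = 5
PORT_PLS_MASK = 0xF << PORT_PLS_SHIFT   # Port Link State, bits 8:5
PORT_POWER = 1 << 9            # Port Power
PORT_SPEED_SHIFT = 10
PORT_SPEED_MASK = 0xF << PORT_SPEED_SHIFT  # Port Speed ID, bits 13:10

# Only the link states relevant to this thread are named.
PLS_NAMES = {0: "U0", 1: "U1", 2: "U2", 3: "U3", 15: "Resume"}


def decode_portsc(portsc):
    """Split a PORTSC value into the fields discussed in this thread."""
    pls = (portsc & PORT_PLS_MASK) >> PORT_PLS_SHIFT
    return {
        "connected": bool(portsc & PORT_CONNECT),
        "enabled": bool(portsc & PORT_PE),
        "link_state": PLS_NAMES.get(pls, "PLS=%d" % pls),
        "powered": bool(portsc & PORT_POWER),
        "speed_id": (portsc & PORT_SPEED_MASK) >> PORT_SPEED_SHIFT,
    }
```

decode_portsc(0xfe3) reports a connected, enabled, powered port whose link
state is "Resume", which matches the reading above: the link never reached U0
within the timeout.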

Maybe this hw combination just resumes slowly.
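
To make the 20ms vs 40ms question concrete, a toy model, not the driver's
actual wait logic: a port that stays in Resume for some time and then reports
U0, polled against a fixed budget. The 30ms resume time used in the note below
is purely hypothetical and not a measurement from the Librem 5. Both function
names are invented for this sketch.

```python
def make_slow_port(resume_ms):
    """Hypothetical port: reports "Resume" until resume_ms have passed, then "U0"."""
    def read_link_state(now_ms):
        return "U0" if now_ms >= resume_ms else "Resume"
    return read_link_state


def wait_for_u0(read_link_state, timeout_ms, step_ms=1):
    """Poll the (simulated) link state until it is U0 or the budget is spent.

    Returns the elapsed time in ms on success, or None on timeout.
    Time is simulated by a counter, so nothing here actually sleeps.
    """
    elapsed = 0
    while elapsed <= timeout_ms:
        if read_link_state(elapsed) == "U0":
            return elapsed
        elapsed += step_ms
    return None
```

With a made-up port that needs 30ms, wait_for_u0(make_slow_port(30), 20)
returns None (the "timed out" case), while the same port with a 40ms budget
returns 30. Whether 40ms is enough on real hardware can only be answered by
testing.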

Thanks
Mathias