linux-kernel - Re: [PATCH v2 1/2] usb: dwc3: Not set DWC3_EP_END_TRANSFER

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <e3332511-82d3-2892-ad72-a0c167273174@synopsys.com>
Date:   Thu, 17 Feb 2022 23:43:22 +0000
From:   Thinh Nguyen <Thinh.Nguyen@...opsys.com>
To:     Wesley Cheng <quic_wcheng@...cinc.com>,
        Thinh Nguyen <Thinh.Nguyen@...opsys.com>,
        Jung Daehwan <dh10.jung@...sung.com>
CC:     Felipe Balbi <balbi@...nel.org>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        "linux-usb@...r.kernel.org" <linux-usb@...r.kernel.org>,
        open list <linux-kernel@...r.kernel.org>,
        "quic_jackp@...cinc.com" <quic_jackp@...cinc.com>
Subject: Re: [PATCH v2 1/2] usb: dwc3: Not set DWC3_EP_END_TRANSFER_PENDING in
 ep cmd fails

Wesley Cheng wrote:
> Hi Thinh,
> 
> On 2/15/2022 9:14 AM, Thinh Nguyen wrote:
>> Jung Daehwan wrote:
>>> Hi Thinh,
>>>
>>> On Mon, Feb 14, 2022 at 06:44:33PM +0000, Thinh Nguyen wrote:
>>>> Hi,
>>>>
>>>> Daehwan Jung wrote:
>>>>> It always sets DWC3_EP_END_TRANSFER_PENDING in dwc3_stop_active_transfer
>>>>> even if dwc3_send_gadget_ep_cmd fails. It can cause some problems like
>>>>
>>>> How does it fail? Timed out?
>>>
>>> Yes, timed out.
>>>>
>>>>> skipping clear stall commmand or giveback from dequeue. We fix to set it
>>>>> only when ep cmd success. Additionally, We clear DWC3_EP_TRANSFER_STARTED
>>>>> for next trb to start transfer not update transfer.
>>>>
>>>> We shouldn't do this. Things will be out of sync. It may work for 1
>>>> scenario, but it won't work for others.
>>>>
>>>> Please help me understand a few things:
>>>>
>>>> 1) What is the scenario that triggers this? Is it random?
>>>>
>>> ep cmd timeout occurs on dequeue request from user side. End Transfer command
>>> would be sent in dwc3_stop_active transfer.
>>
>> At the high level, what's triggering the request dequeue? Is it from a
>> disconnect, change of interface, or simply function driver protocol that
>> changes it.
>>
>> What application was used to trigger this?
>>
> Sorry for jumping in here, but looks like Daehwan is running into a
> similar issue I am seeing as well.
> 
> At least in my scenario, the dequeue is coming from a function driver
> which exposes a device to userspace.  Once that device is closed, it
> will issue a dequeue on all pending/submitted requests.

Dequeuing request is coming from the function driver, but what causes
the dequeue. For example, the End Transfer command due to a disconnect
may give a different clues than a dequeue from a change of interface.

> 
>>>
>>>> 2) Are there other traffics pending while issuing the End Transfer
>>>> command? If so, what transfer type(s)?
>>>>
>>> I haven't checked it yet. 
>>
>> Can you check?
>>
> For the cases where we've collected a crash log, we can see that during
> the END transfer timeouts there was always a pending EP0 transaction.
> We had reached out to our internal HW folks to get some inputs on what
> could be causing this kind of issue, and we were able to get some
> recommendations from their Synopsis POCs.

It's "Synopsys" :)

> 
> It was mentioned that if there was an active EP0 transfer, an end
> transfer command on a non-control EP can fail w/ timed out.
> 

What controller version are you using? And what version is Jung using?
Do you have the STAR number of the issue. If you're using a different
version than Jung's, then it may not be the same issue.

>>>
>>>> 3) Have you tried increasing the timeout?
>>>>
>>> No, I haven't.
>>
>> Can you try up to 10 seconds (just for experiment)
>>
> I've tried this too, and it did not help.
> 
>>>> BR,
>>>> Thinh
>>>>
>>>
>>> This issue occurs very rarely on customer. I only have restricted
>>> information. That's why I've been trying to reproduce it.
>>
>> How did you test your fix if you can't reproduce it?
>>
>>>
>>> Wesley may have run into same issue and you can see this issue in detail.
>>> https://urldefense.com/v3/__https://protect2.fireeye.com/v1/url?k=9d423b69-fc3fd32e-9d43b026-74fe485fff30-77a099b52659410d&q=1&e=20b4d9f5-2599-4f57-8b6a-7c4ec167d228&u=https*3A*2F*2Flore.kernel.org*2Flinux-usb*2F20220203080017.27339-1-quic_wcheng*40quicinc.com*2F__;JSUlJSUlJQ!!A4F2R9G_pg!JWPzNLoO3BFX_IZCVzmHPtxq6frr_VFbSNNaxSQylunt1Y4EauTOefth2LCIcVEuTx8E$ 
>>>
>>
>> I can take a look, but please provide the tracepoints of the failure if
>> you can reproduce it.
>>
> Let me see if I have any previous traces I can share.  If not, I have a
> pretty reliable repro set up I can collect a trace for you.  For now, I
> will focus on just getting the endxfer timeout seen during ep dequeue.
> As mentioned on my patchset, this can happen during device-initiated
> disconnect as well.
> 

Your patch set is still on my todo list. I haven't reviewed it. There's
some concern looking at it from a first glance, I'll check it out more
thoroughly later.

Can you provide the tracepoints?

Thanks,
Thinh