linux-kernel - Re: [RFC] usb: dwc3: core: Fix RAM interface getting stuck during enumeration

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <6b2115c2-ddd1-401e-81ef-e998264bfd89@quicinc.com>
Date:   Sun, 15 Oct 2023 22:18:05 +0530
From:   Krishna Kurapati PSSNV <quic_kriskura@...cinc.com>
To:     Thinh Nguyen <Thinh.Nguyen@...opsys.com>
CC:     Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        "linux-usb@...r.kernel.org" <linux-usb@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "quic_ppratap@...cinc.com" <quic_ppratap@...cinc.com>,
        "quic_wcheng@...cinc.com" <quic_wcheng@...cinc.com>,
        "quic_jackp@...cinc.com" <quic_jackp@...cinc.com>,
        "quic_ugoswami@...cinc.com" <quic_ugoswami@...cinc.com>
Subject: Re: [RFC] usb: dwc3: core: Fix RAM interface getting stuck during
 enumeration



On 10/13/2023 9:28 AM, Krishna Kurapati PSSNV wrote:
> 
> 
> On 10/13/2023 3:47 AM, Thinh Nguyen wrote:
>> On Fri, Oct 13, 2023, Krishna Kurapati PSSNV wrote:
>>>
>>>
>>> On 10/12/2023 11:29 PM, Thinh Nguyen wrote:
>>>
>>>>> -static int dwc3_gadget_soft_disconnect(struct dwc3 *dwc)
>>>>> +int dwc3_gadget_soft_disconnect(struct dwc3 *dwc)
>>>>>    {
>>>>>        unsigned long flags;
>>>>>        int ret;
>>>>> @@ -2701,7 +2701,7 @@ static int dwc3_gadget_soft_disconnect(struct 
>>>>> dwc3 *dwc)
>>>>>        return ret;
>>>>>    }
>>>>> -static int dwc3_gadget_soft_connect(struct dwc3 *dwc)
>>>>> +int dwc3_gadget_soft_connect(struct dwc3 *dwc)
>>>>>    {
>>>>>        int ret;
>>>>> @@ -3963,6 +3963,7 @@ static void 
>>>>> dwc3_gadget_disconnect_interrupt(struct dwc3 *dwc)
>>>>>        dwc3_gadget_dctl_write_safe(dwc, reg);
>>>>>        dwc->connected = false;
>>>>> +    dwc->cable_disconnected = true;
>>>>>        dwc3_disconnect_gadget(dwc);
>>>>> @@ -4038,6 +4039,7 @@ static void 
>>>>> dwc3_gadget_reset_interrupt(struct dwc3 *dwc)
>>>>>         */
>>>>>        dwc3_stop_active_transfers(dwc);
>>>>>        dwc->connected = true;
>>>>> +    dwc->cable_disconnected = false;
>>>>>        reg = dwc3_readl(dwc->regs, DWC3_DCTL);
>>>>>        reg &= ~DWC3_DCTL_TSTCTRL_MASK;
>>>>> -- 
>>>>> 2.42.0
>>>>>
>>>>
>>>> We can just reset the controller when there's End Transfer command
>>>> timeout as a failure recovery. No need to do what you're doing here.
>>>>
>>> Hi Thinh,
>>>
>>>   That was what I initially wanted to do, but there were couple of 
>>> reasons I
>>> wanted to take this approach:
>>>
>>> 1. We can't just reset the controller in midst of gadget_interrupt. 
>>> We need
>>> to process it completely and then take action.
>>
>> You can flag the driver so you can do the teardown/soft-reset at the
>> appropriate time.
>>
>>>
>>> 2. The above log was seen on QRD variant of SM8550/SM8650 easily. But on
>>> other platforms of same targets, the issue comes up at some other 
>>> instances
>>> of code, at a point where no IRQ is running. In such cases its not 
>>> possible
>>> to accurately find out code portions and reset the controller. The way I
>>> confirmed that both platforms are having the same issue is:
>>>
>>> a. During cable disconnect, I am not receiving disconnect interrupt
>>> b. The reg dump is exactly same in both cases (BMU as well)
>>>
>>> So I felt it was better to fix it during cable disconnect because 
>>> even if we
>>> remove cable, we are still in device mode only and in this case we can
>>> unblock suspend and also bring back controller to a known state.
>>>
>>> Let me know your thoughts on the above.
>>>
>>
>> This issue happens outside of disconnect right? Did you account for port
>> reset?
>>
>> The symptom should be the same. At some point, a command will be issued.
>> If a command timed out, then something is really wrong (especially End
>> Transfer command). We can attempt to recover base on this symptom then.
>>
>> And you don't need to poll for timeout for this specific type of error.
>> Just read some known register like GSNPSID to see if it's invalid.
> 
> Hi Thinh,
> 
>   Yes. It actually happens before disconnect, but I was trying to handle 
> it during disconnect. How about I add a check in gadget_ep_cmd saying 
> that if cmd timeout happens, then read the snpsid and if it reads an 
> invalid value, then I will call soft_disconnect followed by 
> soft_connect. That must cover all cases ?
> 
Hi Thinh,

One more datapoint for putting forward the above opinion. There are two 
types of situations observed during bus reset when we try to do endxfer 
on ep0in followed by set_stall on ep0out:

1. If we issue endxfer for ep0in, and it times out, then the next 
command set_stall is successful (only because all reads give zero and 
since we check for CMDACT bit for ep0out after setting it, and since the 
read of DEPCMD register gives "0" while polling for it, sw indicated 
that set stall is successful. But when we check GEVTADDR at this point, 
both ADDRLO and ADDRHI read zero.

2. If ram access timeout occurs while polling for CMDACT bit after 
issuing endxfer on ep0in, then at that instant, polling returns true, sw 
indicates that endxfer is successful but inreality when checked, 
GEVTADDR LO/HI read zero.

So I think it would be better to check for validity of GEVADDR registers 
at end of gadget_ep_cmd to identify id RAM access timeout happened and 
call soft_disconnect followed by soft_connect.

Let me know your thoughts on this

Regards,
Krishna,