linux-kernel - Re: [PATCH v3] usb: dwc3: gadget: Propagate core init errors to UDC during pullup

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <7efa451e-2601-f448-af33-b844091db264@quicinc.com>
Date:   Tue, 20 Jun 2023 10:23:55 +0530
From:   Krishna Kurapati PSSNV <quic_kriskura@...cinc.com>
To:     Johan Hovold <johan@...nel.org>,
        Thinh Nguyen <Thinh.Nguyen@...opsys.com>
CC:     Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        <linux-usb@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
        <quic_ppratap@...cinc.com>, <quic_wcheng@...cinc.com>,
        <quic_jackp@...cinc.com>, <quic_ugoswami@...cinc.com>
Subject: Re: [PATCH v3] usb: dwc3: gadget: Propagate core init errors to UDC
 during pullup



On 6/19/2023 8:36 PM, Johan Hovold wrote:
> On Mon, Jun 19, 2023 at 06:20:43PM +0530, Krishna Kurapati PSSNV wrote:
>> On 6/19/2023 12:36 PM, Johan Hovold wrote:
>>> On Sun, Jun 18, 2023 at 05:39:49PM +0530, Krishna Kurapati wrote:
> 
>>>> @@ -2747,7 +2747,9 @@ static int dwc3_gadget_pullup(struct usb_gadget *g, int is_on)
>>>>    	ret = pm_runtime_get_sync(dwc->dev);
>>>>    	if (!ret || ret < 0) {
>>>>    		pm_runtime_put(dwc->dev);
>>>> -		return 0;
>>>> +		if (ret < 0)
>>>> +			pm_runtime_set_suspended(dwc->dev);
>>>
>>> This bit is broken and is also not mentioned or explained in the commit
>>> message. What are you trying to achieve here?
>>>
>>> You cannot set the state like this after runtime PM is enabled and the
>>> above call will always fail.
> 
>> The reason why I an returning ret is because, when the first get_sync
>> fails because of core_init failure and we return 0 instead of ret, the
>> UDC thinks that controller has started successfully but we never set the
>> run stop bit.
> 
> That bit is clear.
> 
>> So when we plug out the cable,  the disconnect event won't
>> be generated and we never send on systems like android the user space
>> will never clear the UDC upon disconnect. Its a sort of mismatch between
>> controller and udc.
> 
> Ok, but the controller is an error state after the resume failure. And
> here you rely on user space to retry gadget activation in order to
> eventually detect the disconnect event?
>   
>> Also once the first get_sync fails, the dwc->dev->power.runtime_error
>> flag is set and successive calls to get_sync always return -EINVAL. In
>> this situation even if UDC/configfs retry pullup, resume_common will
>> never be called and we never actually start the controller or resume
>> dwc->dev.
>>
>> By calling set_suspended, I am trying to clear the runtime_error flag so
>> that the next retry to pullup will call resume_common and retry
>> core_init and set run_stop.
> 
> Ok, thanks, that's the bit I was missing in the commit message.
> 
> First, I perhaps mistakingly thought pm_runtime_set_suspended() may only
> be called with PM runtime disabled, but it appears it may indeed be
> valid to call also after an error but with the caveat that the device
> must then actually be in the suspended state.
> 
> The documentation and implementation is inconsistent here as the kernel
> doc for pm_runtime_set_suspended() clearly states:
> 
> 	It is not valid to call this function for devices with runtime
> 	PM enabled.
> 
> and it also looks like we'd end up with an active-child counter
> imbalance if anyone actually tries to do so.
> 
> But either way, it also seems like the controller is not guaranteed to
> be suspended here as pm_runtime_get_sync() may also fail after a
> previous errors that have left the controller in the active state?
>  > Also, what kind of errors would cause core_init and resume to fail here?
> 
Hi Johan,

   As per the comment just above the get_sync during pullup, the 
resume_common path is used to resume the controller and start peripheral 
mode incase the dwc3 was in suspended state. So if we are entering 
gadget_resume we are in suspended state the first time it is called.

Regarding the errors that might leave controller in active state, I have 
faced issue in core init, and in that function there is a cleanup 
happening in case something fails. So controller was actually not in 
active state after cleanup was done. In resume common, if 
core_init_for_resume is failing, we cleanup everything initialized up 
until that point.

The scenario you mentioned would be applicable in case gadget_resume 
fails. We are not having a return value check and I am not sure what 
would be the side effect of not having that check there. Either ways, 
since it was failing for core init, I went ahead and made this patch.

As per the reason for failure in core init, the following is what was 
happening at customer's end:

1. Cable plug-in
2. get_sync calls resume common which inturn calls core_init
3. core soft reset fails in core init, we cleanup and return -110
4. After applying this patch, the -110 was propagated to UDC properly
5. We got a second call to pullup via connect_control and this time 
reset was successful.

The behavior was similar to [1]. There as well, on all Gen-2 targets, 
after the retry happens I see soft reset is passing but failing for 
first attempt.

[1]: 
https://lore.kernel.org/all/20230510075252.31023-2-quic_kriskura@quicinc.com/

Regards,
Krishna,

> If this is something that you see during normal operation then this
> seems to suggest that something is wrong with the runtime pm
> implementation.
> 
> Note that virtually all drivers treat resume failures as fatal errors
> and do not implement any recovery from that.
> 
> In fact, the only other example of this kind of usage that I could find
> is also for a Qualcomm driver...
> 
> Johan