[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <55cfaedc-bd33-610c-143e-0342a14a41b6@quicinc.com>
Date: Tue, 25 Apr 2023 22:28:41 +0530
From: Krishna Kurapati PSSNV <quic_kriskura@...cinc.com>
To: Thinh Nguyen <Thinh.Nguyen@...opsys.com>
CC: Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Alan Stern <stern@...land.harvard.edu>,
Geert Uytterhoeven <geert+renesas@...der.be>,
Colin Ian King <colin.i.king@...il.com>,
Jiantao Zhang <water.zhangjiantao@...wei.com>,
"Rafael J . Wysocki" <rafael@...nel.org>,
"linux-usb@...r.kernel.org" <linux-usb@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"quic_ppratap@...cinc.com" <quic_ppratap@...cinc.com>,
"quic_wcheng@...cinc.com" <quic_wcheng@...cinc.com>,
"quic_jackp@...cinc.com" <quic_jackp@...cinc.com>,
"quic_ugoswami@...cinc.com" <quic_ugoswami@...cinc.com>
Subject: Re: [PATCH v2 1/2] usb: dwc3: gadget: Bail out in pullup if soft
reset timeout happens
On 4/6/2023 7:44 AM, Krishna Kurapati PSSNV wrote:
>
>
> On 4/6/2023 6:15 AM, Thinh Nguyen wrote:
>> On Wed, Apr 05, 2023, Krishna Kurapati PSSNV wrote:
>>>
>>>
>>> On 4/5/2023 3:13 AM, Thinh Nguyen wrote:
>>>> On Tue, Apr 04, 2023, Krishna Kurapati PSSNV wrote:
>>>>>
>>>>>
>>>>> On 4/4/2023 5:19 AM, Thinh Nguyen wrote:
>>>>>> On Thu, Mar 30, 2023, Krishna Kurapati PSSNV wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 3/30/2023 5:40 AM, Thinh Nguyen wrote:
>>>>>>>> On Wed, Mar 29, 2023, Krishna Kurapati PSSNV wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/29/2023 2:50 AM, Thinh Nguyen wrote:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> On Tue, Mar 28, 2023, Krishna Kurapati wrote:
>>>>>>>>>>> If the core soft reset timeout happens, avoid setting up event
>>>>>>>>>>> buffers and starting gadget as the writes to these registers
>>>>>>>>>>> may not reflect when in reset and setting the run stop bit
>>>>>>>>>>> can lead the controller to access wrong event buffer address
>>>>>>>>>>> resulting in a crash.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Krishna Kurapati <quic_kriskura@...cinc.com>
>>>>>>>>>>> ---
>>>>>>>>>>> drivers/usb/dwc3/gadget.c | 5 ++++-
>>>>>>>>>>> 1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/drivers/usb/dwc3/gadget.c
>>>>>>>>>>> b/drivers/usb/dwc3/gadget.c
>>>>>>>>>>> index 3c63fa97a680..f0472801d9a5 100644
>>>>>>>>>>> --- a/drivers/usb/dwc3/gadget.c
>>>>>>>>>>> +++ b/drivers/usb/dwc3/gadget.c
>>>>>>>>>>> @@ -2620,13 +2620,16 @@ static int dwc3_gadget_pullup(struct
>>>>>>>>>>> usb_gadget *g, int is_on)
>>>>>>>>>>> * device-initiated disconnect requires a core
>>>>>>>>>>> soft reset
>>>>>>>>>>> * (DCTL.CSftRst) before enabling the run/stop
>>>>>>>>>>> bit.
>>>>>>>>>>> */
>>>>>>>>>>> - dwc3_core_soft_reset(dwc);
>>>>>>>>>>> + ret = dwc3_core_soft_reset(dwc);
>>>>>>>>>>> + if (ret)
>>>>>>>>>>> + goto done;
>>>>>>>>>>> dwc3_event_buffers_setup(dwc);
>>>>>>>>>>> __dwc3_gadget_start(dwc);
>>>>>>>>>>> ret = dwc3_gadget_run_stop(dwc, true, false);
>>>>>>>>>>> }
>>>>>>>>>>> +done:
>>>>>>>>>>> pm_runtime_put(dwc->dev);
>>>>>>>>>>> return ret;
>>>>>>>>>>> --
>>>>>>>>>>> 2.40.0
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think there's one more place that may needs this check. Can
>>>>>>>>>> you also
>>>>>>>>>> add this check in __dwc3_set_mode()?
>>>>>>>>>
>>>>>>>>> Hi Thinh,
>>>>>>>>>
>>>>>>>>> Sure. Will do it.
>>>>>>>>> Will the below be good enough ? Or would it be good to add an
>>>>>>>>> error/warn log
>>>>>>>>> there>
>>>>>>>>
>>>>>>>> There's already a warning message in dwc3_core_soft_reset() if
>>>>>>>> it fails.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> kriskura@...kriskura-hyd:/local/mnt/workspace/krishna/skales2/skales/kernel$
>>>>>>>>> git diff drivers/usb/
>>>>>>>>> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
>>>>>>>>> index 476b63618511..8d1d213d1dcd 100644
>>>>>>>>> --- a/drivers/usb/dwc3/core.c
>>>>>>>>> +++ b/drivers/usb/dwc3/core.c
>>>>>>>>> @@ -210,7 +210,9 @@ static void __dwc3_set_mode(struct
>>>>>>>>> work_struct *work)
>>>>>>>>> }
>>>>>>>>> break;
>>>>>>>>> case DWC3_GCTL_PRTCAP_DEVICE:
>>>>>>>>> - dwc3_core_soft_reset(dwc);
>>>>>>>>> + ret = dwc3_core_soft_reset(dwc);
>>>>>>>>> + if (ret)
>>>>>>>>> + goto out;
>>>>>>>>>
>>>>>>>>> dwc3_event_buffers_setup(dwc);
>>>>>>>>>
>>>>>>>>
>>>>>>>> If soft-reset failed, the controller is in a bad state. We
>>>>>>>> should not
>>>>>>>> perform any further operation until the next hard reset. We
>>>>>>>> should flag
>>>>>>>> the controller as dead. I don't think we have the equivalent of the
>>>>>>>> host's HCD_FLAG_DEAD. It may require some work in the UDC core.
>>>>>>>> Perhaps
>>>>>>>> we can flag within dwc3 for now and prevent any further
>>>>>>>> operation for a
>>>>>>>> simpler fix.
>>>>>>>>
>>>>>>> Hi Thinh,
>>>>>>>
>>>>>>> Are you referring that if __dwc3_set_mode failed with core
>>>>>>> soft reset
>>>>>>> timing out, the caller i.e., dwc3_set_mode who queues the work
>>>>>>> need to know
>>>>>>> that the operation actually failed. So we can add a flag to
>>>>>>> indicate that
>>>>>>> gadget is dead and the caller of dwc3_set_mode can check the flag
>>>>>>> to see if
>>>>>>> the operation is successful or not.
>>>>>>>
>>>>>>> Or am I misunderstanding your comment ?
>>>>>>>
>>>>>>
>>>>>> Not just in __dwc3_set_mode(). I mean any time dwc3_core_soft_reset
>>>>>> fails, then we set this flag. So that it can prevent the user calling
>>>>>> any gadget ops causing more crashes/invalid behavior. The
>>>>>> dwc->softconnect is already wrong on pullup() on failure.
>>>>>>
>>>>>> So that we can have a check in different gadget ops. For pullup():
>>>>>>
>>>>>> static int dwc3_gadget_pullup() {
>>>>>> if (dwc->udc_is_dead) {
>>>>>> dev_err(dev, "reset me. x_x \n");
>>>>>> return;
>>>>>> }
>>>>>>
>>>>>> abc();
>>>>>> }
>>>>>>
>>>>>> Perhaps the effort is probably the same if we enhance the UDC core
>>>>>> for
>>>>>> this? In any case, I'm fine either way.
>>>>>>
>>>>>> Thanks,
>>>>>> Thinh
>>>>>
>>>>> Hi Thinh,
>>>>>
>>>>> So you don't want UDC to retry pullup if it fails the first time
>>>>> ? As per
>>>>> patch-2 of this series, I was trying to propagate the EITMEDOUT to
>>>>> UDC so
>>>>> that the caller (most probably configfs) can take appropriate
>>>>> action as to
>>>>> whether it must retry pullup or not.
>>>>>
>>>>
>>>> Now I'm confused. If the soft-reset times out, that means that the
>>>> soft-reset (self-clearing) bit isn't cleared. How can we retry if it's
>>>> stuck in this state? My impression is that soft-reset would not
>>>> complete
>>>> at all. Is that not the case for you, or it's simply because we need a
>>>> longer soft-reset timeout?
>>>>
>>>> Thanks,
>>>> Thinh
>>>
>>> Hi Thinh,
>>>
>>> Sorry for not being clear. The intention of patch-1 was to ensure
>>> we don't
>>> start the controller if reset times out and patch-2 was to ensure
>>> that UDC
>>> is in sync with controller by understanding that gadget_connect has
>>> failed
>>> and necessary cleanup has to be done in gadget_bind_driver.
>>
>> That should still be there.
>>
>>>
>>> But usually since the UDC_store is the one that is causing pullup to be
>>> called, the error value is propagated back to UDC_store. If it sees a
>>> failure, it does a retry to pullup.
>>>
>>> I didn't check whether subsequent retries by UDC to pullup are helping
>>> clear the reset bit or not. But I thought retrying pullup won't be of
>>> any
>>> harm.
>>
>> It's fine to retry. I'm thinking that the controller is in a bad state
>> at this point that we can't recover (hopefully that's not the case).
>>
>>>
>>> I now get that my patches are incomplete w.r.t handling the timeout.
>>>
>>> IIRC one of the following is what you are suggesting we need to do:
>>>
>>> Option-1:
>>> Set a flag when reset times out and clear it upon core_exit /
>>> core_init. If
>>> the flag is set, block calls to all the gadget_ops in dwc3. Basically
>>> even
>>> if retry happens from configfs/UDC, we bail out in pullup/udc_start even
>>> without trying the requested gadget operation.
>>>
>>> Option-2:
>>> If gadget_connect fails with -ETIMEDOUT in UDC, handle the failure and
>>> implement the same flag in UDC and don't even call any gadget_ops.
>>>
>>
> Hi Thinh,
>
> Thanks for the review.
>
>> I'm thinking of option-1. For option-2, we can't control if the
>> gadget_ops will be called. We only have control how we will respond in
>> case they get called again.
>>
>> But now I'm thinking again, I think it may be ok without adding the
>> flag. The UDC core and gadget driver won't do anything else before
>> pullup(1) is successful. Calling other gadget_ops should be harmless
>> (ie. it won't crash/break the system)?
>>
> I can give this a try in long run testing (For 7-14 days) and see if
> anything else is breaking.
>
> Most probably we do a composition switch / PIPO in between which can
> call usb_gadget_unregister_driver which might invoke a pullup(0)
> followed by udc_stop() and like you mentioned must not be a problem.
>
>> Sorry for the noise, but I think it may be ok without marking the
>> controller dead. I wonder if we can confirm my suspiction on retry? I
>> believe this is not easy to reproduce on your setup? If not, I think we
>> can take your change as is.
Hi Thinh,
I got this patch tested on two diff Gen-2 targets for around 10 days
and no issues were seen. (No SMMU fault seen on a 10 day run). Let me
know of any other concerns that might come up with this patch. Else I
can rebase it to get merged.
Regards,
Krishna,
Powered by blists - more mailing lists