lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 26 Apr 2023 06:13:18 +0530
From:   Krishna Kurapati PSSNV <quic_kriskura@...cinc.com>
To:     Thinh Nguyen <Thinh.Nguyen@...opsys.com>
CC:     Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Alan Stern <stern@...land.harvard.edu>,
        Geert Uytterhoeven <geert+renesas@...der.be>,
        Colin Ian King <colin.i.king@...il.com>,
        Jiantao Zhang <water.zhangjiantao@...wei.com>,
        "Rafael J . Wysocki" <rafael@...nel.org>,
        "linux-usb@...r.kernel.org" <linux-usb@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "quic_ppratap@...cinc.com" <quic_ppratap@...cinc.com>,
        "quic_wcheng@...cinc.com" <quic_wcheng@...cinc.com>,
        "quic_jackp@...cinc.com" <quic_jackp@...cinc.com>,
        "quic_ugoswami@...cinc.com" <quic_ugoswami@...cinc.com>
Subject: Re: [PATCH v2 1/2] usb: dwc3: gadget: Bail out in pullup if soft
 reset timeout happens



On 4/26/2023 5:52 AM, Thinh Nguyen wrote:
> On Tue, Apr 25, 2023, Krishna Kurapati PSSNV wrote:
>>
>>
>> On 4/6/2023 7:44 AM, Krishna Kurapati PSSNV wrote:
>>>
>>>
>>> On 4/6/2023 6:15 AM, Thinh Nguyen wrote:
>>>> On Wed, Apr 05, 2023, Krishna Kurapati PSSNV wrote:
>>>>>
>>>>>
>>>>> On 4/5/2023 3:13 AM, Thinh Nguyen wrote:
>>>>>> On Tue, Apr 04, 2023, Krishna Kurapati PSSNV wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 4/4/2023 5:19 AM, Thinh Nguyen wrote:
>>>>>>>> On Thu, Mar 30, 2023, Krishna Kurapati PSSNV wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 3/30/2023 5:40 AM, Thinh Nguyen wrote:
>>>>>>>>>> On Wed, Mar 29, 2023, Krishna Kurapati PSSNV wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 3/29/2023 2:50 AM, Thinh Nguyen wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Mar 28, 2023, Krishna Kurapati wrote:
>>>>>>>>>>>>> If the core soft reset timeout happens, avoid setting up event
>>>>>>>>>>>>> buffers and starting gadget as the writes to these registers
>>>>>>>>>>>>> may not reflect when in reset and setting the run stop bit
>>>>>>>>>>>>> can lead the controller to access wrong event buffer address
>>>>>>>>>>>>> resulting in a crash.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Signed-off-by: Krishna Kurapati <quic_kriskura@...cinc.com>
>>>>>>>>>>>>> ---
>>>>>>>>>>>>>        drivers/usb/dwc3/gadget.c | 5 ++++-
>>>>>>>>>>>>>        1 file changed, 4 insertions(+), 1 deletion(-)
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git
>>>>>>>>>>>>> a/drivers/usb/dwc3/gadget.c
>>>>>>>>>>>>> b/drivers/usb/dwc3/gadget.c
>>>>>>>>>>>>> index 3c63fa97a680..f0472801d9a5 100644
>>>>>>>>>>>>> --- a/drivers/usb/dwc3/gadget.c
>>>>>>>>>>>>> +++ b/drivers/usb/dwc3/gadget.c
>>>>>>>>>>>>> @@ -2620,13 +2620,16 @@ static
>>>>>>>>>>>>> int dwc3_gadget_pullup(struct
>>>>>>>>>>>>> usb_gadget *g, int is_on)
>>>>>>>>>>>>>                 *
>>>>>>>>>>>>> device-initiated disconnect
>>>>>>>>>>>>> requires a core soft reset
>>>>>>>>>>>>>                 * (DCTL.CSftRst)
>>>>>>>>>>>>> before enabling the run/stop
>>>>>>>>>>>>> bit.
>>>>>>>>>>>>>                 */
>>>>>>>>>>>>> -        dwc3_core_soft_reset(dwc);
>>>>>>>>>>>>> +        ret = dwc3_core_soft_reset(dwc);
>>>>>>>>>>>>> +        if (ret)
>>>>>>>>>>>>> +            goto done;
>>>>>>>>>>>>>                dwc3_event_buffers_setup(dwc);
>>>>>>>>>>>>>                __dwc3_gadget_start(dwc);
>>>>>>>>>>>>>                ret = dwc3_gadget_run_stop(dwc, true, false);
>>>>>>>>>>>>>            }
>>>>>>>>>>>>> +done:
>>>>>>>>>>>>>            pm_runtime_put(dwc->dev);
>>>>>>>>>>>>>            return ret;
>>>>>>>>>>>>> -- 
>>>>>>>>>>>>> 2.40.0
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I think there's one more place that
>>>>>>>>>>>> may needs this check. Can you also
>>>>>>>>>>>> add this check in __dwc3_set_mode()?
>>>>>>>>>>>
>>>>>>>>>>> Hi Thinh,
>>>>>>>>>>>
>>>>>>>>>>>        Sure. Will do it.
>>>>>>>>>>> Will the below be good enough ? Or would
>>>>>>>>>>> it be good to add an error/warn log
>>>>>>>>>>> there>
>>>>>>>>>>
>>>>>>>>>> There's already a warning message in
>>>>>>>>>> dwc3_core_soft_reset() if it fails.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> kriskura@...kriskura-hyd:/local/mnt/workspace/krishna/skales2/skales/kernel$
>>>>>>>>>>> git diff drivers/usb/
>>>>>>>>>>> diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
>>>>>>>>>>> index 476b63618511..8d1d213d1dcd 100644
>>>>>>>>>>> --- a/drivers/usb/dwc3/core.c
>>>>>>>>>>> +++ b/drivers/usb/dwc3/core.c
>>>>>>>>>>> @@ -210,7 +210,9 @@ static void
>>>>>>>>>>> __dwc3_set_mode(struct work_struct
>>>>>>>>>>> *work)
>>>>>>>>>>>                      }
>>>>>>>>>>>                      break;
>>>>>>>>>>>              case DWC3_GCTL_PRTCAP_DEVICE:
>>>>>>>>>>> -               dwc3_core_soft_reset(dwc);
>>>>>>>>>>> +               ret = dwc3_core_soft_reset(dwc);
>>>>>>>>>>> +               if (ret)
>>>>>>>>>>> +                       goto out;
>>>>>>>>>>>
>>>>>>>>>>>                      dwc3_event_buffers_setup(dwc);
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> If soft-reset failed, the controller is in a
>>>>>>>>>> bad state. We should not
>>>>>>>>>> perform any further operation until the next
>>>>>>>>>> hard reset. We should flag
>>>>>>>>>> the controller as dead. I don't think we have the equivalent of the
>>>>>>>>>> host's HCD_FLAG_DEAD. It may require some
>>>>>>>>>> work in the UDC core. Perhaps
>>>>>>>>>> we can flag within dwc3 for now and prevent
>>>>>>>>>> any further operation for a
>>>>>>>>>> simpler fix.
>>>>>>>>>>
>>>>>>>>> Hi Thinh,
>>>>>>>>>
>>>>>>>>>      Are you referring that if __dwc3_set_mode
>>>>>>>>> failed with core soft reset
>>>>>>>>> timing out, the caller i.e., dwc3_set_mode who
>>>>>>>>> queues the work need to know
>>>>>>>>> that the operation actually failed. So we can
>>>>>>>>> add a flag to indicate that
>>>>>>>>> gadget is dead and the caller of dwc3_set_mode
>>>>>>>>> can check the flag to see if
>>>>>>>>> the operation is successful or not.
>>>>>>>>>
>>>>>>>>> Or am I misunderstanding your comment ?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Not just in __dwc3_set_mode(). I mean any time dwc3_core_soft_reset
>>>>>>>> fails, then we set this flag. So that it can prevent the user calling
>>>>>>>> any gadget ops causing more crashes/invalid behavior. The
>>>>>>>> dwc->softconnect is already wrong on pullup() on failure.
>>>>>>>>
>>>>>>>> So that we can have a check in different gadget ops. For pullup():
>>>>>>>>
>>>>>>>> static int dwc3_gadget_pullup() {
>>>>>>>>      if (dwc->udc_is_dead) {
>>>>>>>>          dev_err(dev, "reset me. x_x \n");
>>>>>>>>          return;
>>>>>>>>      }
>>>>>>>>
>>>>>>>>      abc();
>>>>>>>> }
>>>>>>>>
>>>>>>>> Perhaps the effort is probably the same if we
>>>>>>>> enhance the UDC core for
>>>>>>>> this? In any case, I'm fine either way.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Thinh
>>>>>>>
>>>>>>> Hi Thinh,
>>>>>>>
>>>>>>>     So you don't want UDC to retry pullup if it fails the
>>>>>>> first time ? As per
>>>>>>> patch-2 of this series, I was trying to propagate the
>>>>>>> EITMEDOUT to UDC so
>>>>>>> that the caller (most probably configfs) can take
>>>>>>> appropriate action as to
>>>>>>> whether it must retry pullup or not.
>>>>>>>
>>>>>>
>>>>>> Now I'm confused. If the soft-reset times out, that means that the
>>>>>> soft-reset (self-clearing) bit isn't cleared. How can we retry if it's
>>>>>> stuck in this state? My impression is that soft-reset would
>>>>>> not complete
>>>>>> at all. Is that not the case for you, or it's simply because we need a
>>>>>> longer soft-reset timeout?
>>>>>>
>>>>>> Thanks,
>>>>>> Thinh
>>>>>
>>>>> Hi Thinh,
>>>>>
>>>>>     Sorry for not being clear. The intention of patch-1 was to
>>>>> ensure we don't
>>>>> start the controller if reset times out and patch-2 was to
>>>>> ensure that UDC
>>>>> is in sync with controller by understanding that gadget_connect
>>>>> has failed
>>>>> and necessary cleanup has to be done in gadget_bind_driver.
>>>>
>>>> That should still be there.
>>>>
>>>>>
>>>>> But usually since the UDC_store is the one that is causing pullup to be
>>>>> called, the error value is propagated back to UDC_store. If it sees a
>>>>> failure, it does a retry to pullup.
>>>>>
>>>>> I didn't check  whether subsequent retries by UDC to pullup are helping
>>>>> clear the reset bit or not. But I thought retrying pullup won't
>>>>> be of any
>>>>> harm.
>>>>
>>>> It's fine to retry. I'm thinking that the controller is in a bad state
>>>> at this point that we can't recover (hopefully that's not the case).
>>>>
>>>>>
>>>>> I now get that my patches are incomplete w.r.t handling the timeout.
>>>>>
>>>>> IIRC one of the following is what you are suggesting we need to do:
>>>>>
>>>>> Option-1:
>>>>> Set a flag when reset times out and clear it upon core_exit /
>>>>> core_init. If
>>>>> the flag is set, block calls to all the gadget_ops in dwc3.
>>>>> Basically even
>>>>> if retry happens from configfs/UDC, we bail out in pullup/udc_start even
>>>>> without trying the requested gadget operation.
>>>>>
>>>>> Option-2:
>>>>> If gadget_connect fails with -ETIMEDOUT in UDC, handle the failure and
>>>>> implement the same flag in UDC and don't even call any gadget_ops.
>>>>>
>>>>
>>> Hi Thinh,
>>>
>>> Thanks for the review.
>>>
>>>> I'm thinking of option-1. For option-2, we can't control if the
>>>> gadget_ops will be called. We only have control how we will respond in
>>>> case they get called again.
>>>>
>>>> But now I'm thinking again, I think it may be ok without adding the
>>>> flag. The UDC core and gadget driver won't do anything else before
>>>> pullup(1) is successful. Calling other gadget_ops should be harmless
>>>> (ie. it won't crash/break the system)?
>>>>
>>> I can give this a try in long run testing (For 7-14 days) and see if
>>> anything else is breaking.
>>>
>>> Most probably we do a composition switch / PIPO in between which can
>>> call usb_gadget_unregister_driver which might invoke a pullup(0)
>>> followed by udc_stop() and like you mentioned must not be a problem.
>>>
>>>> Sorry for the noise, but I think it may be ok without marking the
>>>> controller dead. I wonder if we can confirm my suspiction on retry? I
>>>> believe this is not easy to reproduce on your setup? If not, I think we
>>>> can take your change as is.
>>
>> Hi Thinh,
>>
>>    I got this patch tested on two diff Gen-2 targets for around 10 days and
>> no issues were seen. (No SMMU fault seen on a 10 day run). Let me know of
>> any other concerns that might come up with this patch. Else I can rebase it
>> to get merged.
>>
>> Regards,
>> Krishna,
> 
> Thanks for the tests. So you were able to observe the failure and able
> to recover from it without SMMU fault right?
> 
Hi Thinh,

   Ideally if the issue didn't come up for more than a week straight, I 
believe that we are able to recover from the issue.

Regards,
Krishna,

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ