[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20230426010406.d3v4tsubcf2dju6t@synopsys.com>
Date: Wed, 26 Apr 2023 01:04:08 +0000
From: Thinh Nguyen <Thinh.Nguyen@...opsys.com>
To: Krishna Kurapati PSSNV <quic_kriskura@...cinc.com>
CC: Thinh Nguyen <Thinh.Nguyen@...opsys.com>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Alan Stern <stern@...land.harvard.edu>,
Geert Uytterhoeven <geert+renesas@...der.be>,
Colin Ian King <colin.i.king@...il.com>,
Jiantao Zhang <water.zhangjiantao@...wei.com>,
"Rafael J . Wysocki" <rafael@...nel.org>,
"linux-usb@...r.kernel.org" <linux-usb@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"quic_ppratap@...cinc.com" <quic_ppratap@...cinc.com>,
"quic_wcheng@...cinc.com" <quic_wcheng@...cinc.com>,
"quic_jackp@...cinc.com" <quic_jackp@...cinc.com>,
"quic_ugoswami@...cinc.com" <quic_ugoswami@...cinc.com>
Subject: Re: [PATCH v2 1/2] usb: dwc3: gadget: Bail out in pullup if soft
reset timeout happens
On Wed, Apr 26, 2023, Krishna Kurapati PSSNV wrote:
>
>
> On 4/26/2023 5:52 AM, Thinh Nguyen wrote:
> > On Tue, Apr 25, 2023, Krishna Kurapati PSSNV wrote:
> > >
> > >
> > > On 4/6/2023 7:44 AM, Krishna Kurapati PSSNV wrote:
> > > >
> > > >
> > > > On 4/6/2023 6:15 AM, Thinh Nguyen wrote:
> > > > > On Wed, Apr 05, 2023, Krishna Kurapati PSSNV wrote:
> > > > > >
> > > > > >
> > > > > > On 4/5/2023 3:13 AM, Thinh Nguyen wrote:
> > > > > > > On Tue, Apr 04, 2023, Krishna Kurapati PSSNV wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > On 4/4/2023 5:19 AM, Thinh Nguyen wrote:
> > > > > > > > > On Thu, Mar 30, 2023, Krishna Kurapati PSSNV wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On 3/30/2023 5:40 AM, Thinh Nguyen wrote:
> > > > > > > > > > > On Wed, Mar 29, 2023, Krishna Kurapati PSSNV wrote:
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > On 3/29/2023 2:50 AM, Thinh Nguyen wrote:
> > > > > > > > > > > > > Hi,
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Tue, Mar 28, 2023, Krishna Kurapati wrote:
> > > > > > > > > > > > > > If the core soft reset timeout happens, avoid setting up event
> > > > > > > > > > > > > > buffers and starting gadget as the writes to these registers
> > > > > > > > > > > > > > may not reflect when in reset and setting the run stop bit
> > > > > > > > > > > > > > can lead the controller to access wrong event buffer address
> > > > > > > > > > > > > > resulting in a crash.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Signed-off-by: Krishna Kurapati <quic_kriskura@...cinc.com>
> > > > > > > > > > > > > > ---
> > > > > > > > > > > > > > drivers/usb/dwc3/gadget.c | 5 ++++-
> > > > > > > > > > > > > > 1 file changed, 4 insertions(+), 1 deletion(-)
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > diff --git
> > > > > > > > > > > > > > a/drivers/usb/dwc3/gadget.c
> > > > > > > > > > > > > > b/drivers/usb/dwc3/gadget.c
> > > > > > > > > > > > > > index 3c63fa97a680..f0472801d9a5 100644
> > > > > > > > > > > > > > --- a/drivers/usb/dwc3/gadget.c
> > > > > > > > > > > > > > +++ b/drivers/usb/dwc3/gadget.c
> > > > > > > > > > > > > > @@ -2620,13 +2620,16 @@ static
> > > > > > > > > > > > > > int dwc3_gadget_pullup(struct
> > > > > > > > > > > > > > usb_gadget *g, int is_on)
> > > > > > > > > > > > > > *
> > > > > > > > > > > > > > device-initiated disconnect
> > > > > > > > > > > > > > requires a core soft reset
> > > > > > > > > > > > > > * (DCTL.CSftRst)
> > > > > > > > > > > > > > before enabling the run/stop
> > > > > > > > > > > > > > bit.
> > > > > > > > > > > > > > */
> > > > > > > > > > > > > > - dwc3_core_soft_reset(dwc);
> > > > > > > > > > > > > > + ret = dwc3_core_soft_reset(dwc);
> > > > > > > > > > > > > > + if (ret)
> > > > > > > > > > > > > > + goto done;
> > > > > > > > > > > > > > dwc3_event_buffers_setup(dwc);
> > > > > > > > > > > > > > __dwc3_gadget_start(dwc);
> > > > > > > > > > > > > > ret = dwc3_gadget_run_stop(dwc, true, false);
> > > > > > > > > > > > > > }
> > > > > > > > > > > > > > +done:
> > > > > > > > > > > > > > pm_runtime_put(dwc->dev);
> > > > > > > > > > > > > > return ret;
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > 2.40.0
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think there's one more place that
> > > > > > > > > > > > > may needs this check. Can you also
> > > > > > > > > > > > > add this check in __dwc3_set_mode()?
> > > > > > > > > > > >
> > > > > > > > > > > > Hi Thinh,
> > > > > > > > > > > >
> > > > > > > > > > > > Sure. Will do it.
> > > > > > > > > > > > Will the below be good enough ? Or would
> > > > > > > > > > > > it be good to add an error/warn log
> > > > > > > > > > > > there>
> > > > > > > > > > >
> > > > > > > > > > > There's already a warning message in
> > > > > > > > > > > dwc3_core_soft_reset() if it fails.
> > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > kriskura@...kriskura-hyd:/local/mnt/workspace/krishna/skales2/skales/kernel$
> > > > > > > > > > > > git diff drivers/usb/
> > > > > > > > > > > > diff --git a/drivers/usb/dwc3/core.c b/drivers/usb/dwc3/core.c
> > > > > > > > > > > > index 476b63618511..8d1d213d1dcd 100644
> > > > > > > > > > > > --- a/drivers/usb/dwc3/core.c
> > > > > > > > > > > > +++ b/drivers/usb/dwc3/core.c
> > > > > > > > > > > > @@ -210,7 +210,9 @@ static void
> > > > > > > > > > > > __dwc3_set_mode(struct work_struct
> > > > > > > > > > > > *work)
> > > > > > > > > > > > }
> > > > > > > > > > > > break;
> > > > > > > > > > > > case DWC3_GCTL_PRTCAP_DEVICE:
> > > > > > > > > > > > - dwc3_core_soft_reset(dwc);
> > > > > > > > > > > > + ret = dwc3_core_soft_reset(dwc);
> > > > > > > > > > > > + if (ret)
> > > > > > > > > > > > + goto out;
> > > > > > > > > > > >
> > > > > > > > > > > > dwc3_event_buffers_setup(dwc);
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > If soft-reset failed, the controller is in a
> > > > > > > > > > > bad state. We should not
> > > > > > > > > > > perform any further operation until the next
> > > > > > > > > > > hard reset. We should flag
> > > > > > > > > > > the controller as dead. I don't think we have the equivalent of the
> > > > > > > > > > > host's HCD_FLAG_DEAD. It may require some
> > > > > > > > > > > work in the UDC core. Perhaps
> > > > > > > > > > > we can flag within dwc3 for now and prevent
> > > > > > > > > > > any further operation for a
> > > > > > > > > > > simpler fix.
> > > > > > > > > > >
> > > > > > > > > > Hi Thinh,
> > > > > > > > > >
> > > > > > > > > > Are you referring that if __dwc3_set_mode
> > > > > > > > > > failed with core soft reset
> > > > > > > > > > timing out, the caller i.e., dwc3_set_mode who
> > > > > > > > > > queues the work need to know
> > > > > > > > > > that the operation actually failed. So we can
> > > > > > > > > > add a flag to indicate that
> > > > > > > > > > gadget is dead and the caller of dwc3_set_mode
> > > > > > > > > > can check the flag to see if
> > > > > > > > > > the operation is successful or not.
> > > > > > > > > >
> > > > > > > > > > Or am I misunderstanding your comment ?
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Not just in __dwc3_set_mode(). I mean any time dwc3_core_soft_reset
> > > > > > > > > fails, then we set this flag. So that it can prevent the user calling
> > > > > > > > > any gadget ops causing more crashes/invalid behavior. The
> > > > > > > > > dwc->softconnect is already wrong on pullup() on failure.
> > > > > > > > >
> > > > > > > > > So that we can have a check in different gadget ops. For pullup():
> > > > > > > > >
> > > > > > > > > static int dwc3_gadget_pullup() {
> > > > > > > > > if (dwc->udc_is_dead) {
> > > > > > > > > dev_err(dev, "reset me. x_x \n");
> > > > > > > > > return;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > abc();
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > Perhaps the effort is probably the same if we
> > > > > > > > > enhance the UDC core for
> > > > > > > > > this? In any case, I'm fine either way.
> > > > > > > > >
> > > > > > > > > Thanks,
> > > > > > > > > Thinh
> > > > > > > >
> > > > > > > > Hi Thinh,
> > > > > > > >
> > > > > > > > So you don't want UDC to retry pullup if it fails the
> > > > > > > > first time ? As per
> > > > > > > > patch-2 of this series, I was trying to propagate the
> > > > > > > > EITMEDOUT to UDC so
> > > > > > > > that the caller (most probably configfs) can take
> > > > > > > > appropriate action as to
> > > > > > > > whether it must retry pullup or not.
> > > > > > > >
> > > > > > >
> > > > > > > Now I'm confused. If the soft-reset times out, that means that the
> > > > > > > soft-reset (self-clearing) bit isn't cleared. How can we retry if it's
> > > > > > > stuck in this state? My impression is that soft-reset would
> > > > > > > not complete
> > > > > > > at all. Is that not the case for you, or it's simply because we need a
> > > > > > > longer soft-reset timeout?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Thinh
> > > > > >
> > > > > > Hi Thinh,
> > > > > >
> > > > > > Sorry for not being clear. The intention of patch-1 was to
> > > > > > ensure we don't
> > > > > > start the controller if reset times out and patch-2 was to
> > > > > > ensure that UDC
> > > > > > is in sync with controller by understanding that gadget_connect
> > > > > > has failed
> > > > > > and necessary cleanup has to be done in gadget_bind_driver.
> > > > >
> > > > > That should still be there.
> > > > >
> > > > > >
> > > > > > But usually since the UDC_store is the one that is causing pullup to be
> > > > > > called, the error value is propagated back to UDC_store. If it sees a
> > > > > > failure, it does a retry to pullup.
> > > > > >
> > > > > > I didn't check whether subsequent retries by UDC to pullup are helping
> > > > > > clear the reset bit or not. But I thought retrying pullup won't
> > > > > > be of any
> > > > > > harm.
> > > > >
> > > > > It's fine to retry. I'm thinking that the controller is in a bad state
> > > > > at this point that we can't recover (hopefully that's not the case).
> > > > >
> > > > > >
> > > > > > I now get that my patches are incomplete w.r.t handling the timeout.
> > > > > >
> > > > > > IIRC one of the following is what you are suggesting we need to do:
> > > > > >
> > > > > > Option-1:
> > > > > > Set a flag when reset times out and clear it upon core_exit /
> > > > > > core_init. If
> > > > > > the flag is set, block calls to all the gadget_ops in dwc3.
> > > > > > Basically even
> > > > > > if retry happens from configfs/UDC, we bail out in pullup/udc_start even
> > > > > > without trying the requested gadget operation.
> > > > > >
> > > > > > Option-2:
> > > > > > If gadget_connect fails with -ETIMEDOUT in UDC, handle the failure and
> > > > > > implement the same flag in UDC and don't even call any gadget_ops.
> > > > > >
> > > > >
> > > > Hi Thinh,
> > > >
> > > > Thanks for the review.
> > > >
> > > > > I'm thinking of option-1. For option-2, we can't control if the
> > > > > gadget_ops will be called. We only have control how we will respond in
> > > > > case they get called again.
> > > > >
> > > > > But now I'm thinking again, I think it may be ok without adding the
> > > > > flag. The UDC core and gadget driver won't do anything else before
> > > > > pullup(1) is successful. Calling other gadget_ops should be harmless
> > > > > (ie. it won't crash/break the system)?
> > > > >
> > > > I can give this a try in long run testing (For 7-14 days) and see if
> > > > anything else is breaking.
> > > >
> > > > Most probably we do a composition switch / PIPO in between which can
> > > > call usb_gadget_unregister_driver which might invoke a pullup(0)
> > > > followed by udc_stop() and like you mentioned must not be a problem.
> > > >
> > > > > Sorry for the noise, but I think it may be ok without marking the
> > > > > controller dead. I wonder if we can confirm my suspiction on retry? I
> > > > > believe this is not easy to reproduce on your setup? If not, I think we
> > > > > can take your change as is.
> > >
> > > Hi Thinh,
> > >
> > > I got this patch tested on two diff Gen-2 targets for around 10 days and
> > > no issues were seen. (No SMMU fault seen on a 10 day run). Let me know of
> > > any other concerns that might come up with this patch. Else I can rebase it
> > > to get merged.
> > >
> > > Regards,
> > > Krishna,
> >
> > Thanks for the tests. So you were able to observe the failure and able
> > to recover from it without SMMU fault right?
> >
> Hi Thinh,
>
> Ideally if the issue didn't come up for more than a week straight, I
> believe that we are able to recover from the issue.
>
> Regards,
> Krishna,
Hi Krishna,
I wanted to confirm whether the controller is dead and whether any
attempt to access the controller at this point can be fatal to the
system (such as SMMU fault). I believe your patches prevents SMMU fault
on first failure. I'm not sure what will happen after if we attempt a
retry.
Regardless, your changes are good for now. If we need to make further
enhancement in the future, we'll do that later.
Many thanks for the tests,
Thinh
Powered by blists - more mailing lists