linux-kernel - RE: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI commmand timeout

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <HE1PR0401MB2028787DE01C0C8E79520414E39A0@HE1PR0401MB2028.eurprd04.prod.outlook.com>
Date:	Fri, 1 Apr 2016 03:55:11 +0000
From:	Rajesh Bhagat <rajesh.bhagat@....com>
To:	Mathias Nyman <mathias.nyman@...ux.intel.com>
CC:	"gregkh@...uxfoundation.org" <gregkh@...uxfoundation.org>,
	"linux-usb@...r.kernel.org" <linux-usb@...r.kernel.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Sriram Dash <sriram.dash@....com>
Subject: RE: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI
 commmand timeout



> -----Original Message-----
> From: Mathias Nyman [mailto:mathias.nyman@...ux.intel.com]
> Sent: Thursday, March 31, 2016 8:07 PM
> To: Rajesh Bhagat <rajesh.bhagat@....com>
> Cc: gregkh@...uxfoundation.org; linux-usb@...r.kernel.org; linux-
> kernel@...r.kernel.org; Sriram Dash <sriram.dash@....com>
> Subject: Re: [PATCH] usb: xhci: Fix incomplete PM resume operation due to XHCI
> commmand timeout
> 
> On 31.03.2016 06:51, Rajesh Bhagat wrote:
> >
> >
> >
> > Hello Mathias,
> >
> > Thanks a lot for your support.
> >
> > Attached patch is working and allows the completion handler to be
> > called. And resume operation completes and shell comes back (but with lot of
> errors).
> >
> > Now, some other points which needs our attention here are:
> > 1. usb core hub code is not checking the error status of hcd->driver-
> >reset_device(xhci_discover_or_reset_device)
> >      and continues considering reset_device was successful (and causes other issues).
> Hence, a check is needed on return
> >      value of reset_device and proper action is required on it.
> > 2. Even if above return value is checked and reset_device status is used. USB core
> hub retries for N times which is not
> >      required in this case as adding to the resume operation time. But if in this case
> we return -ENOTCONN instead of -EINVAL
> >      from hcd->driver->reset_device(xhci_discover_or_reset_device), code returns
> early and doesn't retry.
> >
> > Proposed Solution for above with your suggested patches is as below:
> >
> > diff --git a/drivers/usb/core/hub.c b/drivers/usb/core/hub.c index
> > 7cad1fa..efeba31 100644
> > --- a/drivers/usb/core/hub.c
> > +++ b/drivers/usb/core/hub.c
> > @@ -2807,13 +2807,17 @@ done:
> >                          struct usb_hcd *hcd = bus_to_hcd(udev->bus);
> >
> >                          update_devnum(udev, 0);
> > -                       /* The xHC may think the device is already reset,
> > -                        * so ignore the status.
> > +                       /*
> > +                        * check the status of resetting the device and update
> > +                        * the udev status.
> >                           */
> >                          if (hcd->driver->reset_device)
> > -                               hcd->driver->reset_device(hcd, udev);
> > +                               status =
> > + hcd->driver->reset_device(hcd, udev);
> >
> > -                       usb_set_device_state(udev, USB_STATE_DEFAULT);
> > +                       if (status == 0)
> > +                               usb_set_device_state(udev, USB_STATE_DEFAULT);
> > +                       else
> > +                               usb_set_device_state(udev,
> > + USB_STATE_NOTATTACHED);
> >                  }
> >          } else {
> >                  if (udev)
> > diff --git a/drivers/usb/host/xhci-ring.c
> > b/drivers/usb/host/xhci-ring.c index b3a0a22..9427175f 100644
> > --- a/drivers/usb/host/xhci-ring.c
> > +++ b/drivers/usb/host/xhci-ring.c
> > @@ -310,6 +310,10 @@ static int xhci_abort_cmd_ring(struct xhci_hcd *xhci)
> >                  return -ESHUTDOWN;
> >          }
> >
> > +       /* writing the CMD_RING_ABORT bit should create a command completion
> > +        * event, add a command completion timeout for it as well
> > +        */
> > +       mod_timer(&xhci->cmd_timer, jiffies +
> > + XHCI_CMD_DEFAULT_TIMEOUT);
> >          return 0;
> >   }
> >
> > @@ -1243,6 +1247,7 @@ void xhci_handle_command_timeout(unsigned long data)
> >          int ret;
> >          unsigned long flags;
> >          u64 hw_ring_state;
> > +       u32 old_status;
> >          struct xhci_command *cur_cmd = NULL;
> >          xhci = (struct xhci_hcd *) data;
> >
> > @@ -1250,6 +1255,7 @@ void xhci_handle_command_timeout(unsigned long data)
> >          spin_lock_irqsave(&xhci->lock, flags);
> >          if (xhci->current_cmd) {
> >                  cur_cmd = xhci->current_cmd;
> > +               old_status = cur_cmd->status;
> >                  cur_cmd->status = COMP_CMD_ABORT;
> >          }
> >
> > @@ -1272,7 +1278,15 @@ void xhci_handle_command_timeout(unsigned long
> data)
> >          }
> >          /* command timeout on stopped ring, ring can't be aborted */
> >          xhci_dbg(xhci, "Command timeout on stopped ring\n");
> > -       xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
> > +
> > +       /* is this the second timeout for the same command? ring wont restart */
> > +       if (old_status == COMP_CMD_ABORT) {
> > +               xhci_err(xhci, "Abort timeout, Fail to restart cmd ring\n");
> > +               xhci_cleanup_command_queue(xhci);
> > +       /* everything else here to halt, cleanup, free etc kill xhc */
> > +       } else
> > +               xhci_handle_stopped_cmd_ring(xhci, xhci->current_cmd);
> > +
> >          spin_unlock_irqrestore(&xhci->lock, flags);
> >          return;
> >   }
> > diff --git a/drivers/usb/host/xhci.c b/drivers/usb/host/xhci.c index
> > c502c22..bd18966 100644
> > --- a/drivers/usb/host/xhci.c
> > +++ b/drivers/usb/host/xhci.c
> > @@ -3450,7 +3450,7 @@ int xhci_discover_or_reset_device(struct usb_hcd *hcd,
> struct usb_device *udev)
> >                  if (ret == 1)
> >                          return 0;
> >                  else
> > -                       return -EINVAL;
> > +                       return -ENOTCONN;     /* Don't retry */
> >          }
> >
> >          if (virt_dev->tt_info)
> >
> > Sample output after above patch (timer is set as wakeup source):
> >
> > root@...enix:~# echo mem > /sys/power/state
> > PM: Syncing filesystems ... done.
> > Freezing user space processes ... (elapsed 0.001 seconds) done.
> > Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
> > PM: suspend of devices complete after 155.658 msecs
> > PM: late suspend of devices complete after 1.594 msecs
> > PM: noirq suspend of devices complete after 1.496 msecs
> > PM: noirq resume of devices complete after 1.290 msecs
> > PM: early resume of devices complete after 1.250 msecs usb usb1: root
> > hub lost power or was reset usb usb2: root hub lost power or was reset
> > xhci-hcd xhci-hcd.0.auto: Abort timeout, Fail to restart cmd ring
> > xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
> > xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
> > xhci-hcd xhci-hcd.0.auto: Abort timeout, Fail to restart cmd ring
> > xhci-hcd xhci-hcd.0.auto: Error while assigning device slot ID
> > xhci-hcd xhci-hcd.0.auto: Max number of devices this xHCI host supports is 127.
> > PM: resume of devices complete after 25436.366 msecs     <= resume time is
> increased a lot even after using -ENOTCONN
> > Restarting tasks ...
> > usb 1-1: USB disconnect, device number 2 usb 1-1.2: USB disconnect,
> > device number 3 usb 2-1: USB disconnect, device number 2 done.
> > root@...enix:~#
> >
> > Please share your opinion on other changes for patch submission as well as resume
> time.
> >

> 
> I think more effort should be put into investigating why this happens in the first place.
> What is the root cause? why doesn't xhci start properly after resume in this case.
> 
Hello Mathias, 

I understand your point and would surely debug the root cause of the issue. But implementing the 
fallback mechanism in SW, if HW does not respond well seems to a better solution to me. 

Just to add, code prior to common implementation of xhci_handle_command_timeout was handling the above 
case and was __not__ dependent on HW. 

Please comment on the possibility of above fallback mechanism in XHCI SW and any side effects that you 
can foresee. 

> Optimizing resume time and error paths should be secondary here, resuming faster
> isn't really a matter when xhci is completely stuck.
> 

I agree on above point. 

- Rajesh Bhagat 

> -Mathias