linux-kernel - Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a USB payload burst

Message-ID: <20140107212101.GA4199@xanatos>
Date:	Tue, 7 Jan 2014 13:21:01 -0800
From:	Sarah Sharp <sarah.a.sharp@...ux.intel.com>
To:	walt <w41ter@...il.com>
Cc:	Alan Stern <stern@...land.harvard.edu>,
	Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	stable@...r.kernel.org, David Laight <david.laight@...lab.com>,
	linux-usb@...r.kernel.org, linux-scsi@...r.kernel.org
Subject: Re: [PATCH 3.12 033/118] usb: xhci: Link TRB must not occur within a
 USB payload burst

On Tue, Jan 07, 2014 at 05:29:48AM -0800, walt wrote:
> On 01/06/2014 04:31 PM, Sarah Sharp wrote:
> > Hi Walt,
> > 
> > I have a couple of patches for you to test.
> 
> > Please only apply the first patch (which is diagnostic only), trigger
> > your issue, and send me the resulting dmesg.  Then try applying the
> > other two patches, and see if the issue goes away.  (I suspect it won't
> > but I can't be sure.)
> 
> Thanks Sarah.  dmesg0 is from the diagnostic patch only.  dmesg1 has all
> three patches applied.  Some of the messages in dmesg1 fell off the end of
> the kernel buffer, so I may need to make the buffer larger next time but
> I'll need a reminder of how to do it.

Set CONFIG_LOG_BUF_SHIFT to 21.
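
For reference, CONFIG_LOG_BUF_SHIFT is the base-2 logarithm of the kernel log buffer size in bytes, so a value of 21 gives a 2 MiB buffer. A minimal sketch of the arithmetic (the helper name `log_buf_bytes` is made up for illustration and is not a kernel interface):

```python
def log_buf_bytes(shift: int) -> int:
    """Kernel log buffer size in bytes for a given CONFIG_LOG_BUF_SHIFT."""
    return 1 << shift


size = log_buf_bytes(21)
print(f"CONFIG_LOG_BUF_SHIFT=21 -> {size} bytes ({size // (1024 * 1024)} MiB)")
```

If rebuilding the kernel is inconvenient, the `log_buf_len=` boot parameter should also be able to enlarge the buffer at boot time.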

> As you suspected, the patches didn't fix the problem, sorry.

Yep, I thought so.  I did glean one bit of information from the logs: it
seems that your host does handle no-op TRBs, at least for a while.
However, after a bigger chunk of TRBs, it goes off into la-la-land.

Assuming one of the rings consists of two segments:
0xbb711000 (start)
0xbb7113f0 (end)
0xbb711400 (start)
0xbb7117f0 (end)

The log shows no-ops were inserted at:
0xbb7207d0
0xbb7206a0
0xbb720be0
0xbb720be0
0xbb720bd0
0xbb7207e0
0xbb711370 = 8 no-ops
0xbb7117c0 = 3
0xbb7113b0 = 4
0xbb7113a0 = 5
0xbb7117d0 = 2
0xbb711340 = 11
0xbb711770 = 8
0xbb711230 = 28
0xbb7117e0 = 1
0xbb7117b0 = 4
0xbb7113d0 = 2
0xbb7117b0 = 4
0xbb711340 = 11
0xbb711690 = 22

So the host was able to process 28 no-op TRBs, but failed on 22 no-ops
later.  The event ring debugging shows the last event was for
0xbb711680, which is the last TRB before the first no-op inserted before
the host died.  There's no Stop Endpoint Command completion, and it
looks like the command was correctly put on the command ring, so it
seems the host is actually hanging for some reason.
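
The no-op counts in the list above can be cross-checked with a little address arithmetic. The sketch below assumes that each "(end)" address is the link TRB closing its segment and that every TRB is 16 bytes (the xHCI TRB size); the function and variable names are made up for illustration and do not come from the driver.

```python
TRB_SIZE = 16  # bytes per TRB (assumption: standard xHCI TRB size)

# (first TRB, link TRB) for the two segments listed above.
# Assumption: the "(end)" address is the link TRB of each segment.
SEGMENTS = [
    (0xBB711000, 0xBB7113F0),
    (0xBB711400, 0xBB7117F0),
]

# (address of first no-op, no-op count) as reported in the log.
LOGGED = [
    (0xBB711370, 8), (0xBB7117C0, 3), (0xBB7113B0, 4), (0xBB7113A0, 5),
    (0xBB7117D0, 2), (0xBB711340, 11), (0xBB711770, 8), (0xBB711230, 28),
    (0xBB7117E0, 1), (0xBB7117B0, 4), (0xBB7113D0, 2), (0xBB7117B0, 4),
    (0xBB711340, 11), (0xBB711690, 22),
]


def noops_before_link(addr: int) -> int:
    """Count the TRB slots from addr up to (not including) the link TRB."""
    for first, link in SEGMENTS:
        if first <= addr < link:
            return (link - addr) // TRB_SIZE
    raise ValueError(f"{addr:#x} is not inside a known segment")


def previous_trb(addr: int) -> int:
    """Address of the TRB immediately before addr in the same segment."""
    return addr - TRB_SIZE


for addr, count in LOGGED:
    status = "ok" if noops_before_link(addr) == count else "MISMATCH"
    print(f"{addr:#x}: logged {count}, computed {noops_before_link(addr)} {status}")

print(f"TRB before the final no-op run: {previous_trb(0xBB711690):#x}")
```

Under those assumptions every logged count equals the distance to the link TRB, and the TRB just before the final run of 22 no-ops is 0xbb711680, the address of the last event seen. The six 0xbb7207xx/0xbb720bxx addresses belong to a different ring and are left out.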

Unfortunately, I made a mistake in the debugging patch I sent
you, so it didn't print out the endpoint rings when the host died.  I
need that info to see whether the link TRB was still intact, or if we
overwrote it and caused the host to go fetch some invalid memory.

Can you please try the attached patch, on top of the previous three
patches, and send me dmesg?

> I find that I can tell in advance whether the copy is going to succeed,
> just by watching the light flicker on the usb3 drive.  When the flicker
> is absolutely regular, with no variation whatever, I can tell in 10 or
> 15 seconds that the copy will fail.
> 
> At the same time the light on the main drive goes dark after 10 seconds,
> implying that the usb3 drive stops receiving any data from the main drive
> after 10 seconds, yet the light on the usb3 drive continues to flicker as
> if writing data -- even after the cp officially fails.  The light on the
> usb3 drive never stops flickering until I reboot the machine or unplug
> the usb cable.

Interesting.  Without a USB analyzer, we can't really tell what's
happening.  However, one hypothesis could be that the blinking light is
triggered by an active SCSI command (read/write, etc).

There are three phases of the command: setup, data, and status.  I think
your device is getting the setup phase, and the host is dying before it
sends the data phase.  If the light blinks when it gets a setup phase,
and turns off when the device sends a status phase, that would explain
its behavior.

But that's just a hypothesis, I have no idea whether it's correct.

Sarah Sharp

View attachment "0001-More-debugging.patch" of type "text/x-diff" (2417 bytes)