netdev - RE: transmit lockup using smsc95xx ethernet on usb3

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Thu, 24 Oct 2013 16:05:28 +0100
From:	"David Laight" <David.Laight@...LAB.COM>
To:	"Sarah Sharp" <sarah.a.sharp@...ux.intel.com>
Cc:	<netdev@...r.kernel.org>, <linux-usb@...r.kernel.org>,
	"Xenia Ragiadakou" <burzalodowa@...il.com>
Subject: RE: transmit lockup using smsc95xx ethernet on usb3

> Have you tried the latest stable kernel or the latest -rc kernel?

I've built a kernel based on Linus's tree from last Friday, 3.12-rc6 ish.
Commented out the trace for short reads - happens all the time.

I've not seen an error on a Bo yet, the failure rate is depressingly low.
I have had an error that started with an interrupt completion (I think
from the root), that leads to a full unplug-replug sequence that
overfilled the kernel message buffer (rebuilt with a much bigger buffer).

I've just got a error -71 from a Bi. This generates:
(There are actually two separated by a Bo completion and setup.

[175908.563068] xhci_hcd 0000:00:14.0: Transfer error on endpoint
[175908.563070] xhci_hcd 0000:00:14.0: Cleaning up stalled endpoint ring
[175908.563072] xhci_hcd 0000:00:14.0: Finding segment containing stopped TRB.
[175908.563074] xhci_hcd 0000:00:14.0: Finding endpoint context
[175908.563076] xhci_hcd 0000:00:14.0: Finding segment containing last TRB in TD.
[175908.563078] xhci_hcd 0000:00:14.0: Cycle state = 0x1
[175908.563081] xhci_hcd 0000:00:14.0: New dequeue segment = ffff8800d6a817e0 (virtual)
[175908.563089] xhci_hcd 0000:00:14.0: New dequeue pointer = 0x2137a4b90 (DMA)
[175908.563096] xhci_hcd 0000:00:14.0: Queueing new dequeue state
[175908.563104] xhci_hcd 0000:00:14.0: Set TR Deq Ptr cmd, new deq seg = ffff8800d6a817e0 (0x2137a4800 dma), new deq ptr = ffff8802137a4b90 (0x2137a4b90 dma), new cycle = 1
[175908.563110] xhci_hcd 0000:00:14.0: // Ding dong!
[175908.563119] xhci_hcd 0000:00:14.0: Giveback URB ffff8802114fd9c0, len = 0, expected = 18944, status = -71
[175908.563122] xhci_hcd 0000:00:14.0: Ignoring reset ep completion code of 1
[175908.563125] xhci_hcd 0000:00:14.0: Successful Set TR Deq Ptr cmd, deq = @2137a4b91

What is interesting is that the usbmon trace then shows only 2 Bi URB being used
(rather than 4). After 125ms (assuming the usbmon timestamps are us)
another 2 URB are added.
Since the URB are usually recycled (or at least freed and immediately realloced
so getting the same address) I wonder if this is a memory leak?
In any case waiting that long before adding the URB back doesn't seem right.

I don't think I've seen an error that only affected the Bi side before, so
don't know whether the recover behaviour has changed.

FWIW we have identified something 'sub-optimal' with the pcb layout that might
be responsible for noise on the USB data/clock source. However the xhci driver
should recover from such errors.
If the hw guys fix the pcb I'm going to need to keep a failing system!

	David

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html