[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20131017000758.GL7082@xanatos>
Date: Wed, 16 Oct 2013 17:07:58 -0700
From: Sarah Sharp <sarah.a.sharp@...ux.intel.com>
To: David Laight <David.Laight@...LAB.COM>
Cc: netdev@...r.kernel.org, linux-usb@...r.kernel.org,
Xenia Ragiadakou <burzalodowa@...il.com>
Subject: Re: transmit lockup using smsc95xx ethernet on usb3
On Tue, Oct 15, 2013 at 10:59:18AM +0100, David Laight wrote:
> We are seeing complete lockups of the transmit side when using
> the smsc95xx driver connected to a USB3 port on an i7 (Ivybridge) cpu.
> These errors are very intermittent - less than once a day, and
> it isn't actually clear that they are related to traffic load.
>
> Most of the systems are running the 3.2 kernel from Ubuntu 12.04
> but I've seen the same problem when running a 3.4 kernel.
Have you tried the latest stable kernel or the latest -rc kernel?
> Looking at the changelog for xhci-ring.c I can see that some
> 'nasty' bugs were fixed between 3.2 and 3.4 (and possibly since)
> but the usbmon trace I've now got doesn't seem to match any
> of the changelog entries.
>
> We are also seeing similar problems if we connect to a USB2
> header.
Do you mean a USB 2.0 port under the xHCI host controller? What does
`lsusb -t` show as the parent host controller in that case?
> Since we can't reproduce the problem quickly it is difficult to
> do any analysis. Any suggestions for increasing the error rate
> would be welcome.
>
> Below is an annotated extract from a usbmon trace while running
> a netperf test that was sending 8192 byte TCP packets (nagle off).
> I've deleted the Bi entries (packets are received throughout)
> and numbered all the others (modulo 10000) so it is easier to
> see when the requests complete, I've also calculated the elapsed
> time and the number of Setup entries between the S and C traces.
>
> The USB ring seems to have 60 outstanding transmits in it,
> each time one completes another is sent. There are a few 10000
> traces of that then:
>
> start:9870 ffff88020ea16000 293811125 S Bo:3:003:2 -115 1514 =
> e2350000 e2450000 22003200 00224d98
> d8460002 1f0057d7 08004500 05d0ff11
> done:9811:6969:60 ffff88020c7c8000 293811236 C Bo:3:003:2 0 1090 >
> start:9871 ffff88020ea16a80 293811242 S Bo:3:003:2 -115 1090 =
> 3a340000 3a440000 22003200 00224d98
> d8460002 1f0057d7 08004500 0428ff12
> ...
> start:9929 ffff88020ea16780 293817964 S Bo:3:003:2 -115 1514 =
> e2350000 e2450000 22003200 00224d98
> d8460002 1f0057d7 08004500 05d0ff4c
> Last successful completion.
> done:9870:6968:60 ffff88020ea16000 293818093 C Bo:3:003:2 0 1514 >
> start:9930 ffff88020ea16000 293818099 S Bo:3:003:2 -115 1514 =
> e2350000 e2450000 22003200 00224d98
> d8460002 1f0057d7 08004500 05d0ff4d
>
> At this point something (untraced) seems to have gone horribly
> wrong in the transmit ring. dmesg shows nothing.
> Two Bo 'fail -71', 6 succeed, one fails -32 the rest fail -104.
> done:9871:6913:60 ffff88020ea16a80 293818155 C Bo:3:003:2 -71 512 >
> done:9872:6927:59 ffff88020ea16f00 293818235 C Bo:3:003:2 -71 0
> done:9873:6875:58 ffff88020ea16480 293818313 C Bo:3:003:2 0 1514 >
> done:9874:6786:57 ffff88020c7c83c0 293818353 C Bo:3:003:2 0 1514 >
> done:9875:6794:56 ffff88020c7c80c0 293818470 C Bo:3:003:2 0 1514 >
> done:9876:6789:55 ffff88020c7c8e40 293818589 C Bo:3:003:2 0 1514 >
> done:9877:6775:54 ffff88020c7c8240 293818702 C Bo:3:003:2 0 1090 >
> done:9878:6751:53 ffff88020c7c8180 293818803 C Bo:3:003:2 0 1514 >
> done:9879:6735:52 ffff88020c7c89c0 293818885 C Bo:3:003:2 -32 0
> done:9880:6671:51 ffff88020c7c8900 293818925 C Bo:3:003:2 -104 0
> ...
> done:9927:1292:4 ffff88020cf0c480 293819015 C Bo:3:003:2 -104 0
> done:9928:1170:3 ffff88020ea160c0 293819016 C Bo:3:003:2 -104 0
> Something is known to be wrong...
> start:9931 ffff88020ea160c0 293819037 S Co:3:003:0
> s 02 01 0000 0002 0000 0
> done:9929:1080:3 ffff88020ea16780 293819044 C Bo:3:003:2 -104 0
> done:9930:945:2 ffff88020ea16000 293819044 C Bo:3:003:2 -104 0
> done:9931:48:1 ffff88020ea160c0 293819085 C Co:3:003:0 0 0
>
> These 10 transmits never finish:
> start:9932 ffff88020ea160c0 293819098 S Bo:3:003:2 -115 1090 =
> 3a340000 3a440000 22003200 00224d98
> d8460002 1f0057d7 08004500 0428ff4e
> ... 9933 to 9940 deleted
> start:9941 ffff88020ea16b40 293819111 S Bo:3:003:2 -115 1514 =
> e2350000 e2450000 22003200 00224d98
> d8460002 1f0057d7 08004500 05d0ff57
>
> All further transmits fail immediately E -12 and generate the
> 'xhci_hcd 0000:00:14.0: ERROR no room on ep ring' message.
> (There are 1070 'E' traces and 1070 'no room' messages.)
> Receives are still working.
> start:9942 ffff88020ea16240 293819113 S Bo:3:003:2 -115 1514 =
> e2350000 e2450000 22003200 00224d98
> d8460002 1f0057d7 08004500 05d0ff58
> done:9942:1550:1 ffff88020ea16240 293820663 E Bo:3:003:2 -12 0
> start:9943 ffff88020ea16240 293820675 S Bo:3:003:2 -115 1514 =
> e2350000 e2450000 22003200 00224d98
> d8460002 1f0057d7 08004500 05d0ff59
> done:9943:1507:1 ffff88020ea16240 293822182 E Bo:3:003:2 -12 0
>
> Eventually something causes a device remove and insert - everything re-initialises.
> This is over 12 hours later.
> done:unknown ffff88020c8570c0 3637139297 C Ii:3:001:1 0:2048 1 = 02
> start:1015 ffff88020c8570c0 3637139302 S Ii:3:001:1 -115:2048 4 <
> start:1016 ffff88020cbeb300 3637139323 S Ci:3:001:0
> s a3 00 0000 0001 0004 4 <
> ffff88020ea16240 3637139331 C Bi:3:003:1 -71 0
> done:1016:9:1 ffff88020cbeb300 3637139332 C Ci:3:001:0 0 4 = 00010100
> start:1017 ffff88020cbeb300 3637139334 S Co:3:001:0
> s 23 01 0010 0001 0000 0
> done:1017:4:1 ffff88020cbeb300 3637139338 C Co:3:001:0 0 0
> done:unknown ffff88020ca9ae40 3637139423 C Ii:3:003:3 -71:1 0
> ffff88020c9db540 3637139428 C Bi:3:003:1 -108 0
> ffff88020c9db780 3637139430 C Bi:3:003:1 -108 0
> ffff88020d8bb540 3637139431 C Bi:3:003:1 -108 0
>
> The last 10 transmits then terminate with error -108:
> done:9932:xxxx ffff88020ea160c0 3637139462 C Bo:3:003:2 -108 0
> ... 9933 to 9940 deleted
> done:9941:xxxx ffff88020ea16b40 3637139482 C Bo:3:003:2 -108 0
> done:1015:21090:3 ffff88020c8570c0 3637160392 C Ii:3:001:1 0:2048 1 = 02
> start:1018 ffff88020c8570c0 3637160396 S Ii:3:001:1 -115:2048 4 <
> done:unknown ffff88020cbf26c0 3637176790 C Ii:3:005:1 -108:8 0
> done:unknown ffff88020c68aa80 3637622497 C Ii:3:002:1 -108:2048 0
>
> Followed by lots of Ci/Co and eventually it all starts working again.
>
> I've not yet tried to look up the control transfers.
>
> These aren't the only errors we are seeing, we also see (separately):
> [21549.917529] hub 3-2:1.0: port 1 disabled by hub (EMI?), re-enabling...
> [ 5822.629579] NETDEV WATCHDOG: eth0 (smsc95xx): transmit queue 0 timed out
> [ 7263.834404] hid-generic 0003:413C:2005.0002: can't reset device, 0000:00:1a.0-1.4.3/input0, status -71 (connected to a USB2 header).
> These all cause a USB bus reset and everything recovers within a couple of seconds.
I would suggest you try with the latest stable kernel, with
CONFIG_USB_DEBUG and CONFIG_USB_XHCI_HCD_DEBUGGING enabled. If you try
the latest 3.12-rc, you will only need the CONFIG_USB_DEBUG. Or, if
that output is too much (it will spew on short packets, which may be an
issue with your ethernet adapter), with the 3.12-rc kernel, you can turn
off CONFIG_USB_DEBUG and capture command trace events through the xHCI
trace infrastructure. Xenia can help you with that if necessary (I'm
going to be at a conference starting Friday).
Without that dmesg, I really can't tell what's happening at an xHCI
level.
Sarah Sharp
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists