netdev - transmit lockup using smsc95xx ethernet on usb3

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Date:	Tue, 15 Oct 2013 10:59:18 +0100
From:	"David Laight" <David.Laight@...LAB.COM>
To:	<netdev@...r.kernel.org>, <linux-usb@...r.kernel.org>
Subject: transmit lockup using smsc95xx ethernet on usb3

We are seeing complete lockups of the transmit side when using
the smsc95xx driver connected to a USB3 port on an i7 (Ivybridge) cpu.
These errors are very intermittent - less than once a day, and
it isn't actually clear that they are related to traffic load.

Most of the systems are running the 3.2 kernel from Ubuntu 12.04
but I've seen the same problem when running a 3.4 kernel.
Looking at the changelog for xhci-ring.c I can see that some
'nasty' bugs were fixed between 3.2 and 3.4 (and possibly since)
but the usbmon trace I've now got doesn't seem to match any
of the changelog entries.

We are also seeing similar problems if we connect to a USB2
header.

Since we can't reproduce the problem quickly it is difficult to
do any analysis. Any suggestions for increasing the error rate
would be welcome.

Below is an annotated extract from a usbmon trace while running
a netperf test that was sending 8192 byte TCP packets (nagle off).
I've deleted the Bi entries (packets are received throughout)
and numbered all the others (modulo 10000) so it is easier to
see when the requests complete, I've also calculated the elapsed
time and the number of Setup entries between the S and C traces.

The USB ring seems to have 60 outstanding transmits in it,
each time one completes another is sent. There are a few 10000
traces of that then:

  start:9870         ffff88020ea16000 293811125 S Bo:3:003:2 -115 1514 =
                        e2350000 e2450000 22003200 00224d98
                        d8460002 1f0057d7 08004500 05d0ff11
   done:9811:6969:60 ffff88020c7c8000 293811236 C Bo:3:003:2 0 1090 >
  start:9871         ffff88020ea16a80 293811242 S Bo:3:003:2 -115 1090 =
                        3a340000 3a440000 22003200 00224d98
                        d8460002 1f0057d7 08004500 0428ff12
...
  start:9929         ffff88020ea16780 293817964 S Bo:3:003:2 -115 1514 =
                        e2350000 e2450000 22003200 00224d98
                        d8460002 1f0057d7 08004500 05d0ff4c
Last successful completion.
   done:9870:6968:60 ffff88020ea16000 293818093 C Bo:3:003:2 0 1514 >
  start:9930         ffff88020ea16000 293818099 S Bo:3:003:2 -115 1514 =
                        e2350000 e2450000 22003200 00224d98
                        d8460002 1f0057d7 08004500 05d0ff4d

At this point something (untraced) seems to have gone horribly
wrong in the transmit ring. dmesg shows nothing.
Two Bo 'fail -71', 6 succeed, one fails -32 the rest fail -104.
   done:9871:6913:60 ffff88020ea16a80 293818155 C Bo:3:003:2 -71 512 >
   done:9872:6927:59 ffff88020ea16f00 293818235 C Bo:3:003:2 -71 0
   done:9873:6875:58 ffff88020ea16480 293818313 C Bo:3:003:2 0 1514 >
   done:9874:6786:57 ffff88020c7c83c0 293818353 C Bo:3:003:2 0 1514 >
   done:9875:6794:56 ffff88020c7c80c0 293818470 C Bo:3:003:2 0 1514 >
   done:9876:6789:55 ffff88020c7c8e40 293818589 C Bo:3:003:2 0 1514 >
   done:9877:6775:54 ffff88020c7c8240 293818702 C Bo:3:003:2 0 1090 >
   done:9878:6751:53 ffff88020c7c8180 293818803 C Bo:3:003:2 0 1514 >
   done:9879:6735:52 ffff88020c7c89c0 293818885 C Bo:3:003:2 -32 0
   done:9880:6671:51 ffff88020c7c8900 293818925 C Bo:3:003:2 -104 0
...
    done:9927:1292:4 ffff88020cf0c480 293819015 C Bo:3:003:2 -104 0
    done:9928:1170:3 ffff88020ea160c0 293819016 C Bo:3:003:2 -104 0
Something is known to be wrong...
  start:9931         ffff88020ea160c0 293819037 S Co:3:003:0
                         s 02 01 0000 0002 0000 0
    done:9929:1080:3 ffff88020ea16780 293819044 C Bo:3:003:2 -104 0
     done:9930:945:2 ffff88020ea16000 293819044 C Bo:3:003:2 -104 0
      done:9931:48:1 ffff88020ea160c0 293819085 C Co:3:003:0 0 0

These 10 transmits never finish:
  start:9932         ffff88020ea160c0 293819098 S Bo:3:003:2 -115 1090 =
                        3a340000 3a440000 22003200 00224d98
                        d8460002 1f0057d7 08004500 0428ff4e
... 9933 to 9940 deleted
  start:9941         ffff88020ea16b40 293819111 S Bo:3:003:2 -115 1514 =
                        e2350000 e2450000 22003200 00224d98
                        d8460002 1f0057d7 08004500 05d0ff57

All further transmits fail immediately E -12 and generate the
    'xhci_hcd 0000:00:14.0: ERROR no room on ep ring' message.
(There are 1070 'E' traces and 1070 'no room' messages.)
Receives are still working.
  start:9942         ffff88020ea16240 293819113 S Bo:3:003:2 -115 1514 =
                        e2350000 e2450000 22003200 00224d98
                        d8460002 1f0057d7 08004500 05d0ff58
    done:9942:1550:1 ffff88020ea16240 293820663 E Bo:3:003:2 -12 0
  start:9943         ffff88020ea16240 293820675 S Bo:3:003:2 -115 1514 =
                        e2350000 e2450000 22003200 00224d98
                        d8460002 1f0057d7 08004500 05d0ff59
    done:9943:1507:1 ffff88020ea16240 293822182 E Bo:3:003:2 -12 0

Eventually something causes a device remove and insert - everything re-initialises.
This is over 12 hours later.
      done:unknown   ffff88020c8570c0 3637139297 C Ii:3:001:1 0:2048 1 = 02
  start:1015         ffff88020c8570c0 3637139302 S Ii:3:001:1 -115:2048 4 <
  start:1016         ffff88020cbeb300 3637139323 S Ci:3:001:0
                         s a3 00 0000 0001 0004 4 <
                     ffff88020ea16240 3637139331 C Bi:3:003:1 -71 0
       done:1016:9:1 ffff88020cbeb300 3637139332 C Ci:3:001:0 0 4 = 00010100
  start:1017         ffff88020cbeb300 3637139334 S Co:3:001:0
                         s 23 01 0010 0001 0000 0
       done:1017:4:1 ffff88020cbeb300 3637139338 C Co:3:001:0 0 0
      done:unknown   ffff88020ca9ae40 3637139423 C Ii:3:003:3 -71:1 0
                     ffff88020c9db540 3637139428 C Bi:3:003:1 -108 0
                     ffff88020c9db780 3637139430 C Bi:3:003:1 -108 0
                     ffff88020d8bb540 3637139431 C Bi:3:003:1 -108 0

The last 10 transmits then terminate with error -108:
 done:9932:xxxx      ffff88020ea160c0 3637139462 C Bo:3:003:2 -108 0
... 9933 to 9940 deleted
 done:9941:xxxx      ffff88020ea16b40 3637139482 C Bo:3:003:2 -108 0
   done:1015:21090:3 ffff88020c8570c0 3637160392 C Ii:3:001:1 0:2048 1 = 02
  start:1018         ffff88020c8570c0 3637160396 S Ii:3:001:1 -115:2048 4 <
      done:unknown   ffff88020cbf26c0 3637176790 C Ii:3:005:1 -108:8 0
      done:unknown   ffff88020c68aa80 3637622497 C Ii:3:002:1 -108:2048 0

Followed by lots of Ci/Co and eventually it all starts working again.

I've not yet tried to look up the control transfers.

These aren't the only errors we are seeing, we also see (separately):
[21549.917529] hub 3-2:1.0: port 1 disabled by hub (EMI?), re-enabling...
[ 5822.629579] NETDEV WATCHDOG: eth0 (smsc95xx): transmit queue 0 timed out
[ 7263.834404] hid-generic 0003:413C:2005.0002: can't reset device, 0000:00:1a.0-1.4.3/input0, status -71 (connected to a USB2 header).
These all cause a USB bus reset and everything recovers within a couple of seconds.

	David



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html