netdev - using software TSO on non-TSO capable netdevices

Date:	Thu, 31 Jul 2008 01:50:04 +0200
From:	Lennert Buytenhek <buytenh@...tstofly.org>
To:	netdev@...r.kernel.org
Cc:	Ashish Karkare <akarkare@...vell.com>, Nicolas Pitre <nico@....org>
Subject: using software TSO on non-TSO capable netdevices

Hi,

I've been doing some network throughput tests with a NIC (mv643xx_eth)
that does not support TSO/GSO in hardware.  The host CPU is an ARM CPU
that is pretty fast as far as ARM CPUs go (1.2 GHz), but not so fast
when compared to x86s.

When using sendfile() to send a GiB worth of zeroes over a single TCP
connection to another host on a 100 Mb/s network, with a vanilla
2.6.27-rc1 kernel, this runs as expected at wire speed, taking the
following amount of CPU time per test:

	sys     0m5.410s
	sys     0m5.380s
	sys     0m5.620s
	sys     0m5.360s
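
For reference, a test of this shape can be sketched as below. This is a
minimal sketch, not the program used for the numbers above: the helper
names make_sparse_file() and send_file_bytes(), the /tmp location and the
1 MiB chunk size are all assumptions made for illustration.  The zeroes
come from a sparse file, which reads back as zeroes without occupying
disk space.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an unlinked sparse file of `size` bytes; it reads back as zeroes. */
static int make_sparse_file(off_t size)
{
	char path[] = "/tmp/zeroes-XXXXXX";	/* assumed location */
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, size) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/*
 * Push `total` bytes from the start of in_fd to out_fd using sendfile().
 * Returns the number of bytes sent, or -1 on error.
 */
static long long send_file_bytes(int out_fd, int in_fd, long long total)
{
	off_t off = 0;

	assert(total >= 0);
	while (off < total) {
		long long left = total - off;
		size_t chunk = left > (1 << 20) ? (1 << 20) : (size_t)left;
		ssize_t n = sendfile(out_fd, in_fd, &off, chunk);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)	/* input ended early */
			break;
	}
	return off;
}
```

In a real run, out_fd would be a connected TCP socket, total would be
1 GiB, and the whole program would be run under time(1) to obtain the
sys figures.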


With this patch:

	Index: linux-2.6.27-rc1/include/net/sock.h
	===================================================================
	--- linux-2.6.27-rc1.orig/include/net/sock.h
	+++ linux-2.6.27-rc1/include/net/sock.h
	@@ -1085,7 +1085,8 @@ extern struct dst_entry *sk_dst_check(st
	 
	 static inline int sk_can_gso(const struct sock *sk)
	 {
	-	return net_gso_ok(sk->sk_route_caps, sk->sk_gso_type);
	+//	return net_gso_ok(sk->sk_route_caps, sk->sk_gso_type);
	+	return 1;
	 }
	 
	 extern void sk_setup_caps(struct sock *sk, struct dst_entry *dst);

The CPU utilisation numbers drop to:

	sys     0m3.280s
	sys     0m3.230s
	sys     0m3.220s
	sys     0m3.350s

Putting some debug code in net/core/dev.c:dev_hard_start_xmit(), I can
see that pretty much all of the segments that enter there to be GSOd in
software are full-sized (64 KiB-ish).


When the ethernet link is in 1000 Mb/s mode, the test seems CPU-bound,
and things look a little different.  With vanilla 2.6.27-rc1, I get
these numbers for the same 1 GiB sendfile() test, where real time ~=
sys time:

	sys     0m18.200s
	sys     0m18.260s
	sys     0m17.830s
	sys     0m17.670s
	sys     0m17.840s
	sys     0m17.670s
	sys     0m17.300s
	sys     0m17.860s
	sys     0m18.260s
	sys     0m17.150s
	sys     0m17.950s

With the patch above applied once again, I get:

	real    0m16.319s       sys     0m13.930s
	real    0m15.680s       sys     0m14.900s
	real    0m15.538s       sys     0m10.410s
	real    0m15.325s       sys     0m8.440s
	real    0m16.147s       sys     0m12.680s
	real    0m15.549s       sys     0m12.840s
	real    0m15.667s       sys     0m13.860s
	real    0m15.509s       sys     0m14.980s
	real    0m15.237s       sys     0m10.850s

While the wall clock time isn't much improved (hitting some kind of
internal bus bandwidth or DMA latency limitation in the hardware?),
the system time is improved, although the improvement is jittery.


In general, when the link is at 1000 Mb/s, skb_shinfo(skb)->gso_segs
is either 2 or 3 for 99.99% of the skbs sent to
net/core/dev.c:dev_hard_start_xmit() (which seems to be cwnd
limited), unlike the 44 I see when the link is in 100 Mb/s mode.

I.e. with the debug patch at the end of this mail applied and the
link at 100 Mb/s, the output during steady state is always something
like this, with skb_shinfo(skb)->gso_segs always 44:

	Jul 31 00:12:59 kw kernel: 10k seg: 44:10000
	Jul 31 00:12:59 kw kernel: 10k size: 127:10000
	Jul 31 00:13:00 kw kernel: 10k seg: 44:10000
	Jul 31 00:13:00 kw kernel: 10k size: 127:10000
	Jul 31 00:13:02 kw kernel: 10k seg: 44:10000
	Jul 31 00:13:02 kw kernel: 10k size: 127:10000
	Jul 31 00:13:04 kw kernel: 10k seg: 44:10000
	Jul 31 00:13:04 kw kernel: 10k size: 127:10000
	Jul 31 00:13:05 kw kernel: 10k seg: 44:10000
	Jul 31 00:13:05 kw kernel: 10k size: 127:10000

With the same patch and the link at 1000 Mb/s, the output is something
like this (the 2-seg:3-seg ratio varies between runs but is typically
pretty constant within the same run; this is from one particular run):

	Jul 31 00:57:56 kw kernel: 10k seg: 2:4592 3:5408 
	Jul 31 00:57:56 kw kernel: 10k size: 5:4592 8:5408 
	Jul 31 00:57:56 kw kernel: 10k seg: 2:4513 3:5487 
	Jul 31 00:57:56 kw kernel: 10k size: 5:4513 8:5487 
	Jul 31 00:57:57 kw kernel: 10k seg: 2:4575 3:5425 
	Jul 31 00:57:57 kw kernel: 10k size: 5:4575 8:5425 
	Jul 31 00:57:58 kw kernel: 10k seg: 2:4569 3:5431 
	Jul 31 00:57:58 kw kernel: 10k size: 5:4569 8:5431 
	Jul 31 00:57:58 kw kernel: 10k seg: 2:4581 3:5419 
	Jul 31 00:57:58 kw kernel: 10k size: 5:4581 8:5419
	Jul 31 00:57:59 kw kernel: 10k seg: 2:4583 3:5417
	Jul 31 00:57:59 kw kernel: 10k size: 5:4583 8:5417
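
The "seg" and "size" lines can be cross-checked against each other with
a little arithmetic.  The sketch below mirrors the two bucketing rules
of the debug patch (gso_segs clamped to 44, skb->len shifted right by 9
and clamped to 149).  The MSS of 1448 bytes and the 66 bytes of
Ethernet/IP/TCP headers are assumptions, not values stated in this
mail, and the function names are made up for illustration.

```c
#include <assert.h>

/* Assumed values, not taken from the mail. */
enum {
	ASSUMED_MSS = 1448,	/* TCP payload per segment (timestamps on) */
	ASSUMED_HDR = 66,	/* Ethernet + IPv4 + TCP with timestamps */
};

/* Same clamp as the segment histogram: anything above 44 lands in 44. */
static unsigned int seg_bucket(unsigned int gso_segs)
{
	return gso_segs > 44 ? 44 : gso_segs;
}

/* Same rule as the size histogram: 512-byte buckets, clamped to 149. */
static unsigned int size_bucket(unsigned int len)
{
	unsigned int bucket = len >> 9;

	return bucket > 149 ? 149 : bucket;
}

/* skb->len of a GSO skb carrying `segs` full segments, under the assumptions. */
static unsigned int assumed_skb_len(unsigned int segs)
{
	assert(segs > 0);
	return segs * ASSUMED_MSS + ASSUMED_HDR;
}
```

Under these assumptions, 2 and 3 segments fall into size buckets 5 and
8, matching the 1000 Mb/s output.  44 segments would fall into bucket
124, while 45 segments fall into bucket 127, so the "44" reported at
100 Mb/s may well be 45 segments clamped by the histogram.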


Given this, I'm wondering about the following:

1. Considering the drop in CPU utilisation, are there reasons not
   to use software GSO on non-hardware-GSO-capable netdevices (apart
   from GSO possibly confusing tcpdump/iptables/qdiscs/etc)?

2. Why is the number of cycles necessary to send 1 GiB of data so
   much higher (~3.5x higher) in 1000 Mb/s mode than in 100 Mb/s mode?
   (Is this maybe just because time(1) is inaccurate w.r.t. time spent
   in interrupts and such?)

3. Why does dev_hard_start_xmit() get sent 64 KiB segments when the
   link is in 100 Mb/s mode but gso_segs never grows beyond 3 when
   the link is in 1000 Mb/s mode?
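
For question 2, the cost can be expressed as cycles per byte sent.  This
is a minimal sketch: it assumes the CPU runs at 1.2 GHz for the whole
test and that the sys time reported by time(1) is accurate, which is
exactly what the question puts in doubt.  cycles_per_byte() is a
hypothetical helper name.

```c
#include <assert.h>

#define ASSUMED_CPU_HZ	1.2e9				/* 1.2 GHz ARM CPU */
#define ONE_GIB		(1024.0 * 1024.0 * 1024.0)	/* bytes per test run */

/* CPU cycles spent in the kernel per byte sent, given the sys time of a run. */
static double cycles_per_byte(double sys_seconds, double bytes)
{
	assert(bytes > 0.0);
	return sys_seconds * ASSUMED_CPU_HZ / bytes;
}
```

For example, cycles_per_byte(5.41, ONE_GIB) is roughly 6 cycles per
byte for a vanilla 100 Mb/s run, and cycles_per_byte(18.20, ONE_GIB) is
roughly 20 cycles per byte for a vanilla 1000 Mb/s run.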


Any more thoughts about this or things I can try?  Any other ideas
to speed up the 1000 Mb/s case?


thanks,
Lennert




Index: linux-2.6.27-rc1/net/core/dev.c
===================================================================
--- linux-2.6.27-rc1.orig/net/core/dev.c
+++ linux-2.6.27-rc1/net/core/dev.c
@@ -1633,6 +1633,58 @@ int dev_hard_start_xmit(struct sk_buff *
 	}
 
 gso:
+	if (1) {
+		static int samples;
+		static int segment_histo[45];
+		int segments = 0;
+
+		segments = skb_shinfo(skb)->gso_segs;
+		if (segments > 44)
+			segments = 44;
+		segment_histo[segments]++;
+
+		if (++samples == 10000) {
+			int i;
+
+			samples = 0;
+
+			printk(KERN_CRIT "10k seg: ");
+			for (i = 0; i < 45; i++) {
+				if (segment_histo[i]) {
+					printk("%d:%d ", i, segment_histo[i]);
+					segment_histo[i] = 0;
+				}
+			}
+			printk("\n");
+		}
+	}
+
+	if (1) {
+		static int samples;
+		static int size_histo[150];
+		int len = 0;
+
+		len = skb->len >> 9;
+		if (len > 149)
+			len = 149;
+		size_histo[len]++;
+
+		if (++samples == 10000) {
+			int i;
+
+			samples = 0;
+
+			printk(KERN_CRIT "10k size: ");
+			for (i = 0; i < 150; i++) {
+				if (size_histo[i]) {
+					printk("%d:%d ", i, size_histo[i]);
+					size_histo[i] = 0;
+				}
+			}
+			printk("\n");
+		}
+	}
+
 	do {
 		struct sk_buff *nskb = skb->next;
 		int rc;