Message-ID: <20080731004123.GB22826@xi.wantstofly.org>
Date: Thu, 31 Jul 2008 02:41:23 +0200
From: Lennert Buytenhek <buytenh@...tstofly.org>
To: David Miller <davem@...emloft.net>
Cc: netdev@...r.kernel.org, akarkare@...vell.com, nico@....org
Subject: Re: using software TSO on non-TSO capable netdevices
On Wed, Jul 30, 2008 at 04:56:21PM -0700, David Miller wrote:
> Thanks for all the great data and testing.
Thanks for taking the time to look at this and replying so quickly!
> > Given this, I'm wondering about the following:
> >
> > 1. Considering the drop in CPU utilisation, are there reasons not
> > to use software GSO on non-hardware-GSO-capable netdevices (apart
> > from GSO possibly confusing tcpdump/iptables/qdiscs/etc)?
>
> We should probably enable software GSO whenever the device can
> do scatter-gather and checksum offload.
OK.
> > 3. Why does dev_hard_start_xmit() get sent 64 KiB segments when the
> > link is in 100 Mb/s mode but gso_segs never grows beyond 3 when
> > the link is in 1000 Mb/s mode?
>
> Because the link can empty the socket send buffer fast enough such
> that there is often not enough data to coalesce into larger GSO frames.
> At least that's my guess.
Hmmmm.
The hacky patch below (on top of 2.6.27-rc1 + stubbing out the
sk_can_gso() check) reduces the 1 GiB 1000 Mb/s sendfile test from:
real 0m16.319s sys 0m13.930s
real 0m15.680s sys 0m14.900s
real 0m15.538s sys 0m10.410s
real 0m15.325s sys 0m8.440s
real 0m16.147s sys 0m12.680s
real 0m15.549s sys 0m12.840s
real 0m15.667s sys 0m13.860s
real 0m15.509s sys 0m14.980s
real 0m15.237s sys 0m10.850s
to:
real 0m14.643s sys 0m3.260s
real 0m14.547s sys 0m3.100s
real 0m14.932s sys 0m3.290s
real 0m14.557s sys 0m3.160s
real 0m14.712s sys 0m3.260s
real 0m14.827s sys 0m3.360s
real 0m14.495s sys 0m3.200s
real 0m14.575s sys 0m3.220s
real 0m14.552s sys 0m3.420s
(I'm sure there's a better way to enforce larger GSO frames; I don't
know the TCP stack very well.)
I.e. dramatic CPU time improvements, and some overall speedup as well.
I wonder if something like this can be done in a less hacky fashion --
the hard part, I guess, is deciding when to keep coalescing (to reduce
CPU overhead) vs. when to push out what has been coalesced so far (to
keep the pipe filled), and I'm not sure I have good ideas about how to
make that decision.
Index: linux-2.6.27-rc1/net/ipv4/tcp_output.c
===================================================================
--- linux-2.6.27-rc1.orig/net/ipv4/tcp_output.c
+++ linux-2.6.27-rc1/net/ipv4/tcp_output.c
@@ -1544,7 +1544,7 @@ static int tcp_write_xmit(struct sock *s
 			break;

 		if (tso_segs == 1) {
-			if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
+			if (unlikely(!tcp_nagle_test(tp, skb, 5 * mss_now,
 						     (tcp_skb_is_last(sk, skb) ?
 						      nonagle : TCP_NAGLE_PUSH))))
 				break;
--