Date:	Sun, 26 Aug 2007 04:41:34 -0400
From:	Bill Fink <billfink@...dspring.com>
To:	John Heffner <jheffner@....edu>
Cc:	Rick Jones <rick.jones2@...com>, hadi@...erus.ca,
	David Miller <davem@...emloft.net>, krkumar2@...ibm.com,
	gaagaan@...il.com, general@...ts.openfabrics.org,
	herbert@...dor.apana.org.au, jagana@...ibm.com, jeff@...zik.org,
	johnpol@....mipt.ru, kaber@...sh.net, mcarlson@...adcom.com,
	mchan@...adcom.com, netdev@...r.kernel.org,
	peter.p.waskiewicz.jr@...el.com, rdreier@...co.com,
	Robert.Olsson@...a.slu.se, shemminger@...ux-foundation.org,
	sri@...ibm.com, tgraf@...g.ch, xma@...ibm.com
Subject: Re: [PATCH 0/9 Rev3] Implement batching skb API and support in
 IPoIB

On Fri, 24 Aug 2007, John Heffner wrote:

> Bill Fink wrote:
> > Here you can see there is a major difference in the TX CPU utilization
> > (99 % with TSO disabled versus only 39 % with TSO enabled), although
> > the TSO disabled case was able to squeeze out a little extra performance
> > from its extra CPU utilization.  Interestingly, with TSO enabled, the
> > receiver actually consumed more CPU than with TSO disabled, so I guess
> > the receiver CPU saturation in that case (99 %) was what restricted
> > its performance somewhat (this was consistent across a few test runs).
> 
> One possibility is that I think the receive-side processing tends to do 
> better when receiving into an empty queue.  When the (non-TSO) sender is 
> the flow's bottleneck, this is going to be the case.  But when you 
> switch to TSO, the receiver becomes the bottleneck and you're always 
> going to have to put the packets at the back of the receive queue.  This 
> might help account for the reason why you have both lower throughput and 
> higher CPU utilization -- there's a point of instability right where the 
> receiver becomes the bottleneck and you end up pushing it over to the 
> bad side. :)
> 
> Just a theory.  I'm honestly surprised this effect would be so 
> significant.  What do the numbers from netstat -s look like in the two 
> cases?

Well, I was going to check this out, but I happened to reboot the
system and now I get somewhat different results.

Here are the new results, which should hopefully be more accurate
since they are on a freshly booted system.
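
(For anyone wanting to reproduce these runs, the TSO/GSO offload
settings can be toggled with ethtool along the following lines --
just a sketch, with eth2 standing in for the actual 10-GigE
interface name:

[root@...g2 ~]# ethtool -K eth2 tso on gso off    # TSO enabled, GSO disabled
[root@...g2 ~]# ethtool -K eth2 tso off gso off   # TSO disabled, GSO disabled
[root@...g2 ~]# ethtool -K eth2 tso off gso on    # TSO disabled, GSO enabled
[root@...g2 ~]# ethtool -k eth2                   # verify the current offload settings
)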

TSO enabled and GSO disabled:

[root@...g2 ~]# nuttcp -w10m 192.168.88.16
11610.6875 MB /  10.00 sec = 9735.9526 Mbps 100 %TX 75 %RX

[root@...g2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5029.6875 MB /  10.06 sec = 4194.6931 Mbps 36 %TX 100 %RX

TSO disabled and GSO disabled:

[root@...g2 ~]# nuttcp -w10m 192.168.88.16
11817.9375 MB /  10.00 sec = 9909.7773 Mbps 99 %TX 77 %RX

[root@...g2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5823.3125 MB /  10.00 sec = 4883.2429 Mbps 100 %TX 82 %RX

The TSO disabled case got a little better performance even for
9000 byte jumbo frames.  For the "-M1460" case, emulating a
standard 1500 byte Ethernet MTU (a 1460-byte MSS plus 20-byte IP
and TCP headers), the performance was significantly better and
used less CPU on the receiver (82 % versus 100 %), although it
did use significantly more CPU on the transmitter (100 % versus
36 %).

TSO disabled and GSO enabled:

[root@...g2 ~]# nuttcp -w10m 192.168.88.16
11609.5625 MB /  10.00 sec = 9734.9859 Mbps 99 %TX 75 %RX

[root@...g2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 5001.4375 MB /  10.06 sec = 4170.6739 Mbps 52 %TX 100 %RX

The GSO enabled case is very similar to the TSO enabled case,
except that for the "-M1460" test the transmitter used more
CPU (52 % versus 36 %), which is to be expected since TSO has
hardware assist.

Here's the beforeafter delta of the receiver's "netstat -s"
statistics for the TSO enabled case:

Ip:
    3659898 total packets received
    3659898 incoming packets delivered
    80050 requests sent out
Tcp:
    2 passive connection openings
    3659897 segments received
    80050 segments send out
TcpExt:
    33 packets directly queued to recvmsg prequeue.
    104956 packets directly received from backlog
    705528 packets directly received from prequeue
    3654842 packets header predicted
    193 packets header predicted and directly queued to user
    4 acknowledgments not containing data received
    6 predicted acknowledgments

And here it is for the TSO disabled case (GSO also disabled):

Ip:
    4107083 total packets received
    4107083 incoming packets delivered
    1401376 requests sent out
Tcp:
    2 passive connection openings
    4107083 segments received
    1401376 segments send out
TcpExt:
    2 TCP sockets finished time wait in fast timer
    48486 packets directly queued to recvmsg prequeue.
    1056111048 packets directly received from backlog
    2273357712 packets directly received from prequeue
    1819317 packets header predicted
    2287497 packets header predicted and directly queued to user
    4 acknowledgments not containing data received
    10 predicted acknowledgments

For the TSO disabled case, there are many more TCP segments sent
out (1401376 versus 80050), which I assume are ACKs, and which could
possibly contribute to the higher throughput for the TSO disabled
case due to faster feedback, but that wouldn't explain the lower CPU
utilization.  There are also many more packets directly queued to
recvmsg prequeue (48486 versus 33).  The numbers for packets directly
received from backlog and prequeue in the TSO disabled case seem
bogus to me, so I don't know how to interpret them.  There are only
about half as many packets header predicted (1819317 versus 3654842),
but there are many more packets header predicted and directly queued
to user (2287497 versus 193).  I'll leave the analysis of all this to
those who might actually know what it all means.
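
(For completeness: the deltas above are just "netstat -s" snapshots
taken on the receiver before and after a run, with the first
snapshot's counters subtracted from the second's.  Roughly, on the
receiver, with illustrative filenames:

    netstat -s > netstat.before
    # ... run the nuttcp test from the transmitter ...
    netstat -s > netstat.after
    beforeafter netstat.before netstat.after

where beforeafter is any utility or script that subtracts the
corresponding counters between the two snapshots.)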

I also ran another set of tests that may be of interest.  I changed
the rx-usecs/tx-usecs interrupt coalescing parameter from the
recommended optimum value of 75 usecs to 0 (no coalescing), but
only on the transmitter.  The comparison discussions below are
relative to the previous tests where rx-usecs/tx-usecs were set
to 75 usecs.
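
(For reference, the interrupt coalescing settings can typically be
adjusted with ethtool -C, roughly as below, though some drivers use
module parameters instead -- eth2 is again just a placeholder for
the actual transmitter interface:

[root@...g2 ~]# ethtool -C eth2 rx-usecs 0 tx-usecs 0   # disable coalescing
[root@...g2 ~]# ethtool -c eth2                         # verify the coalescing settings
)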

TSO enabled and GSO disabled:

[root@...g2 ~]# nuttcp -w10m 192.168.88.16
11812.8125 MB /  10.00 sec = 9905.6640 Mbps 100 %TX 75 %RX

[root@...g2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 7701.8750 MB /  10.00 sec = 6458.5541 Mbps 100 %TX 56 %RX

For 9000 byte jumbo frames it now gets a little better performance
and almost matches the 10-GigE line rate performance of the TSO
disabled case.  For the "-M1460" test, it gets substantially better
performance (6458.5541 Mbps versus 4194.6931 Mbps) at the expense
of much higher transmitter CPU utilization (100 % versus 36 %),
although the receiver CPU utilization is much less (56 % versus 100 %).

TSO disabled and GSO disabled:

[root@...g2 ~]# nuttcp -w10m 192.168.88.16
11817.3125 MB /  10.00 sec = 9909.4058 Mbps 100 %TX 76 %RX

[root@...g2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 4081.2500 MB /  10.00 sec = 3422.3994 Mbps 99 %TX 41 %RX

For 9000 byte jumbo frames the results are essentially the same.
For the "-M1460" test, the performance is significantly worse
(3422.3994 Mbps versus 4883.2429 Mbps) even though the transmitter
CPU utilization is saturated in both cases, and the receiver CPU
utilization is only about half (41 % versus 82 %).

TSO disabled and GSO enabled:

[root@...g2 ~]# nuttcp -w10m 192.168.88.16
11813.3750 MB /  10.00 sec = 9906.1090 Mbps 99 %TX 77 %RX

[root@...g2 ~]# nuttcp -M1460 -w10m 192.168.88.16
 3939.1875 MB /  10.00 sec = 3303.2814 Mbps 100 %TX 41 %RX

For 9000 byte jumbo frames the performance is a little better,
again approaching 10-GigE line rate.  But for the "-M1460" test,
the performance is significantly worse (3303.2814 Mbps versus
4170.6739 Mbps) even though the transmitter consumes much more
CPU (100 % versus 52 %).  In this case, though, the receiver has
a much lower CPU utilization (41 % versus 100 %).

						-Bill
