Date:	Fri, 28 Sep 2007 14:31:01 -0700
From:	Rick Jones <rick.jones2@...com>
To:	Jay Vosburgh <fubar@...ibm.com>
Cc:	Linux Network Development list <netdev@...r.kernel.org>
Subject: Re: error(s) in 2.6.23-rc5 bonding.txt ?

Well, I managed to concoct an updated test, this time with 1G links going into a 
10G.  A 2.6.23-rc8 kernel on the system with four dual-port 82546GBs, 
connected to an HP ProCurve 3500 series switch with a 10G link to a system 
running 2.6.18-8.el5 (I was having difficulty getting cxgb3 going on my 
kernel.org kernels - firmware mismatches - so I booted RHEL5 there).

I put all four 1G interfaces into a balance_rr (mode=0) bond and started running 
just a single netperf TCP_STREAM test.

On the bonding side:

hpcpc103:~/net-2.6.24/Documentation/networking# netstat -s -t | grep retran
     19050 segments retransmited
     9349 fast retransmits
     9698 forward retransmits
hpcpc103:~/net-2.6.24/Documentation/networking# ifconfig bond0 | grep pack
           RX packets:50708119 errors:0 dropped:0 overruns:0 frame:0
           TX packets:58801285 errors:0 dropped:0 overruns:0 carrier:0
hpcpc103:~/net-2.6.24/Documentation/networking# netperf -H 192.168.5.106
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.5.106 
(192.168.5.106) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

  87380  16384  16384    10.01    1267.99
hpcpc103:~/net-2.6.24/Documentation/networking# netstat -s -t | grep retran
     20268 segments retransmited
     9974 fast retransmits
     10291 forward retransmits
hpcpc103:~/net-2.6.24/Documentation/networking# ifconfig bond0 | grep pack
           RX packets:51636421 errors:0 dropped:0 overruns:0 frame:0
           TX packets:59899089 errors:0 dropped:0 overruns:0 carrier:0

On the receiving side:

[root@...pc106 ~]# ifconfig eth5 | grep pack
           RX packets:58802455 errors:0 dropped:0 overruns:0 frame:0
           TX packets:50205304 errors:0 dropped:0 overruns:0 carrier:0
[root@...pc106 ~]# ifconfig eth5 | grep pack
           RX packets:59900267 errors:0 dropped:0 overruns:0 frame:0
           TX packets:51124138 errors:0 dropped:0 overruns:0 carrier:0

So, there were  20268 - 19050  or 1218 retransmissions during the test.  The 
sending side reported sending 59899089 - 58801285 or 1097804 packets, and the 
receiver reported receiving 59900267 - 58802455 or 1097812 packets.
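As a quick worked check of those counter deltas (a sketch; the numbers are the ones quoted above, and the small RX-vs-TX discrepancy is just the counters being sampled at slightly different instants):

```python
# Deltas from the before/after snapshots quoted above. The receiver's
# RX delta slightly exceeds the sender's TX delta only because the
# counters were read at slightly different moments -- the point is
# that essentially nothing was lost, so the ~1218 retransmissions
# must have been spurious.
retrans = 20268 - 19050           # "segments retransmited" delta
tx_pkts = 59899089 - 58801285     # bond0 TX packets delta
rx_pkts = 59900267 - 58802455     # eth5 RX packets delta on the receiver
print(retrans, tx_pkts, rx_pkts)
```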

Unless the switch was only occasionally duplicating segments or something, it 
looks like all the retransmissions were the result of duplicate ACKs from packet 
reordering.

For grins I varied the "reordering" sysctl and got:

# netstat -s -t | grep retran; for i in 3 4 5 6 7 8 9 10 20 30; do sysctl -w 
net.ipv4.tcp_reordering=$i; netperf -H 192.168.5.106 -P 0 -B "reorder $i"; 
netstat -s -t | grep retran; done
     13735 segments retransmited
     6581 fast retransmits
     7151 forward retransmits
net.ipv4.tcp_reordering = 3
  87380  16384  16384    10.01    1294.51   reorder 3
     15127 segments retransmited
     7330 fast retransmits
     7794 forward retransmits
net.ipv4.tcp_reordering = 4
  87380  16384  16384    10.01    1304.22   reorder 4
     16103 segments retransmited
     7807 fast retransmits
     8293 forward retransmits
net.ipv4.tcp_reordering = 5
  87380  16384  16384    10.01    1330.88   reorder 5
     16763 segments retransmited
     8155 fast retransmits
     8605 forward retransmits
net.ipv4.tcp_reordering = 6
  87380  16384  16384    10.01    1350.50   reorder 6
     17134 segments retransmited
     8356 fast retransmits
     8775 forward retransmits
net.ipv4.tcp_reordering = 7
  87380  16384  16384    10.01    1353.00   reorder 7
     17492 segments retransmited
     8553 fast retransmits
     8936 forward retransmits
net.ipv4.tcp_reordering = 8
  87380  16384  16384    10.01    1358.00   reorder 8
     17649 segments retransmited
     8625 fast retransmits
     9021 forward retransmits
net.ipv4.tcp_reordering = 9
  87380  16384  16384    10.01    1415.89   reorder 9
     17736 segments retransmited
     8666 fast retransmits
     9067 forward retransmits
net.ipv4.tcp_reordering = 10
  87380  16384  16384    10.01    1412.36   reorder 10
     17773 segments retransmited
     8684 fast retransmits
     9086 forward retransmits
net.ipv4.tcp_reordering = 20
  87380  16384  16384    10.01    1403.47   reorder 20
     17773 segments retransmited
     8684 fast retransmits
     9086 forward retransmits
net.ipv4.tcp_reordering = 30
  87380  16384  16384    10.01    1325.41   reorder 30
     17773 segments retransmited
     8684 fast retransmits
     9086 forward retransmits

I.e., fast retransmits from reordering persisted until the reorder limit was 
raised reasonably well above the number of links in the aggregate.
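For anyone who wants to see the mechanism in miniature, here is a toy model of round-robin striping feeding a dupACK-threshold sender. The per-link delays and inter-send gap are made-up illustrative numbers, not measurements from this test, but the shape matches the sweep above: the spurious fast retransmits vanish once the threshold is raised well past the link count.

```python
# Toy model: segments are striped round-robin over `links` paths with
# different fixed one-way delays; the receiver sees them out of order
# and emits duplicate ACKs, which trip fast retransmit once `dupthresh`
# duplicates accumulate -- even though nothing was actually lost.

def spurious_fast_retransmits(nsegs, links, delays, dupthresh, gap=1.0):
    # Segment i departs at t = i*gap on link i % links; sort by arrival
    # time to get the receiver's view (ties broken by sequence number).
    arrivals = sorted((i * gap + delays[i % links], i) for i in range(nsegs))
    expected = 0        # next in-order sequence number (cumulative ACK)
    buffered = set()    # out-of-order segments held by the receiver
    dupacks = 0         # consecutive duplicate ACKs for `expected`
    fast_rexmits = 0
    for _, seq in arrivals:
        if seq == expected:
            expected += 1
            while expected in buffered:     # drain the reorder buffer
                buffered.remove(expected)
                expected += 1
            dupacks = 0                     # new cumulative ACK resets count
        else:
            buffered.add(seq)
            dupacks += 1                    # receiver re-ACKs `expected`
            if dupacks == dupthresh:
                fast_rexmits += 1           # sender would retransmit here
    return fast_rexmits

# Four slaves, one notably slower (think: a slower bus feeding one NIC):
delays = [0.0, 1.0, 2.0, 9.0]
for thresh in (3, 6, 10):
    print(thresh, spurious_fast_retransmits(1000, 4, delays, thresh))
```

With the default threshold of 3 nearly every round through the slow link trips a fast retransmit; at a threshold comfortably above the four links, none do.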

As for how things got reordered, Knuth knows exactly why.  But it didn't need 
more than one connection, and that connection didn't have to vary the size of 
what it was passing to send().  Netperf was not making send calls that were an 
integral multiple of the MSS, which means that from time to time a short segment 
would be queued to an interface in the bond.  Also, two of the dual-port NICs 
were on 66 MHz PCI-X busses and the other two were on 133 MHz PCI-X busses 
(four busses in all), so the DMA times will have differed.
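One way to see where the short segments come from (a sketch, assuming an effective payload of 1448 bytes per segment: the negotiated MSS of 1460 less 12 bytes of TCP timestamp options, which the trace shows were negotiated):

```python
# Each netperf send() is 16384 bytes. With TCP timestamps the payload
# per segment is 1460 - 12 = 1448 bytes (an assumption; exact option
# overhead depends on the connection), so the tail of every send goes
# out as a short segment unless it gets coalesced with the next send --
# and each short segment lands on whichever slave is next in the
# round robin.
send_size = 16384
mss = 1460 - 12           # negotiated MSS minus timestamp option bytes
full, tail = divmod(send_size, mss)
print(f"{full} full segments of {mss} bytes + one {tail}-byte tail per send")
```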


And as if this mail wasn't already long enough, here is a tcptrace summary for 
the netperf data connection with reordering set to 3:

================================
TCP connection 2:
         host c:        192.168.5.103:52264
         host d:        192.168.5.106:33940
         complete conn: yes
         first packet:  Fri Sep 28 14:06:43.271692 2007
         last packet:   Fri Sep 28 14:06:53.277018 2007
         elapsed time:  0:00:10.005326
         total packets: 1556191
         filename:      trace
    c->d:                              d->c:
      total packets:        699400           total packets:        856791
      ack pkts sent:        699399           ack pkts sent:        856791
      pure acks sent:            2           pure acks sent:       856789
      sack pkts sent:            0           sack pkts sent:       352480
      dsack pkts sent:           0           dsack pkts sent:         948
      max sack blks/ack:         0           max sack blks/ack:         3
      unique bytes sent: 1180423912           unique bytes sent:         0
      actual data pkts:     699397           actual data pkts:          0
      actual data bytes: 1180581744           actual data bytes:         0
      rexmt data pkts:         106           rexmt data pkts:           0
      rexmt data bytes:     157832           rexmt data bytes:          0
      zwnd probe pkts:           0           zwnd probe pkts:           0
      zwnd probe bytes:          0           zwnd probe bytes:          0
      outoforder pkts:      202461           outoforder pkts:           0
      pushed data pkts:       6057           pushed data pkts:          0
      SYN/FIN pkts sent:       1/1           SYN/FIN pkts sent:       1/1
      req 1323 ws/ts:          Y/Y           req 1323 ws/ts:          Y/Y
      adv wind scale:            7           adv wind scale:            9
      req sack:                  Y           req sack:                  Y
      sacks sent:                0           sacks sent:           352480
      urgent data pkts:          0 pkts      urgent data pkts:          0 pkts
      urgent data bytes:         0 bytes     urgent data bytes:         0 bytes
      mss requested:          1460 bytes     mss requested:          1460 bytes
      max segm size:          8688 bytes     max segm size:             0 bytes
      min segm size:             8 bytes     min segm size:             0 bytes
      avg segm size:          1687 bytes     avg segm size:             0 bytes
      max win adv:            5888 bytes     max win adv:          968704 bytes
      min win adv:            5888 bytes     min win adv:            8704 bytes
      zero win adv:              0 times     zero win adv:              0 times
      avg win adv:            5888 bytes     avg win adv:          798088 bytes
      initial window:         2896 bytes     initial window:            0 bytes
      initial window:            2 pkts      initial window:            0 pkts
      ttl stream length: 1577454360 bytes     ttl stream length:         0 bytes
      missed data:       397030448 bytes     missed data:               0 bytes
      truncated data:    1159600134 bytes     truncated data:            0 bytes
      truncated packets:    699383 pkts      truncated packets:         0 pkts
      data xmit time:       10.005 secs      data xmit time:        0.000 secs
      idletime max:            7.5 ms        idletime max:            7.4 ms
      throughput:        117979555 Bps       throughput:                0 Bps

This was taken at the receiving 10G NIC.

rick jones
