Message-ID: <Pine.LNX.4.64.1002070616470.15390@nacho.alt.net>
Date:	Sun, 7 Feb 2010 06:52:00 +0000 (UTC)
From:	Chris Caputo <ccaputo@....net>
To:	Jay Vosburgh <fubar@...ibm.com>
cc:	bonding-devel@...ts.sourceforge.net, netdev@...r.kernel.org
Subject: Re: bonding forwarding perf issues in 2.6.32.7 & 2.6.29.6

On Sat, 6 Feb 2010, Jay Vosburgh wrote:
> Chris Caputo <ccaputo@....net> wrote:
> >Kernel 2.6.32.7 (and 2.6.32.5 & 2.6.29.6) on a 2x Intel Xeon E5420 
> >(Quad-Core 2.5GHz), SuperMicro X7DBE+, 32GB (16 * 2GB) DDR2-667MHz.
> >
> >I have a router with a variety of e1000 and e1000e based interfaces.
> >
> >bond0 is a 2xGigE (82571EB) with two active slaves.
> >
> >bond1 has up to 3 slaves (2x 80003ES2LAN/82563, 82546EB).
> >
> >Both are configured with miimon=100, balance-xor, layer3+4.
> >
> >When bond1 has just a single active slave, outbound (and possibly inbound) 
> >forwarding performance on bond1 is better than when it has two or three 
> >active slaves.  I.e., when I activate the second slave, by enabling the 
> >port on the switch it is connected to, forwarding performance drops 
> >dramatically across the full bond1.
> 
> 	What exactly do you mean by "forwarding performance drops
> dramatically"?  How are you measuring this?

I have TCP flows continuously coming in through this router to internal 
servers.

On a second-by-second basis, for example by parsing ifconfig output, I can 
see the flow rates through the router, e.g.:

  RX: 360 mbits/sec  TX: 462 mbits/sec
  RX: 350 mbits/sec  TX: 527 mbits/sec
  RX: 361 mbits/sec  TX: 462 mbits/sec
  [...]
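
A minimal sketch of how such per-second rates could be derived from 
interface byte counters.  It reads /proc/net/dev rather than parsing 
ifconfig output (a substitution made here for simplicity, not the method 
used above), and the function names are hypothetical:

```python
import time

def parse_proc_net_dev(text):
    """Return {interface: (rx_bytes, tx_bytes)} from /proc/net/dev content."""
    counters = {}
    for line in text.splitlines()[2:]:      # the first two lines are headers
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters

def mbits_per_sec(prev_bytes, curr_bytes, seconds):
    """Convert a byte-counter delta over an interval into mbits/sec."""
    return (curr_bytes - prev_bytes) * 8 / seconds / 1e6

def watch(iface, interval=1.0):
    """Print RX/TX rates for iface once per interval until interrupted."""
    with open("/proc/net/dev") as f:
        prev = parse_proc_net_dev(f.read())[iface]
    while True:
        time.sleep(interval)
        with open("/proc/net/dev") as f:
            curr = parse_proc_net_dev(f.read())[iface]
        rx = mbits_per_sec(prev[0], curr[0], interval)
        tx = mbits_per_sec(prev[1], curr[1], interval)
        print(f"  RX: {rx:.0f} mbits/sec  TX: {tx:.0f} mbits/sec")
        prev = curr
```

Calling watch("bond1") would print one line per second in the format shown 
above.  The sketch ignores counter wrap-around, which matters on systems 
with 32-bit byte counters.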
	
When I go from a single GigE slave to 2x or 3x GigE, there is a noticeable 
drop in throughput.  That could in principle be explained by fewer 
retransmits, due to less packet loss on a less congested link, but I can 
tell that is not what is happening from how the internal servers store 
the data.  (They receive the data and then write it to storage servers 
using another NIC.)

As a demonstration, when I had 3xGigE bond1 going on the router, 
throughput on one of the storage servers was as follows:

  [10 second averages]
  RX: 207 mbits/sec  TX: 3 mbits/sec
  RX: 206 mbits/sec  TX: 3 mbits/sec
  RX: 208 mbits/sec  TX: 3 mbits/sec
  RX: 202 mbits/sec  TX: 3 mbits/sec
  RX: 208 mbits/sec  TX: 3 mbits/sec
  RX: 202 mbits/sec  TX: 3 mbits/sec
  RX: 197 mbits/sec  TX: 3 mbits/sec

When I then disabled all but one of the GigE's for bond1 on the router, 
the release of back-pressure on the incoming TCP flows was immediately 
visible through increased writes to this storage server:

  [10 second averages]
  RX: 144 mbits/sec  TX: 2 mbits/sec
  RX: 355 mbits/sec  TX: 6 mbits/sec
  RX: 387 mbits/sec  TX: 7 mbits/sec
  RX: 325 mbits/sec  TX: 6 mbits/sec
  RX: 365 mbits/sec  TX: 6 mbits/sec
  RX: 317 mbits/sec  TX: 5 mbits/sec
  RX: 318 mbits/sec  TX: 5 mbits/sec

  (I think the dip to 144 mbits was the result of the NIC status changes.)

This is repeatable, and going the other way (GigE -> 3xGigE) also shows a 
visible drop in throughput.
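
To put a number on the difference, the two sets of 10 second averages 
above can be compared directly.  This is only arithmetic on the RX figures 
already listed; the 144 mbits sample is left out on the assumption that it 
reflects the NIC status changes:

```python
from statistics import mean

# RX mbits/sec on the storage server, copied from the listings above.
rx_three_slaves = [207, 206, 208, 202, 208, 202, 197]
rx_one_slave = [355, 387, 325, 365, 317, 318]   # 144 omitted (transition)

avg3 = mean(rx_three_slaves)
avg1 = mean(rx_one_slave)

print(f"3xGigE bond1: {avg3:.1f} mbits/sec")    # 204.3
print(f"1xGigE bond1: {avg1:.1f} mbits/sec")    # 344.5
print(f"ratio:        {avg1 / avg3:.2f}x")      # 1.69x
```

So with a single active slave the storage server receives roughly 1.7 
times as much data as with three.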

Also, I tried balance-rr, rather than balance-xor, and that didn't help.

I would suspect motherboard bus limitations, except that I am able to run 
netperf unidirectional UDP tests that, on a round-robin 3xGigE, result in 
more than 800 mbps on each interface, which is far more than the TCP flows 
that appear to have back-pressure when I engage bonding.

> 	Also, just to confirm, are the switch ports connected to the
> respective bonds also grouped on the switch?  The balance-xor mode is
> meant to interop with an Etherchannel compatible switch port
> aggregation.

Yes, the switch is an HP2848 with the 3 GigE's configured as a trunk.
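
For reference, a sketch of how balance-xor with the layer3+4 policy picks 
an egress slave.  The formula is the one given in the kernel's bonding 
documentation of this era, as best I recall it, for unfragmented TCP and 
UDP over IPv4; the function name is hypothetical and the actual driver 
code may differ in detail:

```python
import ipaddress

def layer34_slave(src_ip, dst_ip, src_port, dst_port, slave_count):
    """Slave index per the documented layer3+4 transmit hash:
    ((src port XOR dst port) XOR ((src IP XOR dst IP) AND 0xffff))
    modulo slave count.
    """
    ip_xor = int(ipaddress.IPv4Address(src_ip)) ^ int(ipaddress.IPv4Address(dst_ip))
    return ((src_port ^ dst_port) ^ (ip_xor & 0xFFFF)) % slave_count

# Every packet of a given flow hashes to the same slave, so a single TCP
# flow is never spread across links and cannot exceed one link's speed.
print(layer34_slave("10.0.0.1", "10.0.0.2", 1000, 80, 3))   # 1
```

Under this policy, packets within a flow stay in order on one slave, 
whereas balance-rr spreads a flow's packets across all slaves.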

> >Locally originated packets do not seem to be harmed by the second GigE 
> >coming online.  From what I have observed, the issue is with forwarding.  
> >The majority of the forwarding traffic is coming in on bond0 and egressing 
> >on bond1.
> 
> 	Perhaps it has something to do with forwarding causing LRO to be
> disabled.

All three interfaces have LRO off:

  rx-checksumming: on
  tx-checksumming: on
  scatter-gather: on
  tcp-segmentation-offload: off
  udp-fragmentation-offload: off
  generic-segmentation-offload: on
  generic-receive-offload: off
  large-receive-offload: off
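
A small sketch for checking these settings across several interfaces.  It 
assumes the listing above is `ethtool -k` style output (the command is not 
stated here), and the function name is hypothetical:

```python
def parse_offloads(text):
    """Map each 'feature: on|off' line to True/False; skip other lines."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value in ("on", "off"):
            settings[name.strip()] = (value == "on")
    return settings

listing = """\
  rx-checksumming: on
  tx-checksumming: on
  scatter-gather: on
  tcp-segmentation-offload: off
  udp-fragmentation-offload: off
  generic-segmentation-offload: on
  generic-receive-offload: off
  large-receive-offload: off
"""

offloads = parse_offloads(listing)
print(offloads["large-receive-offload"])   # False
```

Feeding it the captured output for each slave makes it easy to confirm 
that all of them report the same offload state.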

Thanks,
Chris

> 	-J
> 
> >I have tried changing IRQ binding in a variety of ways (same CPU, same 
> >core, different cores, paired based on bond, irqbalance) and it hasn't 
> >helped.
> >
> >I have tried having one of bond1's GigEs be on a separate bus with a 
> >separate NIC, to no avail.
> >
> >Oprofiling (data below) does not reveal much time is being spent in the 
> >bonding driver.  bond_start_xmit() is the peak for the bonding driver, at 
> >less than 1% regardless of how many interfaces are bound.
> >
> >Does anyone have any tips on how I should try to narrow down this further?
> >
> >Thanks,
> >Chris
> >
> >---
> >
> >bond1 with just one 80003ES2LAN/82563 active:
> >
> >samples  %        image name               app name                 symbol name
> >114103   13.4161  vmlinux-2.6.32.7         vmlinux-2.6.32.7         ipt_do_table
> >24447     2.8745  e1000e.ko                e1000e.ko                e1000_xmit_frame
> >23687     2.7851  vmlinux-2.6.32.7         vmlinux-2.6.32.7         dev_queue_xmit
> >19088     2.2444  e1000e.ko                e1000e.ko                e1000_clean_tx_irq
> >18820     2.2128  vmlinux-2.6.32.7         vmlinux-2.6.32.7         skb_copy_bits
> >16028     1.8846  vmlinux-2.6.32.7         vmlinux-2.6.32.7         skb_segment
> >15013     1.7652  vmlinux-2.6.32.7         vmlinux-2.6.32.7         __slab_free
> >14187     1.6681  vmlinux-2.6.32.7         vmlinux-2.6.32.7         __slab_alloc
> >13649     1.6048  vmlinux-2.6.32.7         vmlinux-2.6.32.7         mwait_idle
> >13177     1.5493  e1000e.ko                e1000e.ko                e1000_irq_enable
> >13017     1.5305  bgpd                     bgpd                     bgp_process_announce_selected
> >12242     1.4394  vmlinux-2.6.32.7         vmlinux-2.6.32.7         __alloc_skb
> >11186     1.3152  vmlinux-2.6.32.7         vmlinux-2.6.32.7         ip_vs_in
> >11054     1.2997  vmlinux-2.6.32.7         vmlinux-2.6.32.7         find_vma
> >10861     1.2770  vmlinux-2.6.32.7         vmlinux-2.6.32.7         ip_rcv
> >10724     1.2609  vmlinux-2.6.32.7         vmlinux-2.6.32.7         nf_iterate
> >10659     1.2533  vmlinux-2.6.32.7         vmlinux-2.6.32.7         kmem_cache_alloc
> >
> >bond1 with a 80003ES2LAN/82563 and a 82546EB active:
> >
> >samples  %        image name               app name                 symbol name
> >36249    14.1261  vmlinux-2.6.32.7         vmlinux-2.6.32.7         ipt_do_table
> >5985      2.3323  vmlinux-2.6.32.7         vmlinux-2.6.32.7         skb_copy_bits
> >5731      2.2333  vmlinux-2.6.32.7         vmlinux-2.6.32.7         __slab_free
> >5496      2.1418  e1000.ko                 e1000.ko                 e1000_clean
> >5489      2.1390  vmlinux-2.6.32.7         vmlinux-2.6.32.7         dev_queue_xmit
> >5247      2.0447  vmlinux-2.6.32.7         vmlinux-2.6.32.7         mwait_idle
> >5090      1.9835  e1000e.ko                e1000e.ko                e1000_xmit_frame
> >5025      1.9582  e1000e.ko                e1000e.ko                e1000_irq_enable
> >4777      1.8616  vmlinux-2.6.32.7         vmlinux-2.6.32.7         __slab_alloc
> >4714      1.8370  e1000e.ko                e1000e.ko                e1000_clean_tx_irq
> >4102      1.5985  e1000.ko                 e1000.ko                 e1000_intr
> >4004      1.5603  vmlinux-2.6.32.7         vmlinux-2.6.32.7         skb_segment
> >3924      1.5292  e1000e.ko                e1000e.ko                e1000_intr_msi
> >3867      1.5070  vmlinux-2.6.32.7         vmlinux-2.6.32.7         __alloc_skb
> >3424      1.3343  e1000.ko                 e1000.ko                 e1000_xmit_frame
> >3225      1.2568  vmlinux-2.6.32.7         vmlinux-2.6.32.7         find_vma
> >3148      1.2268  vmlinux-2.6.32.7         vmlinux-2.6.32.7         kfree
> >
> >bond1 with 2x 80003ES2LAN/82563 active and a 82546EB active:
> >
> >samples  %        image name               app name                 symbol name
> >28124    14.5651  vmlinux-2.6.32.7         vmlinux-2.6.32.7         ipt_do_table
> >5725      2.9649  e1000e.ko                e1000e.ko                e1000_irq_enable
> >5077      2.6293  vmlinux-2.6.32.7         vmlinux-2.6.32.7         mwait_idle
> >4374      2.2652  vmlinux-2.6.32.7         vmlinux-2.6.32.7         skb_copy_bits
> >4277      2.2150  e1000e.ko                e1000e.ko                e1000_intr_msi
> >4224      2.1876  e1000e.ko                e1000e.ko                e1000_xmit_frame
> >3863      2.0006  vmlinux-2.6.32.7         vmlinux-2.6.32.7         __slab_free
> >3826      1.9814  e1000e.ko                e1000e.ko                e1000_clean_tx_irq
> >3682      1.9069  vmlinux-2.6.32.7         vmlinux-2.6.32.7         dev_queue_xmit
> >3512      1.8188  vmlinux-2.6.32.7         vmlinux-2.6.32.7         __slab_alloc
> >3191      1.6526  e1000.ko                 e1000.ko                 e1000_clean
> >3042      1.5754  e1000.ko                 e1000.ko                 e1000_intr
> >2540      1.3154  vmlinux-2.6.32.7         vmlinux-2.6.32.7         __alloc_skb
> >2425      1.2559  vmlinux-2.6.32.7         vmlinux-2.6.32.7         ip_rcv
> >2406      1.2460  vmlinux-2.6.32.7         vmlinux-2.6.32.7         skb_segment
> >2333      1.2082  vmlinux-2.6.32.7         vmlinux-2.6.32.7         nf_iterate
> >2329      1.2062  vmlinux-2.6.32.7         vmlinux-2.6.32.7         ip_vs_in