Message-ID: <4A93B34E.1040100@gmail.com>
Date: Tue, 25 Aug 2009 11:47:58 +0200
From: Eric Dumazet <eric.dumazet@...il.com>
To: Stephen Hemminger <shemminger@...tta.com>
CC: David Miller <davem@...emloft.net>, netdev@...r.kernel.org,
Robert Olsson <robert.olsson@....uu.se>
Subject: Re: Kernel forwarding performance test regressions
Stephen Hemminger wrote:
> Vyatta regularly runs RFC2544 performance tests as part of
> the QA release regression tests. These tests are run using
> a Spirent analyzer that sends packets at maximum rate and
> measures the number of packets received.
>
> The interesting (worst case) number is the forwarding percentage for
> minimum size Ethernet packets. For packets 1K and above all the packets
> get through but for smaller sizes the system can't keep up.
>
> The hardware is Dell based
> CPU is Intel Dual Core E2220 @ 2.40GHz (or 2.2GHz)
> NIC's are internal Broadcom (tg3).
>
> Size   2.6.23  2.6.24  2.6.26  2.6.29  2.6.30
> 64     14.%    20%     21%     17%     19%
> 128    22      33      34      28      32
> 256    37      52      58      49      54
> 512    67      85      83      85      85
> 1024   100     100     100     100     100
> 1280   100     100     100     100     100
> 1518   100     100     100     100     100
>
>
> Some other details:
> * Hardware change between 2.6.24 -> 2.6.26 numbers
> went from 2.2 to 2.4Ghz
>
> * no SMP affinity (or irqbalance) is done,
> numbers are significantly better if IRQ's are pinned.
> 2.6.26 goes from 20% to 32%
That's strange, because at gigabit flood level we should be in NAPI mode,
with ksoftirqd using 100% of one cpu. SMP affinities should not matter at all...
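
For reference, pinning an IRQ means writing a hexadecimal CPU bitmask to /proc/irq/<N>/smp_affinity. The sketch below only shows how such a mask is built. The IRQ numbers 24 and 25 and the helper names are hypothetical and are not taken from the Vyatta setup; the real IRQ numbers for the tg3 ports would have to be read from /proc/interrupts.

```python
def cpu_mask(cpus):
    """Return the hex bitmask string for an iterable of CPU indices.

    This sketch only handles CPUs 0-31, i.e. a single 32-bit mask word.
    """
    mask = 0
    for cpu in cpus:
        if not 0 <= cpu < 32:
            raise ValueError("this sketch only supports CPUs 0-31")
        mask |= 1 << cpu
    return format(mask, "x")


def affinity_path(irq):
    """Path of the procfs file that controls which CPUs may service an IRQ."""
    return f"/proc/irq/{irq}/smp_affinity"


def pin_irq(irq, cpus):
    """Write the mask to procfs (needs root, and the IRQ must exist)."""
    with open(affinity_path(irq), "w") as f:
        f.write(cpu_mask(cpus))


if __name__ == "__main__":
    # Hypothetical IRQ numbers, one NIC per core on a dual-core box.
    for irq, cpu in ((24, 0), (25, 1)):
        print(affinity_path(irq), "<-", cpu_mask([cpu]))
```

The script only prints what it would write; calling pin_irq() is left to the reader because it changes system state.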
>
> * unidirectional numbers are 2X the bidirectional numbers:
> 2.6.26 goes from 20% to 40%
>
> * this is single stream (doesn't help/use multiqueue)
>
> * system loads iptables but does not use it, so each packet
> sees the overhead of null rules.
>
> So kernel 2.6.29 had an observable dip in performance
> which seems to be mostly recovered in 2.6.30.
>
> These are from our QA, not me so please don't ask me for
> "please rerun with XX enabled", go run the same test
> yourself with pktgen.
>
Unfortunately I cannot reach line rate with pktgen and small packets.
(The limit is ~1012333pps 485Mb/sec on my test machine, 3GHz E5450 cpu)
It seems timestamping is too expensive in pktgen, even with "delay 0"
and a single-device setup (next_to_run() doesn't have to select the 'best' device).
We can probably improve pktgen a little bit, or use faster timestamping...
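
To put the pktgen limit in context, here is a small sketch of the arithmetic. It assumes gigabit Ethernet with the standard 20 bytes of per-frame wire overhead (8-byte preamble plus 12-byte inter-frame gap), and it assumes pktgen computes Mb/sec from a 60-byte packet size (the 64-byte frame minus the 4-byte CRC). Neither assumption is stated in this message.

```python
LINK_BPS = 1_000_000_000   # gigabit Ethernet (assumed)
WIRE_OVERHEAD = 20         # preamble (8) + inter-frame gap (12), in bytes


def line_rate_pps(frame_bytes, link_bps=LINK_BPS):
    """Maximum frames per second for a given frame size (CRC included)."""
    return link_bps / ((frame_bytes + WIRE_OVERHEAD) * 8)


def payload_mbps(pps, pkt_bytes):
    """Throughput in Mb/s when only pkt_bytes are counted per packet."""
    return pps * pkt_bytes * 8 / 1e6


if __name__ == "__main__":
    max_pps = line_rate_pps(64)
    measured = 1012333  # pktgen limit quoted in the message
    print(f"64-byte line rate : {max_pps:.0f} pps")
    print(f"pktgen reaches    : {100 * measured / max_pps:.1f}% of line rate")
    print(f"60 bytes/packet   : {payload_mbps(measured, 60):.1f} Mb/s")
```

Under these assumptions the 64-byte line rate is about 1.488 Mpps, so 1012333pps is roughly 68% of it, and 60-byte accounting gives about 485.9 Mb/s, which is consistent with the reported 485Mb/sec.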
oprofile results on pktgen machine (linux 2.6.30.5) :
CPU: Core 2, speed 3000.08 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %   symbol name
58137    58137         27.9549  27.9549  read_tsc
51487    109624        24.7573  52.7122  pktgen_thread_worker
33079    142703        15.9059  68.6181  getnstimeofday
15694    158397        7.5464   76.1645  getCurUs
11806    170203        5.6769   81.8413  do_gettimeofday
5852     176055        2.8139   84.6553  kthread_should_stop
5244     181299        2.5216   87.1768  kthread
4181     185480        2.0104   89.1872  mwait_idle
3837     189317        1.8450   91.0322  consume_skb
2217     191534        1.0660   92.0983  skb_dma_unmap
1599     193133        0.7689   92.8671  skb_dma_map
1389     194522        0.6679   93.5350  local_bh_enable_ip
1350     195872        0.6491   94.1842  nommu_map_page
1086     196958        0.5222   94.7064  mix_pool_bytes_extract
835      197793        0.4015   95.1079  apic_timer_interrupt
774      198567        0.3722   95.4801  irq_entries_start
450      199017        0.2164   95.6964  timer_stats_update_stats
404      199421        0.1943   95.8907  scheduler_tick
403      199824        0.1938   96.0845  find_busiest_group
336      200160        0.1616   96.2460  local_bh_disable
332      200492        0.1596   96.4057  rb_get_reader_page
329      200821        0.1582   96.5639  ring_buffer_consume
267      201088        0.1284   96.6923  add_timer_randomness
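
The timestamping cost can be read off the profile by adding up the relevant symbols. The sketch below does that for the top five entries. Treating read_tsc, getnstimeofday, getCurUs and do_gettimeofday as the timestamping path is an assumption based on their names and on the remark about timestamping above; the profile itself does not group them.

```python
# (symbol, percent of samples) for the top entries of the profile above
PROFILE = [
    ("read_tsc", 27.9549),
    ("pktgen_thread_worker", 24.7573),
    ("getnstimeofday", 15.9059),
    ("getCurUs", 7.5464),
    ("do_gettimeofday", 5.6769),
]

# Assumption: these four symbols make up the timestamping path in pktgen.
TIMESTAMP_SYMBOLS = {
    "read_tsc",
    "getnstimeofday",
    "getCurUs",
    "do_gettimeofday",
}


def share(profile, symbols):
    """Sum the sample percentages of the given symbols."""
    return sum(pct for name, pct in profile if name in symbols)


if __name__ == "__main__":
    print(f"timestamping share: {share(PROFILE, TIMESTAMP_SYMBOLS):.2f}%")
```

With that grouping, a little over 57% of the sampled cycles go to reading the clock, more than twice the share of pktgen_thread_worker itself.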
I see 0.1% drops at around 635085pps 284Mb/sec on my dev machine
(using vlan and bonding, bi-directional, output device = input device)
Some notes:
- Small packets hit the copybreak (mis)feature (that tg3 and other drivers use),
  and we know this slows down forwarding. There is no real difference for small
  packets anyway, since we need to read the packet to process it (one cache line).
- neigh_resolve_output() has a cost because of the atomic ops in
  read_lock_bh(&neigh->lock)/read_unlock_bh(&neigh->lock).
  This might be a candidate for RCU conversion?
- ip_rt_send_redirect() is quite expensive, even if send_redirect is set to 0, because
  of in_dev_get()/in_dev_put() (two atomic ops that could be avoided: I submitted a patch).
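
To see why a handful of atomic operations per packet is worth chasing, it helps to look at the per-packet cycle budget. This is a rough sketch, assuming a single 2.4GHz core (the E2220 clock from Stephen's report) would have to handle 64-byte frames arriving at gigabit line rate in one direction. The cost of an individual atomic op is not given in this thread, so none is assumed here.

```python
def line_rate_pps(frame_bytes, link_bps=1_000_000_000):
    """Max frames/s on the wire; 20 bytes = preamble + inter-frame gap."""
    return link_bps / ((frame_bytes + 20) * 8)


def cycles_per_packet(cpu_hz, pps):
    """CPU cycles available per packet if one core handles every packet."""
    return cpu_hz / pps


if __name__ == "__main__":
    pps = line_rate_pps(64)
    for ghz in (2.2, 2.4):  # the two clock speeds mentioned in the report
        budget = cycles_per_packet(ghz * 1e9, pps)
        print(f"{ghz} GHz: {budget:.0f} cycles per 64-byte packet")
```

At 2.4GHz this works out to roughly 1600 cycles per packet for the whole receive, route and transmit path, so every avoidable locked operation takes a visible bite out of the budget.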