Date: Mon, 2 Feb 2009 13:22:12 -0500
From: Neil Horman <nhorman@...driver.com>
To: Eric Dumazet <dada1@...mosbay.com>
Cc: Kenny Chang <kchang@...enacr.com>, netdev@...r.kernel.org
Subject: Re: Multicast packet loss
On Mon, Feb 02, 2009 at 05:57:24PM +0100, Eric Dumazet wrote:
> Neil Horman wrote:
> > On Sun, Feb 01, 2009 at 01:40:39PM +0100, Eric Dumazet wrote:
> >> Eric Dumazet wrote:
> >>> Kenny Chang wrote:
> >>>> Ah, sorry, here's the test program attached.
> >>>>
> >>>> We've tried 2.6.28.1, but no, we haven't tried 2.6.28.2 or the
> >>>> 2.6.29-rcX kernels.
> >>>>
> >>>> Right now, we are trying to step through the kernel versions until we
> >>>> see where the performance drops significantly. We'll try 2.6.29-rc soon
> >>>> and post the result.
> >> I tried your program on my dev machines and 2.6.29 (each machine: two quad-core cpus, 32-bit kernel)
> >>
> >> With 8 clients, about 10% packet loss,
> >>
> >> Might be a scheduling problem, not sure... 50,000 packets per second x 8 cpus = 400,000
> >> wakeups per second... But at least the UDP receive path seems OK.
> >>
> >> Thing is, the receiver (the softirq that queues the packet) seems to fight over the socket lock with
> >> the readers...
> >>
> >> I tried to set up IRQ affinities, but it doesn't work any more on bnx2 (unless using msi_disable=1)
> >>
> >> I tried playing with ethtool -C|c G|g params...
> >> And /proc/sys/net/core/rmem_max (and setsockopt(SO_RCVBUF) to set bigger receive buffers in your program)
> >>
> >> I can have 0% packet loss if booting with msi_disable and
> >>
> >> echo 1 >/proc/irq/16/smp_affinity
> >>
> >> (16 being interrupt of eth0 NIC)
> >>
> >> then, a second run gave me errors, about 2%, oh well...
> >>
> >>
> >> oprofile numbers without playing with IRQ affinities:
> >>
> >> CPU: Core 2, speed 2999.89 MHz (estimated)
> >> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> >> samples % symbol name
> >> 327928 10.1427 schedule
> >> 259625 8.0301 mwait_idle
> >> 187337 5.7943 __skb_recv_datagram
> >> 109854 3.3977 lock_sock_nested
> >> 104713 3.2387 tick_nohz_stop_sched_tick
> >> 98831 3.0568 select_nohz_load_balancer
> >> 88163 2.7268 skb_release_data
> >> 78552 2.4296 update_curr
> >> 75241 2.3272 getnstimeofday
> >> 71400 2.2084 set_next_entity
> >> 67629 2.0917 get_next_timer_interrupt
> >> 67375 2.0839 sched_clock_tick
> >> 58112 1.7974 enqueue_entity
> >> 56462 1.7463 udp_recvmsg
> >> 55049 1.7026 copy_to_user
> >> 54277 1.6788 sched_clock_cpu
> >> 54031 1.6712 __copy_skb_header
> >> 51859 1.6040 __slab_free
> >> 51786 1.6017 prepare_to_wait_exclusive
> >> 51776 1.6014 sock_def_readable
> >> 50062 1.5484 try_to_wake_up
> >> 42182 1.3047 __switch_to
> >> 41631 1.2876 read_tsc
> >> 38337 1.1857 tick_nohz_restart_sched_tick
> >> 34358 1.0627 cpu_idle
> >> 34194 1.0576 native_sched_clock
> >> 33812 1.0458 pick_next_task_fair
> >> 33685 1.0419 resched_task
> >> 33340 1.0312 sys_recvfrom
> >> 33287 1.0296 dst_release
> >> 32439 1.0033 kmem_cache_free
> >> 32131 0.9938 hrtimer_start_range_ns
> >> 29807 0.9219 udp_queue_rcv_skb
> >> 27815 0.8603 task_rq_lock
> >> 26875 0.8312 __update_sched_clock
> >> 23912 0.7396 sock_queue_rcv_skb
> >> 21583 0.6676 __wake_up_sync
> >> 21001 0.6496 effective_load
> >> 20531 0.6350 hrtick_start_fair
> >>
> >>
> >>
> >>
> >> With IRQ affinities and msi_disable (no packet drops)
> >>
> >> CPU: Core 2, speed 3000.13 MHz (estimated)
> >> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> >> samples % symbol name
> >> 79788 10.3815 schedule
> >> 69422 9.0328 mwait_idle
> >> 44877 5.8391 __skb_recv_datagram
> >> 28629 3.7250 tick_nohz_stop_sched_tick
> >> 27252 3.5459 select_nohz_load_balancer
> >> 24320 3.1644 lock_sock_nested
> >> 20833 2.7107 getnstimeofday
> >> 20666 2.6889 skb_release_data
> >> 18612 2.4217 set_next_entity
> >> 17785 2.3141 get_next_timer_interrupt
> >> 17691 2.3018 udp_recvmsg
> >> 17271 2.2472 sched_clock_tick
> >> 16032 2.0860 copy_to_user
> >> 14785 1.9237 update_curr
> >> 12512 1.6280 prepare_to_wait_exclusive
> >> 12498 1.6262 __slab_free
> >> 11380 1.4807 read_tsc
> >> 11145 1.4501 sched_clock_cpu
> >> 10598 1.3789 __switch_to
> >> 9588 1.2475 pick_next_task_fair
> >> 9480 1.2335 cpu_idle
> >> 9218 1.1994 sys_recvfrom
> >> 9008 1.1721 tick_nohz_restart_sched_tick
> >> 8977 1.1680 dst_release
> >> 8930 1.1619 native_sched_clock
> >> 8392 1.0919 kmem_cache_free
> >> 8124 1.0570 hrtimer_start_range_ns
> >> 7274 0.9464 bnx2_interrupt
> >> 7175 0.9336 __copy_skb_header
> >> 7006 0.9116 try_to_wake_up
> >> 6949 0.9042 sock_def_readable
> >> 6787 0.8831 enqueue_entity
> >> 6772 0.8811 __update_sched_clock
> >> 6349 0.8261 finish_task_switch
> >> 6164 0.8020 copy_from_user
> >> 5096 0.6631 resched_task
> >> 5007 0.6515 sysenter_past_esp
> >>
> >>
> >> I will try to investigate a little bit more in the following days if time permits.
> >>
> > I'm not 100% versed on this, but IIRC, some hardware simply can't set irq
> > affinity when operating in msi interrupt mode. If that's the case with this
> > particular bnx2 card, then I would expect some packet loss, simply due to the
> > constant cache misses. It would be interesting to re-run your oprofile cases,
> > counting L2 cache hits/misses (if your cpu supports that class of counter) for
> > bnx2 running in both msi enabled and msi disabled mode. It would also be
> > interesting to use a different card that can set irq affinity, and compare loss
> > with irqbalance on, and with irqbalance off and irq affinity set to all cpus.
>
> booted with msi_disable=1, IRQ of eth0 handled by CPU0 only, so that the
> oprofile results are sorted on CPU0 numbers.
>
> We can see the scheduler has a hard time coping with this workload on more than two CPUs.
>
> OK up to 30,000 (* 8 sockets) packets per second.
>
> CPU0 is 100% handling softirq (ksoftirqd/0)
>
This explains a lot. If the application is scheduled to run on the same cpu that
has the irq for the NIC bound to it, you get a perf boost by not having to warm
up two caches (one for the app cpu and one for the irq & softirq work), but you
lose it and then some by fighting for cpu time. If both the app and the irq are
on the same cpu, and we spend so much time in softirq context, we will
eventually overflow higher up the network stack, as the application doesn't have
enough time to dequeue frames.
It may also speak to the need to make the bnx2 napi routine more efficient :)
Neil
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html