Date: Mon, 2 Feb 2009 13:22:12 -0500
From: Neil Horman <nhorman@...driver.com>
To: Eric Dumazet <dada1@...mosbay.com>
Cc: Kenny Chang <kchang@...enacr.com>, netdev@...r.kernel.org
Subject: Re: Multicast packet loss
On Mon, Feb 02, 2009 at 05:57:24PM +0100, Eric Dumazet wrote:
> Neil Horman wrote:
> > On Sun, Feb 01, 2009 at 01:40:39PM +0100, Eric Dumazet wrote:
> >> Eric Dumazet wrote:
> >>> Kenny Chang wrote:
> >>>> Ah, sorry, here's the test program attached.
> >>>>
> >>>> We've tried 2.6.28.1, but no, we haven't tried 2.6.28.2 or the
> >>>> 2.6.29-rcX kernels.
> >>>>
> >>>> Right now, we are trying to step through the kernel versions until we
> >>>> see where the performance drops significantly. We'll try 2.6.29-rc soon
> >>>> and post the result.
> >> I tried your program on my dev machines and 2.6.29 (each machine: two quad-core cpus, 32-bit kernel)
> >>
> >> With 8 clients, about 10% packet loss,
> >>
> >> Might be a scheduling problem, not sure... 50,000 packets per second x 8 cpus = 400,000
> >> wakeups per second... But at least the UDP receive path seems OK.
> >>
> >> Thing is, the receiver (the softirq that queues the packet) seems to fight over the socket lock with
> >> the readers...
> >>
> >> I tried to set up IRQ affinities, but it doesn't work any more on bnx2 (unless using msi_disable=1)
> >>
> >> I tried playing with ethtool -C|c G|g params...
> >> And /proc/sys/net/core/rmem_max (and setsockopt(SO_RCVBUF) to set bigger receive buffers in your program)
> >>
> >> I can have 0% packet loss if booting with msi_disable and
> >>
> >> echo 1 >/proc/irq/16/smp_affinity
> >>
> >> (16 being interrupt of eth0 NIC)
> >>
> >> then, a second run gave me errors, about 2%, oh well...
> >>
> >>
> >> oprofile numbers without playing with IRQ affinities:
> >>
> >> CPU: Core 2, speed 2999.89 MHz (estimated)
> >> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> >> samples % symbol name
> >> 327928 10.1427 schedule
> >> 259625 8.0301 mwait_idle
> >> 187337 5.7943 __skb_recv_datagram
> >> 109854 3.3977 lock_sock_nested
> >> 104713 3.2387 tick_nohz_stop_sched_tick
> >> 98831 3.0568 select_nohz_load_balancer
> >> 88163 2.7268 skb_release_data
> >> 78552 2.4296 update_curr
> >> 75241 2.3272 getnstimeofday
> >> 71400 2.2084 set_next_entity
> >> 67629 2.0917 get_next_timer_interrupt
> >> 67375 2.0839 sched_clock_tick
> >> 58112 1.7974 enqueue_entity
> >> 56462 1.7463 udp_recvmsg
> >> 55049 1.7026 copy_to_user
> >> 54277 1.6788 sched_clock_cpu
> >> 54031 1.6712 __copy_skb_header
> >> 51859 1.6040 __slab_free
> >> 51786 1.6017 prepare_to_wait_exclusive
> >> 51776 1.6014 sock_def_readable
> >> 50062 1.5484 try_to_wake_up
> >> 42182 1.3047 __switch_to
> >> 41631 1.2876 read_tsc
> >> 38337 1.1857 tick_nohz_restart_sched_tick
> >> 34358 1.0627 cpu_idle
> >> 34194 1.0576 native_sched_clock
> >> 33812 1.0458 pick_next_task_fair
> >> 33685 1.0419 resched_task
> >> 33340 1.0312 sys_recvfrom
> >> 33287 1.0296 dst_release
> >> 32439 1.0033 kmem_cache_free
> >> 32131 0.9938 hrtimer_start_range_ns
> >> 29807 0.9219 udp_queue_rcv_skb
> >> 27815 0.8603 task_rq_lock
> >> 26875 0.8312 __update_sched_clock
> >> 23912 0.7396 sock_queue_rcv_skb
> >> 21583 0.6676 __wake_up_sync
> >> 21001 0.6496 effective_load
> >> 20531 0.6350 hrtick_start_fair
> >>
> >>
> >>
> >>
> >> With IRQ affinities and msi_disable (no packet drops)
> >>
> >> CPU: Core 2, speed 3000.13 MHz (estimated)
> >> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
> >> samples % symbol name
> >> 79788 10.3815 schedule
> >> 69422 9.0328 mwait_idle
> >> 44877 5.8391 __skb_recv_datagram
> >> 28629 3.7250 tick_nohz_stop_sched_tick
> >> 27252 3.5459 select_nohz_load_balancer
> >> 24320 3.1644 lock_sock_nested
> >> 20833 2.7107 getnstimeofday
> >> 20666 2.6889 skb_release_data
> >> 18612 2.4217 set_next_entity
> >> 17785 2.3141 get_next_timer_interrupt
> >> 17691 2.3018 udp_recvmsg
> >> 17271 2.2472 sched_clock_tick
> >> 16032 2.0860 copy_to_user
> >> 14785 1.9237 update_curr
> >> 12512 1.6280 prepare_to_wait_exclusive
> >> 12498 1.6262 __slab_free
> >> 11380 1.4807 read_tsc
> >> 11145 1.4501 sched_clock_cpu
> >> 10598 1.3789 __switch_to
> >> 9588 1.2475 pick_next_task_fair
> >> 9480 1.2335 cpu_idle
> >> 9218 1.1994 sys_recvfrom
> >> 9008 1.1721 tick_nohz_restart_sched_tick
> >> 8977 1.1680 dst_release
> >> 8930 1.1619 native_sched_clock
> >> 8392 1.0919 kmem_cache_free
> >> 8124 1.0570 hrtimer_start_range_ns
> >> 7274 0.9464 bnx2_interrupt
> >> 7175 0.9336 __copy_skb_header
> >> 7006 0.9116 try_to_wake_up
> >> 6949 0.9042 sock_def_readable
> >> 6787 0.8831 enqueue_entity
> >> 6772 0.8811 __update_sched_clock
> >> 6349 0.8261 finish_task_switch
> >> 6164 0.8020 copy_from_user
> >> 5096 0.6631 resched_task
> >> 5007 0.6515 sysenter_past_esp
> >>
> >>
> >> I will try to investigate a little bit more in the following days if time permits.
> >>
> > I'm not 100% versed on this, but IIRC, some hardware simply can't set irq
> > affinity when operating in msi interrupt mode. If that's the case with this
> > particular bnx2 card, then I would expect some packet loss, simply due to the
> > constant cache misses. It would be interesting to re-run your oprofile cases,
> > counting L2 cache hits/misses (if your cpu supports that class of counter) for
> > bnx2 running in both msi enabled and msi disabled mode. It would also be
> > interesting to use a different card that can set irq affinity, and compare loss
> > with irqbalance on, and with irqbalance off and irq affinity set to all cpus.
>
> booted with msi_disable=1, IRQ of eth0 handled by CPU0 only, so that the
> oprofile results are sorted on CPU0 numbers.
>
> We can see the scheduler has a hard time coping with this workload on more than two CPUs.
>
> OK up to 30,000 (* 8 sockets) packets per second.
>
> CPU0 is 100% handling softirq (ksoftirqd/0)
>
This explains a lot. If the application is scheduled to run on the same cpu that
has the irq for the NIC bound to it, you get a perf boost by not having to warm
up two caches (one for the app cpu and one for the irq & softirq work), but you
lose it and then some by fighting for cpu time. If both the app and the irq are
on the same cpu, and we spend so much time in softirq context, we will
eventually overflow higher up the network stack, as the application doesn't have
enough time to dequeue frames.
It may also speak to the need to make the bnx2 napi routine more efficient :)
Neil
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html