Message-ID: <1290102113.2781.237.camel@edumazet-laptop>
Date: Thu, 18 Nov 2010 18:41:53 +0100
From: Eric Dumazet <eric.dumazet@...il.com>
To: Jesper Dangaard Brouer <jdb@...x.dk>
Cc: netdev <netdev@...r.kernel.org>, David Miller <davem@...emloft.net>
Subject: Re: Loopback performance from kernel 2.6.12 to 2.6.37
On Thursday 18 November 2010 at 14:52 +0100, Eric Dumazet wrote:
> On Tuesday 09 November 2010 at 15:25 +0100, Eric Dumazet wrote:
>
> > So far, so good. These are the expected numbers. Now we have to
> > understand why corei7 gets 38 seconds instead of 8 :)
> >
> >
>
> My tests show a problem with backlog processing and too-big TCP
> windows (at least on loopback, with wild senders).
>
> Basically, with the huge TCP windows we have now (4 Mbytes by default),
> the reader process may have to process up to 4 Mbytes of backlogged data
> in __release_sock() before returning from the 'small' read(fd, buffer,
> 1024) issued by netcat.
>
> While it processes this backlog, it sends TCP ACKs, allowing the sender
> to send new frames that either get dropped because of sk_rcvqueues_full(),
> or keep filling the receive queue up to the receiver window, feeding the
> task looping in __release_sock().
>
>
> This blows CPU caches completely [data is queued, and the dequeue is
> done long after], and the latency of a single read() can be very high.
> Eventually this stalls the pipeline of user processing.
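
[ For reference, here is a condensed sketch of the lock-owner/backlog
  pattern described above. This is not verbatim kernel source: the
  function names (tcp_v4_do_rcv(), sk_add_backlog(), sk_rcvqueues_full(),
  __release_sock()) are the real ones, the surrounding code is heavily
  simplified. ]

        /* softirq input path, tcp_v4_rcv() */
        bh_lock_sock(sk);
        if (!sock_owned_by_user(sk)) {
                tcp_v4_do_rcv(sk, skb);         /* process right now */
        } else if (sk_add_backlog(sk, skb)) {
                /* sk_rcvqueues_full() refused it -> TCPBacklogDrop */
                kfree_skb(skb);
        }
        bh_unlock_sock(sk);

        /* user context, end of read()/recvmsg(), via release_sock() */
        while ((skb = sk->sk_backlog.head) != NULL) {
                sk->sk_backlog.head = skb->next; /* shorthand for the real
                                                  * splice-and-walk loop in
                                                  * __release_sock() */
                sk_backlog_rcv(sk, skb);         /* -> tcp_v4_do_rcv(): queues
                                                  * the data and sends ACKs,
                                                  * letting the sender push
                                                  * even more while we are
                                                  * still stuck here */
        }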
>
>
> <digress>
> I also understand why UDP latencies are so badly impacted. If we receive
> a burst of frames on the same socket, the user process reading the first
> frame might be forced to process the whole backlog before returning to
> userland.
>
> We really must zap lock_sock() from the UDP input path.
>
> commit 95766fff6b9a78d1 ([UDP]: Add memory accounting) was a big error.
> </digress>
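
[ A quick way to look for this effect (my own throwaway sketch, not a
  tool from this thread; the burst size, datagram size and 1 second
  timeout are arbitrary): one task blasts a burst of datagrams at a
  loopback UDP socket while the other times every individual recvfrom().
  A recvfrom() that has to drain a large backlog before returning shows
  up as an outlier. ]

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define BURST   10000   /* datagrams per burst (arbitrary) */
#define DGRAM   1024    /* datagram size (arbitrary) */

static long long nsecs(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
        struct sockaddr_in addr;
        socklen_t alen = sizeof(addr);
        struct timeval tmo = { .tv_sec = 1 };
        static char buf[DGRAM];
        int rx = socket(AF_INET, SOCK_DGRAM, 0);
        int tx = socket(AF_INET, SOCK_DGRAM, 0);
        long long worst = 0;
        int i;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        bind(rx, (struct sockaddr *)&addr, sizeof(addr));
        getsockname(rx, (struct sockaddr *)&addr, &alen);
        /* drops are expected with a big burst, do not hang forever */
        setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tmo, sizeof(tmo));

        if (fork() == 0) {              /* sender: back to back burst */
                for (i = 0; i < BURST; i++)
                        sendto(tx, buf, DGRAM, 0,
                               (struct sockaddr *)&addr, sizeof(addr));
                exit(0);
        }
        /* first datagram untimed: it includes sender startup latency */
        recvfrom(rx, buf, DGRAM, 0, NULL, NULL);
        for (i = 1; i < BURST; i++) {   /* time each recvfrom() */
                long long t0 = nsecs(), dt;

                if (recvfrom(rx, buf, DGRAM, 0, NULL, NULL) < 0)
                        break;          /* timeout: rest was dropped */
                dt = nsecs() - t0;
                if (dt > worst)
                        worst = dt;
        }
        printf("worst recvfrom(): %lld ns\n", worst);
        wait(NULL);
        return 0;
}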
>
>
>
> On my server machine with 6 Mbytes of L2 cache, you don't see the
> problem, while on my laptop with 3 Mbytes of L2 cache, you can.
>
> I caught this thanks to the new SNMP counter added in 2.6.34
> (TCPBacklogDrop), which can easily take 1000 increments during the
> test.
>
>
> I built a test program, maybe easier to use than the various netcat
> flavors. It also uses only two tasks, which is better if you have a
> Core 2 Duo like the one in my laptop ;)
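
[ Since the test program itself is not in this mail, here is a minimal
  sketch of the idea: one writer task and one reader task over a loopback
  TCP socket, with -l giving the length of the buffer passed to each
  write()/read(). This is my reconstruction, not the actual
  loopback_transfert source; the 2 Gbyte total, the 16k default and the
  absence of error handling are mine. ]

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static size_t parse_len(const char *s)  /* accepts 1k, 128k, 1M, 4M ... */
{
        char *end;
        size_t v = strtoul(s, &end, 10);

        if (*end == 'k' || *end == 'K')
                v <<= 10;
        else if (*end == 'm' || *end == 'M')
                v <<= 20;
        return v;
}

int main(int argc, char **argv)
{
        size_t len = 16 * 1024;                 /* default: -l 16k */
        unsigned long long total = 2ULL << 30;  /* 2 Gbytes, arbitrary */
        struct sockaddr_in addr;
        socklen_t alen = sizeof(addr);
        int opt, lsock, sock;
        char *buf;

        while ((opt = getopt(argc, argv, "l:")) != -1)
                if (opt == 'l')
                        len = parse_len(optarg);

        buf = calloc(1, len);
        lsock = socket(AF_INET, SOCK_STREAM, 0);
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        bind(lsock, (struct sockaddr *)&addr, sizeof(addr));
        getsockname(lsock, (struct sockaddr *)&addr, &alen);
        listen(lsock, 1);

        if (fork() == 0) {                      /* writer task */
                unsigned long long left = total;

                sock = socket(AF_INET, SOCK_STREAM, 0);
                connect(sock, (struct sockaddr *)&addr, sizeof(addr));
                while (left > 0) {
                        ssize_t w = write(sock, buf, left < len ? left : len);

                        if (w <= 0)
                                break;
                        left -= w;
                }
                close(sock);
                exit(0);
        }
        /* reader task: consume everything in len-sized read() calls */
        sock = accept(lsock, NULL, NULL);
        while (read(sock, buf, len) > 0)
                ;
        close(sock);
        wait(NULL);
        return 0;
}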
>
> To reproduce the problem, run it with option -l 4M
>
> $ netstat -s|grep TCPBacklogDrop
> TCPBacklogDrop: 788
> $ time ./loopback_transfert -l 1k;netstat -s|grep TCPBacklogDrop
>
> real 0m14.013s
> user 0m0.630s
> sys 0m13.250s
> TCPBacklogDrop: 788
> $ time ./loopback_transfert -l 128k;netstat -s|grep TCPBacklogDrop
>
> real 0m7.447s
> user 0m0.030s
> sys 0m5.490s
> TCPBacklogDrop: 789
> $ time ./loopback_transfert -l 1M;netstat -s|grep TCPBacklogDrop
>
> real 0m11.206s
> user 0m0.020s
> sys 0m7.150s
> TCPBacklogDrop: 793
> $ time ./loopback_transfert -l 4M;netstat -s|grep TCPBacklogDrop
>
> real 0m10.347s
> user 0m0.000s
> sys 0m6.120s
> TCPBacklogDrop: 1510
> $ time ./loopback_transfert -l 16k;netstat -s|grep TCPBacklogDrop
>
> real 0m6.810s
> user 0m0.040s
> sys 0m6.670s
> TCPBacklogDrop: 1511
>
I forgot to include test results from my dev machine: a server class
machine with 8 Mbytes of L2 cache, NUMA, 2 sockets of
Intel(R) Xeon(R) CPU E5540 @ 2.53GHz.
# netstat -s|grep TCPBacklogDrop
TCPBacklogDrop: 8891
# time ./loopback_transfert -l 16k;netstat -s|grep TCPBacklogDrop
real 0m7.033s
user 0m0.010s
sys 0m4.580s
TCPBacklogDrop: 9239
# time ./loopback_transfert -l 1M;netstat -s|grep TCPBacklogDrop
real 0m5.408s
user 0m0.000s
sys 0m2.880s
TCPBacklogDrop: 9243
# time ./loopback_transfert -l 4M;netstat -s|grep TCPBacklogDrop
real 0m2.965s
user 0m0.000s
sys 0m2.070s
TCPBacklogDrop: 10485
# time ./loopback_transfert -l 6M;netstat -s|grep TCPBacklogDrop
real 0m7.711s
user 0m0.000s
sys 0m3.180s
TCPBacklogDrop: 13537
# time ./loopback_transfert -l 8M;netstat -s|grep TCPBacklogDrop
real 0m11.497s
user 0m0.020s
sys 0m3.830s
TCPBacklogDrop: 17108
As soon as our working set is larger than the L2 cache, this is very slow.
For the -l 8M bench:
# Events: 7K cycles
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ...................................
#
40.97% loopback_transf [kernel.kallsyms] [k] copy_user_generic_string
18.57% :8968 [kernel.kallsyms] [k] copy_user_generic_string
3.54% :8968 [kernel.kallsyms] [k] get_page_from_freelist
3.36% :8968 [kernel.kallsyms] [k] tcp_sendmsg
1.17% :8968 [kernel.kallsyms] [k] put_page
0.99% :8968 [kernel.kallsyms] [k] free_hot_cold_page
0.99% :8968 [kernel.kallsyms] [k] __might_sleep
0.88% :8968 [kernel.kallsyms] [k] __ticket_spin_lock
0.81% loopback_transf [kernel.kallsyms] [k] free_pcppages_bulk
0.79% :8968 [kernel.kallsyms] [k] __alloc_pages_nodemask
0.63% loopback_transf [kernel.kallsyms] [k] put_page
0.63% loopback_transf [kernel.kallsyms] [k] __might_sleep
0.63% loopback_transf [kernel.kallsyms] [k] tcp_transmit_skb
0.57% :8968 [kernel.kallsyms] [k] skb_release_data
0.55% loopback_transf [kernel.kallsyms] [k] free_hot_cold_page
0.53% :8968 [kernel.kallsyms] [k] tcp_ack
0.50% loopback_transf [kernel.kallsyms] [k] __inet_lookup_established
0.49% loopback_transf [kernel.kallsyms] [k] skb_copy_datagram_iovec
0.47% :8968 [kernel.kallsyms] [k] __rmqueue
0.45% :8968 [kernel.kallsyms] [k] get_pageblock_flags_group
0.41% :8968 [kernel.kallsyms] [k] zone_watermark_ok
0.41% :8968 [kernel.kallsyms] [k] __inc_zone_state
0.40% loopback_transf [kernel.kallsyms] [k] skb_release_data
0.39% :8968 [kernel.kallsyms] [k] tcp_transmit_skb
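
[ Rough working-set arithmetic for this run, assuming -l is the length of
  the buffer handed to each write()/read():

        sender buffer touched by each write()    ~ 8 Mbytes
        skb data queued between the two tasks    up to the 4 Mbyte default window
        receiver buffer touched by each read()   ~ 8 Mbytes

  so the bytes live between copy-in and copy-out are well above the
  8 Mbytes of L2, both copy passes (tcp_sendmsg() on one side,
  skb_copy_datagram_iovec() on the other) stream mostly from DRAM, and
  copy_user_generic_string ends up with ~59% of the cycles above
  (40.97% + 18.57%). ]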