Message-ID: <1290102113.2781.237.camel@edumazet-laptop>
Date: Thu, 18 Nov 2010 18:41:53 +0100
From: Eric Dumazet <eric.dumazet@...il.com>
To: Jesper Dangaard Brouer <jdb@...x.dk>
Cc: netdev <netdev@...r.kernel.org>, David Miller <davem@...emloft.net>
Subject: Re: Loopback performance from kernel 2.6.12 to 2.6.37
On Thursday 18 November 2010 at 14:52 +0100, Eric Dumazet wrote:
> On Tuesday 09 November 2010 at 15:25 +0100, Eric Dumazet wrote:
>
> > So far, so good. These are the expected numbers. Now we have to
> > understand why corei7 gets 38 seconds instead of 8 :)
> >
> >
>
> My tests show a problem with backlog processing and too-big TCP
> windows (at least on loopback, with wild senders).
>
> Basically, with the huge TCP windows we have now (4 Mbytes by default),
> the reader process may have to process up to 4 Mbytes of backlogged data
> in __release_sock() before returning from the 'small' read(fd, buffer,
> 1024) issued by netcat.
>
> While it processes this backlog, it sends TCP ACKs, allowing the sender
> to send new frames that either get dropped because of sk_rcvqueues_full(),
> or keep filling the receive queue up to the receiver window, feeding the
> task looping in __release_sock().
>
>
> This blows CPU caches completely [data is queued, and the dequeue is
> done long after], and the latency of a single read() can be very high.
> Eventually this stalls the pipeline of user processing.
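
[ For reference, here is a condensed sketch of the lock-owner/backlog
  pattern described above. This is not verbatim kernel source: the
  function names (tcp_v4_do_rcv(), sk_add_backlog(), sk_rcvqueues_full(),
  __release_sock()) are the real ones, the surrounding code is heavily
  simplified. ]

        /* softirq input path, tcp_v4_rcv() */
        bh_lock_sock(sk);
        if (!sock_owned_by_user(sk)) {
                tcp_v4_do_rcv(sk, skb);         /* process right now */
        } else if (sk_add_backlog(sk, skb)) {
                /* sk_rcvqueues_full() refused it -> TCPBacklogDrop */
                kfree_skb(skb);
        }
        bh_unlock_sock(sk);

        /* user context, end of read()/recvmsg(), via release_sock() */
        while ((skb = sk->sk_backlog.head) != NULL) {
                sk->sk_backlog.head = skb->next; /* shorthand for the real
                                                  * splice-and-walk loop in
                                                  * __release_sock() */
                sk_backlog_rcv(sk, skb);         /* -> tcp_v4_do_rcv(): queues
                                                  * the data and sends ACKs,
                                                  * letting the sender push
                                                  * even more while we are
                                                  * still stuck here */
        }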
>
>
> <digress>
> I also understand why UDP latencies are so badly impacted. If we receive
> a burst of frames on the same socket, the user process reading the first
> frame might be forced to process the whole backlog before returning to
> userland.
>
> We really must zap lock_sock() from the UDP input path.
>
> commit 95766fff6b9a78d1 ([UDP]: Add memory accounting) was a big error.
> </digress>
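
[ A quick way to look for this effect (my own throwaway sketch, not a
  tool from this thread; the burst size, datagram size and 1 second
  timeout are arbitrary): one task blasts a burst of datagrams at a
  loopback UDP socket while the other times every individual recvfrom().
  A recvfrom() that has to drain a large backlog before returning shows
  up as an outlier. ]

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define BURST   10000   /* datagrams per burst (arbitrary) */
#define DGRAM   1024    /* datagram size (arbitrary) */

static long long nsecs(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

int main(void)
{
        struct sockaddr_in addr;
        socklen_t alen = sizeof(addr);
        struct timeval tmo = { .tv_sec = 1 };
        static char buf[DGRAM];
        int rx = socket(AF_INET, SOCK_DGRAM, 0);
        int tx = socket(AF_INET, SOCK_DGRAM, 0);
        long long worst = 0;
        int i;

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        bind(rx, (struct sockaddr *)&addr, sizeof(addr));
        getsockname(rx, (struct sockaddr *)&addr, &alen);
        /* drops are expected with a big burst, do not hang forever */
        setsockopt(rx, SOL_SOCKET, SO_RCVTIMEO, &tmo, sizeof(tmo));

        if (fork() == 0) {              /* sender: back to back burst */
                for (i = 0; i < BURST; i++)
                        sendto(tx, buf, DGRAM, 0,
                               (struct sockaddr *)&addr, sizeof(addr));
                exit(0);
        }
        /* first datagram untimed: it includes sender startup latency */
        recvfrom(rx, buf, DGRAM, 0, NULL, NULL);
        for (i = 1; i < BURST; i++) {   /* time each recvfrom() */
                long long t0 = nsecs(), dt;

                if (recvfrom(rx, buf, DGRAM, 0, NULL, NULL) < 0)
                        break;          /* timeout: rest was dropped */
                dt = nsecs() - t0;
                if (dt > worst)
                        worst = dt;
        }
        printf("worst recvfrom(): %lld ns\n", worst);
        wait(NULL);
        return 0;
}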
>
>
>
> On my server machine with 6 Mbytes of L2 cache, you don't see the
> problem, while on my laptop with 3 Mbytes of L2 cache, you can.
>
> I caught this thanks to the new SNMP counter added in 2.6.34
> (TCPBacklogDrop), which can easily take 1000 increments during the
> test.
>
>
> I built a test program, maybe easier to use than the various netcat
> flavors. It also uses only two tasks, which is better if you have a
> Core 2 Duo like the one in my laptop ;)
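
[ Since the test program itself is not in this mail, here is a minimal
  sketch of the idea: one writer task and one reader task over a loopback
  TCP socket, with -l giving the length of the buffer passed to each
  write()/read(). This is my reconstruction, not the actual
  loopback_transfert source; the 2 Gbyte total, the 16k default and the
  absence of error handling are mine. ]

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static size_t parse_len(const char *s)  /* accepts 1k, 128k, 1M, 4M ... */
{
        char *end;
        size_t v = strtoul(s, &end, 10);

        if (*end == 'k' || *end == 'K')
                v <<= 10;
        else if (*end == 'm' || *end == 'M')
                v <<= 20;
        return v;
}

int main(int argc, char **argv)
{
        size_t len = 16 * 1024;                 /* default: -l 16k */
        unsigned long long total = 2ULL << 30;  /* 2 Gbytes, arbitrary */
        struct sockaddr_in addr;
        socklen_t alen = sizeof(addr);
        int opt, lsock, sock;
        char *buf;

        while ((opt = getopt(argc, argv, "l:")) != -1)
                if (opt == 'l')
                        len = parse_len(optarg);

        buf = calloc(1, len);
        lsock = socket(AF_INET, SOCK_STREAM, 0);
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        bind(lsock, (struct sockaddr *)&addr, sizeof(addr));
        getsockname(lsock, (struct sockaddr *)&addr, &alen);
        listen(lsock, 1);

        if (fork() == 0) {                      /* writer task */
                unsigned long long left = total;

                sock = socket(AF_INET, SOCK_STREAM, 0);
                connect(sock, (struct sockaddr *)&addr, sizeof(addr));
                while (left > 0) {
                        ssize_t w = write(sock, buf, left < len ? left : len);

                        if (w <= 0)
                                break;
                        left -= w;
                }
                close(sock);
                exit(0);
        }
        /* reader task: consume everything in len-sized read() calls */
        sock = accept(lsock, NULL, NULL);
        while (read(sock, buf, len) > 0)
                ;
        close(sock);
        wait(NULL);
        return 0;
}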
>
> To reproduce the problem, run it with option -l 4M
>
> $ netstat -s|grep TCPBacklogDrop
> TCPBacklogDrop: 788
> $ time ./loopback_transfert -l 1k;netstat -s|grep TCPBacklogDrop
>
> real 0m14.013s
> user 0m0.630s
> sys 0m13.250s
> TCPBacklogDrop: 788
> $ time ./loopback_transfert -l 128k;netstat -s|grep TCPBacklogDrop
>
> real 0m7.447s
> user 0m0.030s
> sys 0m5.490s
> TCPBacklogDrop: 789
> $ time ./loopback_transfert -l 1M;netstat -s|grep TCPBacklogDrop
>
> real 0m11.206s
> user 0m0.020s
> sys 0m7.150s
> TCPBacklogDrop: 793
> $ time ./loopback_transfert -l 4M;netstat -s|grep TCPBacklogDrop
>
> real 0m10.347s
> user 0m0.000s
> sys 0m6.120s
> TCPBacklogDrop: 1510
> $ time ./loopback_transfert -l 16k;netstat -s|grep TCPBacklogDrop
>
> real 0m6.810s
> user 0m0.040s
> sys 0m6.670s
> TCPBacklogDrop: 1511
>
I forgot to include test results from my dev machine: a server class
machine with 8 Mbytes of L2 cache, NUMA, 2 sockets of
Intel(R) Xeon(R) CPU E5540 @ 2.53GHz.
# netstat -s|grep TCPBacklogDrop
TCPBacklogDrop: 8891
# time ./loopback_transfert -l 16k;netstat -s|grep TCPBacklogDrop
real 0m7.033s
user 0m0.010s
sys 0m4.580s
TCPBacklogDrop: 9239
# time ./loopback_transfert -l 1M;netstat -s|grep TCPBacklogDrop
real 0m5.408s
user 0m0.000s
sys 0m2.880s
TCPBacklogDrop: 9243
# time ./loopback_transfert -l 4M;netstat -s|grep TCPBacklogDrop
real 0m2.965s
user 0m0.000s
sys 0m2.070s
TCPBacklogDrop: 10485
# time ./loopback_transfert -l 6M;netstat -s|grep TCPBacklogDrop
real 0m7.711s
user 0m0.000s
sys 0m3.180s
TCPBacklogDrop: 13537
# time ./loopback_transfert -l 8M;netstat -s|grep TCPBacklogDrop
real 0m11.497s
user 0m0.020s
sys 0m3.830s
TCPBacklogDrop: 17108
As soon as our working set is larger than the L2 cache, this is very slow.
For the -l 8M bench:
# Events: 7K cycles
#
# Overhead Command Shared Object Symbol
# ........ ....... ................. ...................................
#
40.97% loopback_transf [kernel.kallsyms] [k] copy_user_generic_string
18.57% :8968 [kernel.kallsyms] [k] copy_user_generic_string
3.54% :8968 [kernel.kallsyms] [k] get_page_from_freelist
3.36% :8968 [kernel.kallsyms] [k] tcp_sendmsg
1.17% :8968 [kernel.kallsyms] [k] put_page
0.99% :8968 [kernel.kallsyms] [k] free_hot_cold_page
0.99% :8968 [kernel.kallsyms] [k] __might_sleep
0.88% :8968 [kernel.kallsyms] [k] __ticket_spin_lock
0.81% loopback_transf [kernel.kallsyms] [k] free_pcppages_bulk
0.79% :8968 [kernel.kallsyms] [k] __alloc_pages_nodemask
0.63% loopback_transf [kernel.kallsyms] [k] put_page
0.63% loopback_transf [kernel.kallsyms] [k] __might_sleep
0.63% loopback_transf [kernel.kallsyms] [k] tcp_transmit_skb
0.57% :8968 [kernel.kallsyms] [k] skb_release_data
0.55% loopback_transf [kernel.kallsyms] [k] free_hot_cold_page
0.53% :8968 [kernel.kallsyms] [k] tcp_ack
0.50% loopback_transf [kernel.kallsyms] [k] __inet_lookup_established
0.49% loopback_transf [kernel.kallsyms] [k] skb_copy_datagram_iovec
0.47% :8968 [kernel.kallsyms] [k] __rmqueue
0.45% :8968 [kernel.kallsyms] [k] get_pageblock_flags_group
0.41% :8968 [kernel.kallsyms] [k] zone_watermark_ok
0.41% :8968 [kernel.kallsyms] [k] __inc_zone_state
0.40% loopback_transf [kernel.kallsyms] [k] skb_release_data
0.39% :8968 [kernel.kallsyms] [k] tcp_transmit_skb
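
[ Rough working-set arithmetic for this run, assuming -l is the length of
  the buffer handed to each write()/read():

        sender buffer touched by each write()    ~ 8 Mbytes
        skb data queued between the two tasks    up to the 4 Mbyte default window
        receiver buffer touched by each read()   ~ 8 Mbytes

  so the bytes live between copy-in and copy-out are well above the
  8 Mbytes of L2, both copy passes (tcp_sendmsg() on one side,
  skb_copy_datagram_iovec() on the other) stream mostly from DRAM, and
  copy_user_generic_string ends up with ~59% of the cycles above
  (40.97% + 18.57%). ]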