Date:	Fri, 04 Jan 2013 18:50:59 +0000
From:	Mark Zealey <netdev@...kandruth.co.uk>
To:	netdev@...r.kernel.org
Subject: Re: UDP multi-core performance on a single socket and SO_REUSEPORT

I have now written two small test scripts, which can be found at
http://mark.zealey.org/uploads/ - one launches 16 listening threads on
a single UDP socket; the other needs to be run as:

for i in `seq 16`; do ./udp_test_client & done
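
For reference, the server side is roughly the following (a minimal
sketch of what udp_test_server does, not the exact script from the
uploads above; the port number and reply payload are placeholders):

/* Minimal sketch: one UDP socket shared by 16 threads, each blocking
 * in recvfrom() and answering with a fixed reply.  Not the real
 * udp_test_server; port and payload are placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#define NTHREADS 16
#define PORT     5300                  /* placeholder test port */

static void *worker(void *arg)
{
    int fd = *(int *)arg;
    char buf[512];
    static const char reply[] = "OK";

    for (;;) {
        struct sockaddr_in peer;
        socklen_t plen = sizeof(peer);
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                             (struct sockaddr *)&peer, &plen);
        if (n < 0)
            continue;
        /* Minimal per-packet work, then answer the sender. */
        sendto(fd, reply, sizeof(reply) - 1, 0,
               (struct sockaddr *)&peer, plen);
    }
    return NULL;
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { 0 };
    pthread_t tid[NTHREADS];

    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(PORT);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    /* All threads share the one socket; this is the contended path
     * that shows up as _raw_spin_lock_bh in the perf output below. */
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, &fd);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    close(fd);
    return 0;
}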

On my test server (32 cores) running a stock 3.7.1 kernel, 90% of the
time is spent in the kernel waiting on spinlocks. Perf output:

     44.95%  udp_test_server  [kernel.kallsyms]   [k] _raw_spin_lock_bh
             |
             --- _raw_spin_lock_bh
                |
                |--100.00%-- lock_sock_fast
                |          skb_free_datagram_locked
                |          udp_recvmsg
                |          inet_recvmsg
                |          sock_recvmsg
                |          __sys_recvmsg
                |          sys_recvmsg
                |          system_call_fastpath
                |          0x7fd8c4702a2d
                |          start_thread
                 --0.00%-- [...]

     43.48%  udp_test_client  [kernel.kallsyms]   [k] _raw_spin_lock
             |
             --- _raw_spin_lock
                |
                |--99.80%-- udp_queue_rcv_skb
                |          __udp4_lib_rcv
                |          udp_rcv
                |          ip_local_deliver_finish
                |          ip_local_deliver
                |          ip_rcv_finish
                |          ip_rcv
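
For comparison, the per-thread SO_REUSEPORT setup discussed in my
original mail (quoted below) looks roughly like this from userspace.
This is a minimal sketch assuming the semantics of the 2010 patch
(every socket sets the option before bind() and binds the same
address/port); the constant fallback and port are placeholders:

/* Minimal sketch of the per-thread SO_REUSEPORT setup: each thread
 * creates its own UDP socket, sets SO_REUSEPORT before bind(), and
 * binds the same address/port, so the kernel distributes incoming
 * datagrams across the sockets instead of funnelling everything
 * through one socket's locks.  Assumes a kernel carrying the
 * SO_REUSEPORT patch. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef SO_REUSEPORT
#define SO_REUSEPORT 15                /* Linux value; fallback for old headers */
#endif

#define NTHREADS 16
#define PORT     5300                  /* placeholder test port */

static void *worker(void *unused)
{
    (void)unused;
    int one = 1;
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = { 0 };
    char buf[512];

    /* Must be set on every socket before bind() for them to share the port. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(PORT);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return NULL;
    }

    for (;;) {
        struct sockaddr_in peer;
        socklen_t plen = sizeof(peer);
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                             (struct sockaddr *)&peer, &plen);
        if (n >= 0)                    /* echo the datagram back */
            sendto(fd, buf, n, 0, (struct sockaddr *)&peer, plen);
    }
}

int main(void)
{
    pthread_t tid[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}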

Thanks,

Mark

On 28/12/12 10:01, Mark Zealey wrote:
> I appreciate that this question has come up a number of times over the 
> years, most recently as far as I can see in this thread: 
> http://markmail.org/message/hcc7zn5ln5wktypv . I'm going to explain my 
> problem and present some performance numbers to back this up.
>
> The problem: I'm doing some research on scaling a DNS server 
> (powerdns) to work well on multi-core boxes (in this case testing with 
> 2*E5-2650 processors, i.e. Linux sees 32 cores).
>
> My powerdns configuration uses a shared socket, with one thread for 
> each core in the box listening on that socket using poll()/recvmsg(). 
> I've modified powerdns so that in my tests it does the absolute minimum 
> of work to answer packets (all queries are for the same record; it 
> keeps the response in memory and just changes a few fields before 
> calling sendmsg()). I'm binding to a single 10.xxx address and using 
> this for all local and remote tests.
>
> The numbers below are generated using 16 parallel queryperf's on 
> localhost (it doesn't really matter if it is from remote hosts or the 
> localhost; the numbers don't change much).
>
> - Using the stock CentOS 6.3 kernel, powerdns performs at around 
>   120kqps (and uses at most about 12 CPUs).
> - Using a 3.7.1 kernel (from elrepo), this increases to 200-240kqps, 
>   maxing out all CPUs in the box (soft-interrupt CPU time is about 8x 
>   higher than on the CentOS 6.3 kernel, at 40%; system CPU time is at 
>   50%; powerdns itself only uses 10% of the CPU time).
> - Using the stock CentOS 6.3 kernel with the Google SO_REUSEPORT patch 
>   from 2010 (modified slightly so it applies), I see 500-600kqps from 
>   remote hosts, or 1mqps when doing localhost queries. powerdns doesn't 
>   go past using 8 CPUs - it appears that the limit it is hitting then 
>   is some lock in sendmsg().
>
> I've not been able to get the 2010 SO_REUSEPORT patch working on the 
> 3.7.1 kernel; I suspect it would make for even better performance 
> there, as sendmsg() should have been significantly improved.
>
> Now, I don't believe that SO_REUSEPORT is needed in the kernel in this 
> case; however, the numbers above clearly show that the current UDP 
> recvmsg() implementation on a single socket across multiple cores 
> is still locking badly on kernel 3.7.1. A perf report on 3.7.1 (using 
> 16 local queryperf's) shows:
>
>     68.34%  pdns_server  [kernel.kallsyms]    [k] _raw_spin_lock_bh
>             |
>             --- 0x7fa472023a2d
>                 system_call_fastpath
>                 sys_recvmsg
>                 __sys_recvmsg
>                 sock_recvmsg
>                 inet_recvmsg
>                 udp_recvmsg
>                 skb_free_datagram_locked
>                |
>                |--100.00%-- lock_sock_fast
>                |          _raw_spin_lock_bh
>                 --0.00%-- [...]
>
>      3.10%  pdns_server  [kernel.kallsyms]    [k] _raw_spin_lock_irqsave
>             |
>             --- 0x7fa472023a2d
>                 system_call_fastpath
>                 sys_recvmsg
>                 __sys_recvmsg
>                 sock_recvmsg
>                 inet_recvmsg
>                 udp_recvmsg
>                |
>                |--99.69%-- __skb_recv_datagram
>                |          |
>                |          |--77.68%-- _raw_spin_lock_irqsave
>                |          |
>                |          |--14.56%-- prepare_to_wait_exclusive
>                |          |          _raw_spin_lock_irqsave
>                |          |
>                |           --7.76%-- finish_wait
>                |                     _raw_spin_lock_irqsave
>                 --0.31%-- [...]
>                ...
>
> Any advice or patches welcome... :-)
>
> Mark
>
