Date:   Wed, 27 Nov 2019 17:26:48 +0100
From:   Paolo Abeni <pabeni@...hat.com>
To:     Jesper Dangaard Brouer <brouer@...hat.com>,
        David Laight <David.Laight@...LAB.COM>
Cc:     'Marek Majkowski' <marek@...udflare.com>,
        linux-kernel <linux-kernel@...r.kernel.org>,
        network dev <netdev@...r.kernel.org>,
        kernel-team <kernel-team@...udflare.com>
Subject: Re: epoll_wait() performance

On Wed, 2019-11-27 at 16:48 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 27 Nov 2019 10:39:44 +0000 David Laight <David.Laight@...LAB.COM> wrote:
> 
> > ...
> > > > While using recvmmsg() to read multiple messages might seem a good idea, it is much
> > > > slower than recv() when there is only one message (even recvmsg() is a lot slower).
> > > > (I'm not sure why the code paths are so slow; I suspect it is all the copy_from_user()
> > > > and faffing with the user iov[].)
> > > > 
> > > > So using poll() we repoll the fd after calling recv() to find if there is a second message.
> > > > However the second poll has a significant performance cost (but less than using recvmmsg()).  
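
A minimal sketch of the recv()+poll() repoll pattern described above, assuming a
single non-blocking UDP socket (function and buffer names are illustrative, not
taken from the original code):

#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Read one datagram, then repoll with a zero timeout to check whether a
 * second datagram is already queued. */
static void read_one_then_repoll(int fd, char *buf, size_t len)
{
    ssize_t n = recv(fd, buf, len, MSG_DONTWAIT);

    if (n < 0)
        return;                 /* nothing queued */

    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    if (poll(&pfd, 1, 0) > 0 && (pfd.revents & POLLIN))
        recv(fd, buf, len, MSG_DONTWAIT);   /* second message, if any */
}
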
> > > 
> > > That sounds wrong. Single recvmmsg(), even when receiving only a
> > > single message, should be faster than two syscalls - recv() and
> > > poll().  
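
For comparison, a minimal sketch of the single-syscall recvmmsg() variant being
discussed, again assuming a non-blocking UDP socket (VLEN and the buffer size
are illustrative):

#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define VLEN  8
#define BUFSZ 2048

/* Receive up to VLEN datagrams with a single recvmmsg() call. */
static int read_burst(int fd)
{
    static char bufs[VLEN][BUFSZ];
    struct iovec iov[VLEN];
    struct mmsghdr msgs[VLEN];

    memset(msgs, 0, sizeof(msgs));
    for (int i = 0; i < VLEN; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len  = BUFSZ;
        msgs[i].msg_hdr.msg_iov    = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }

    /* Returns how many datagrams were received (often just 1 in this workload). */
    return recvmmsg(fd, msgs, VLEN, MSG_DONTWAIT, NULL);
}
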
> > 
> > My suspicion is that the extra two copy_from_user() calls needed for each recvmsg are a
> > significant overhead, most likely due to the crappy code that tries to stop
> > the kernel buffer being overrun.
> > 
> > I need to run the tests on a system with a 'home built' kernel to see how much
> > difference this makes (by seeing how much slower duplicating the copy makes it).
> > 
> > The system call cost of poll() gets factored over a reasonable number of sockets.
> > So doing poll() on a socket with no data is a lot faster than the setup for recvmsg
> > even allowing for looking up the fd.
> > 
> > This could be fixed by an extra flag to recvmmsg() to indicate that you only really
> > expect one message and to call the poll() function before each subsequent receive.
> > 
> > There is also the 'reschedule' that Eric added to the loop in recvmmsg.
> > I don't know how much that actually costs.
> > In this case the process is likely to be running at a RT priority and pinned to a cpu.
> > In some cases the cpu is also reserved (at boot time) so that 'random' other code can't use it.
> > 
> > We really do want to receive all these UDP packets in a timely manner,
> > although very low latency isn't itself an issue.
> > The data is telephony audio with (typically) one packet every 20ms.
> > The code only looks for packets every 10ms - that helps no end since, in principle,
> > only a single poll()/epoll_wait() call (on all the sockets) is needed every 10ms.
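
A minimal sketch of the 10ms cycle described above - one epoll_wait() across all
the audio sockets per cycle, then a non-blocking recv() per ready fd (epoll setup
and error handling are omitted; names are illustrative):

#include <sys/epoll.h>
#include <sys/socket.h>

#define MAX_EVENTS 64

static void poll_cycle(int epfd)
{
    struct epoll_event ev[MAX_EVENTS];
    char buf[2048];

    /* Wait at most 10ms; each socket typically gets one packet per 20ms. */
    int n = epoll_wait(epfd, ev, MAX_EVENTS, 10);

    for (int i = 0; i < n; i++) {
        int fd = ev[i].data.fd;

        while (recv(fd, buf, sizeof(buf), MSG_DONTWAIT) > 0)
            ;   /* drain whatever is queued on this socket */
    }
}
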
> 
> I have a simple udp_sink tool[1] that cycles through the different
> receive socket system calls.  I gave it a quick spin on a F31 kernel
> 5.3.12-300.fc31.x86_64 on a mlx5 100G interface, and I'm very surprised
> to see a significant regression/slowdown for recvMmsg.
> 
> $ sudo ./udp_sink --port 9 --repeat 1 --count $((10**7))
>           	run      count   	ns/pkt	pps		cycles	payload
> recvMmsg/32  	run:  0	10000000	1461.41	684270.96	5261	18	 demux:1
> recvmsg   	run:  0	10000000	889.82	1123824.84	3203	18	 demux:1
> read      	run:  0	10000000	974.81	1025841.68	3509	18	 demux:1
> recvfrom  	run:  0	10000000	1056.51	946513.44	3803	18	 demux:1
> 
> Normal recvmsg has almost double the performance of recvmmsg.

For stream tests, the above is true if the BH is able to push the
packets to the socket fast enough. Otherwise recvmmsg() will make user
space even faster: the BH will find the user-space process sleeping
more often and will have to spend more time waking it up.

If a single receive queue is in use, this condition is not easy to meet.

Before the Spectre/Meltdown and other mitigations, using connected
sockets and removing ct/nf was usually sufficient - at least in my
scenarios - to make the BH fast enough.
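
By "connected sockets" I mean calling connect() on each receiving UDP
socket towards its single known peer, so the receive path can use the
connected-socket lookup. A minimal sketch, assuming IPv4 and one known
peer per socket (addresses and names are illustrative):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

static int connect_udp_socket(int fd, const char *peer_ip, uint16_t peer_port)
{
    struct sockaddr_in peer;

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port   = htons(peer_port);
    inet_pton(AF_INET, peer_ip, &peer.sin_addr);

    /* After this, only datagrams from this peer are delivered to fd, and
     * the receive side can use the connected-socket lookup. */
    return connect(fd, (struct sockaddr *)&peer, sizeof(peer));
}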

But that's no longer the case, and I have to use 2 or more different
receive queues.

@David: If I read your message correctly, the packet rate you are
dealing with is quite low... are we talking about throughput or
latency? I guess latency could be measurably higher with recvmmsg()
compared to the other syscalls. How do you measure the relative
performance of recvmmsg() and recv()? With a micro-benchmark/rdtsc()?
Am I right that you are usually getting a single packet per recvmmsg()
call?
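
For reference, the kind of rdtsc()-based micro-benchmark I have in mind
is something like the sketch below (x86 only; the measured loop body is
just a placeholder for the recv()/recvmmsg() call under test):

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void)
{
    const int iters = 1000000;
    volatile uint64_t sink = 0;

    uint64_t start = __rdtsc();
    for (int i = 0; i < iters; i++)
        sink += i;              /* replace with the syscall under test */
    uint64_t end = __rdtsc();

    printf("%.1f cycles per iteration\n", (double)(end - start) / iters);
    return (int)(sink & 1);     /* keep the compiler from discarding sink */
}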

Thanks,

Paolo
