Message-ID: <20190706034508.43aabff0@hikaru>
Date: Sat, 6 Jul 2019 03:45:08 +0200
From: Carlo Wood <carlo@...noe.com>
To: davem@...emloft.net, netdev@...r.kernel.org
Subject: epoll_wait() (and epoll_pwait) stall for 206 ms per call on sockets
with a small-ish snd/rcv buffer.
Hi all,
hopefully I'm reporting this in the right place.
While working on my C++ networking library, I ran into a problem that
I have traced back to the kernel itself.
The problem is related to setting a socket buffer size (snd/rcv).
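By "setting a socket buffer size" I mean the usual setsockopt() calls,
e.g. (size illustrative; error checking omitted):

    int sz = 16384;
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sz, sizeof(sz));
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz));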
At first I noticed the problem while trying to test the thread-safety
of my lock-free stream buffer implementation (with a std::streambuf
interface) that allows reading and writing by different threads
at the same time. To test this, I read data from a socket
into a buffer and write that data back to the same socket from the
same buffer. This resulted in a throughput of only a few tens of kB/s,
which after investigation turned out to be caused by epoll_pwait()
not returning until 205 to 208 ms had passed, even though data was
already available.
At first I thought it could be a bug in libev, which I was using as
a wrapper around epoll - so I ripped that out and wrote my own code to
interface with epoll directly, but got the same result. Next I changed
the test into a one-way stream: just send data from one socket to
another (not back); again the same result. Then I wrote a standalone,
single-threaded test application in C. There I discovered that first
calling connect() on a socket and then (immediately afterwards) setting
the socket buffer sizes caused the stall to occur. Assuming that this
ordering is simply not allowed, I thought I had solved it.
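That is, roughly this order (a sketch, not the exact test code; error
handling and the address setup omitted):

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    connect(fd, (struct sockaddr *)&addr, sizeof(addr)); /* first connect, */
    int sz = 16384;                                      /* then set the   */
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sz, sizeof(sz)); /* buffer      */
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz)); /* sizes       */

Doing the setsockopt() calls before the connect() seemed to avoid the
stall in that test.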
But switching back to sending data back and forth over the TCP
connection made the problem return.
So I once more wrote a standalone, single-threaded C program that
reproduces the problem.
You can find the program here (I'm scared of triggering spam filters
by attaching it, but let me know if an attachment is OK):
https://github.com/CarloWood/ai-evio-testsuite/blob/e6e36b66256ec5d1760d2b7b96b34d60a12477aa/src/epoll_bug.c
With socket snd/rcv buffers smaller than 32 kB I always get
the problem, but with 64 kB (65536 bytes) I don't, on my box.
I tested this with kernels 4.15 and 5.2.0-rc7, and had someone
else confirm it with 5.1.15.
When running the program as-is, it will start printing a flood
of "epoll_wait() was delayed!" messages, meaning that epoll_wait()
didn't return within 10 ms. The actual delay can be seen with, for
example, strace -T. You can also uncomment the commented-out printf's,
which print the number of bytes written to and read from each socket,
and therefore the number of bytes that are still "in the pipeline".
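The "delayed" check is essentially just timing the call; something
like this sketch (see the linked source for the real code):

    struct epoll_event events[16];
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    epoll_wait(epfd, events, 16, -1);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                (t1.tv_nsec - t0.tv_nsec) / 1e6;
    if (ms > 10.0)
      printf("epoll_wait() was delayed!\n"); /* exact delay: strace -T */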
What I observe is that the number of bytes in the pipeline stays
very constant (because the program just writes until no more can be
written, and then - on the other end - reads until there is nothing
left to read - all single-threaded), until suddenly that amount jumps
UP to about 2.5 MB. That is then SLOWLY drained down to roughly 1.5 MB
(without epoll reporting that more can be written), at which point
writing suddenly becomes possible again and the pipeline fills up
once more.
I have no idea where those megabytes of data are buffered; the
socket buffers are WAY smaller.
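To be clear, "bytes in the pipeline" is just the test's own
bookkeeping; a sketch:

    size_t total_written = 0, total_read = 0;
    /* ... updated after every successful write()/read() ... */
    size_t in_flight = total_written - total_read;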
Obviously I have no idea whether this is a problem with epoll itself,
or whether there is a problem with TCP/IP in this case.
But I am not a kernel hacker, so all I could do was provide this
little test case that allows you to reproduce it.
--
Carlo Wood <carlo@...noe.com>