Message-ID: <1dbe0f24-1076-4e91-b2c2-765a0e28b017@mail.uni-paderborn.de>
Date: Mon, 26 May 2025 02:44:10 +0200
From: Dennis Baurichter <dennisba@...l.uni-paderborn.de>
To: netdev@...r.kernel.org, netfilter@...r.kernel.org
Subject: Issue with delayed segments despite TCP_NODELAY
Hi,
I have a question about why the kernel stops sending further TCP segments
after the handshake and the first 2 (or 3) payload segments have been sent.
This seems to happen when the round trip time is "too high" (e.g., over
9ms or 15ms, depending on the system). The remaining segments are
(apparently) only sent after an ACK has been received, even though
TCP_NODELAY is set on the socket.
This is happening on a range of different kernels, from Arch Linux's
6.14.7 (which should be rather close to mainline) down to Ubuntu 22.04's
5.15.0-134-generic (admittedly somewhat "farther away" from mainline). I
can test on an actual mainline kernel, too, if that helps.
I will describe our (probably somewhat uncommon) setup below. If you
need any further information, I'll be happy to provide it.
My colleague and I have the following setup:
- Userland application connects to a server via TCP/IPv4 (complete TCP
handshake is performed).
- An nftables rule is added to intercept packets of this connection and
put them into a netfilter queue.
- Userland application writes data into this TCP socket.
- The data is written in up to 4 chunks, which are intended to end up
in individual TCP segments.
- The socket has TCP_NODELAY set.
- sysctl net.ipv4.tcp_autocorking=0
- The above nftables rule is removed.
- Userland application (a different part of it) retrieves all packets
from the netfilter queue.
- At this point it can happen that, e.g., only 2 out of 4 segments can
be retrieved.
- Reading from the netfilter queue is attempted until 5 timeouts of
20ms each have occurred. Even much higher timeout values don't change
the results, so it's not a race condition.
- Userland application performs some modifications on the intercepted
segments and eventually issues verdict NF_ACCEPT.
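For concreteness, the client-side socket setup from the steps above looks
roughly like this sketch (host, port, and payload chunks are placeholder
values, not our application's real ones; net.ipv4.tcp_autocorking=0 is a
system-wide sysctl set separately):

```python
import socket

# Rough sketch of the client side described above; host, port, and the
# payload chunks are placeholders.
def send_chunks(host: str, port: int, chunks: list[bytes]) -> None:
    with socket.create_connection((host, port)) as s:
        # Disable Nagle's algorithm so each write can go out as its own
        # segment (still subject to cwnd and other sender-side limits).
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        for chunk in chunks:
            s.sendall(chunk)
```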
We checked (via strace) that all payload chunks are successfully written
to the socket, (via the nlmon kernel module) that there are no errors in
the netlink communication, and (via nft monitor) that indeed no further
segments traverse the netfilter pipeline before the first two payload
segments are actually sent on the wire.
We dug through the entire list of TCP and IPv4 sysctls (testing several
of them), tried loading and using different congestion control modules,
toggled TCP_NODELAY off and on between each write to the socket (to
trigger an explicit flush), and tried other things, but to no avail.
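The toggle trick we tried is essentially the following (a sketch; the
helper name and re-set order are ours, and setting TCP_NODELAY is what is
supposed to push any queued data):

```python
import socket

# Hypothetical helper illustrating the "toggle TCP_NODELAY to flush"
# idea: after a write, clearing and re-setting the option should push
# any data still queued in the socket.
def write_and_flush(s: socket.socket, data: bytes) -> None:
    s.sendall(data)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
```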
Modifying our code, we can see that after NF_ACCEPT'ing the first
segments, we can retrieve the remaining segments from the netfilter queue.
In Wireshark we see that this seems to be triggered by the incoming ACK
segment from the server.
Notably, we can intercept all segments at once when testing this on
localhost or within a LAN. However, on long-distance / higher-latency
connections, we can only intercept 2 (sometimes 3) segments.
Testing on a LAN connection from an old laptop to a fast PC, we delayed
packets on the latter with variants of:
tc qdisc add dev eth0 root netem delay 15ms
We got the following mappings of delay / rtt to number of segments
intercepted:
below 15ms -> all (up to 4) segments intercepted
15-16ms -> 2-3 segments
16-17ms -> 2 (sometimes 3) segments
over 20ms -> 2 segments (tested 20ms, 200ms, 500ms)
Testing in the other direction, from fast PC to old laptop (which now
has the qdisc delay), we get similar results, just with lower round trip
times (15ms becomes more like 8-9ms).
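For reference, the netem variants were along these lines (interface name
and delay values are examples; the commands need root):

```shell
# Add an artificial egress delay on the receiving side:
tc qdisc add dev eth0 root netem delay 15ms
# Adjust the delay without tearing the qdisc down:
tc qdisc change dev eth0 root netem delay 20ms
# Remove the qdisc again to restore normal behaviour:
tc qdisc del dev eth0 root
```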
We would very much appreciate it if someone could help us on the
following questions:
- Why are the remaining segments not sent out immediately, despite
TCP_NODELAY?
- Is there a way to change this?
- If not, do you have better workarounds than injecting a fake ACK
pretending to come "from the server" via a raw socket?
Actually, we haven't tried this yet, but probably will soon.
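In case it helps the discussion, a pure-stdlib sketch of building such a
fake ACK segment could look like the following (all addresses, ports, and
sequence numbers are placeholders; actually injecting it would need a raw
socket, i.e. root/CAP_NET_RAW, plus the live connection's real seq/ack
values):

```python
import struct

def _ones_complement_sum(data: bytes) -> int:
    # 16-bit ones' complement sum, as used by the TCP checksum.
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)
    return total

def tcp_checksum(src_ip: bytes, dst_ip: bytes, segment: bytes) -> int:
    # Checksum over the IPv4 pseudo-header (src, dst, zero, proto=6,
    # TCP length) followed by the TCP segment itself.
    pseudo = struct.pack("!4s4sBBH", src_ip, dst_ip, 0, 6, len(segment))
    return (~_ones_complement_sum(pseudo + segment)) & 0xFFFF

def build_fake_ack(src_ip: bytes, dst_ip: bytes, sport: int, dport: int,
                   seq: int, ack: int) -> bytes:
    # 20-byte TCP header: data offset 5 words, flags = ACK (0x10),
    # window 65535, checksum initially 0, urgent pointer 0.
    hdr = struct.pack("!HHIIBBHHH", sport, dport, seq, ack,
                      5 << 4, 0x10, 65535, 0, 0)
    csum = tcp_checksum(src_ip, dst_ip, hdr)
    return hdr[:16] + struct.pack("!H", csum) + hdr[18:]
```

Sending the result would then go through something like
socket.socket(AF_INET, SOCK_RAW, IPPROTO_TCP), where the kernel still
builds the IP header for us.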
Regards,
Dennis