Message-ID: <CADVnQykQ+NGdONiK6AwL9CN=nj-8C6rwS4dtf-6p1f+JFyVqug@mail.gmail.com>
Date: Mon, 26 May 2025 09:50:23 -0400
From: Neal Cardwell <ncardwell@...gle.com>
To: Dennis Baurichter <dennisba@...l.uni-paderborn.de>
Cc: netdev@...r.kernel.org, netfilter@...r.kernel.org, 
	Eric Dumazet <edumazet@...gle.com>
Subject: Re: Issue with delayed segments despite TCP_NODELAY

On Sun, May 25, 2025 at 9:01 PM Dennis Baurichter
<dennisba@...l.uni-paderborn.de> wrote:
>
> Hi,
>
> I have a question on why the kernel stops sending further TCP segments
> after the handshake and first 2 (or 3) payload segments have been sent.
> This seems to happen if the round trip time is "too high" (e.g., over
> 9ms or 15ms, depending on system). Remaining segments are (apparently)
> only sent after an ACK has been received, even though TCP_NODELAY is set
> on the socket.
>
> This is happening on a range of different kernels, from Arch Linux'
> 6.14.7 (which should be rather close to mainline) down to Ubuntu 22.04's
> 5.15.0-134-generic (admittedly somewhat "farther away" from mainline). I
> can test on an actual mainline kernel, too, if that helps.
> I will describe our (probably somewhat uncommon) setup below. If you
> need any further information, I'll be happy to provide it.
>
> My colleague and I have the following setup:
> - Userland application connects to a server via TCP/IPv4 (complete TCP
> handshake is performed).
> - A nftables rule is added to intercept packets of this connection and
> put them into a netfilter queue.
> - Userland application writes data into this TCP socket.
>    - The data is written in up to 4 chunks, which are intended to end up
> in individual TCP segments.
>    - The socket has TCP_NODELAY set.
>    - sysctl net.ipv4.tcp_autocorking=0
> - The above nftables rule is removed.
> - Userland application (a different part of it) retrieves all packets
> from the netfilter queue.
>    - Here it can happen that, e.g., only 2 out of 4 segments can be retrieved.
>    - Reading from the netfilter queue is attempted until 5 timeouts of
> 20ms each occurred. Even much higher timeout values don't change the
> results, so it's not a race condition.
> - Userland application performs some modifications on the intercepted
> segments and eventually issues verdict NF_ACCEPT.
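
For concreteness, the sender side described above boils down to roughly
the following minimal sketch (server address, port, and payload chunks
are placeholders, not taken from the report; most error handling is
omitted):

/* Illustrative only: connect, set TCP_NODELAY, write 4 small chunks. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    struct sockaddr_in srv = { .sin_family = AF_INET,
                               .sin_port = htons(5555) };  /* placeholder port */
    const char *chunks[4] = { "chunk-1", "chunk-2", "chunk-3", "chunk-4" };

    inet_pton(AF_INET, "192.0.2.10", &srv.sin_addr);        /* placeholder server */
    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect");
        return 1;
    }

    /* Disable Nagle so each write may leave as its own segment;
     * net.ipv4.tcp_autocorking=0 is set separately via sysctl. */
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));

    /* Each chunk is intended to end up in its own TCP segment. */
    for (int i = 0; i < 4; i++)
        write(fd, chunks[i], strlen(chunks[i]));

    close(fd);
    return 0;
}
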
>
> We checked (via strace) that all payload chunks are successfully written
> to the socket, (via nlmon kernel module) that there are no errors in the
> netlink communication, and (via nft monitor) that indeed no further
> segments traverse the netfilter pipeline before the first two payload
> segments are actually sent on the wire.
> We dug through the entire list of TCP and IPv4 sysctls (testing several
> of them), tried loading and using different congestion control
> modules, toggling TCP_NODELAY off and on between each write to the
> socket (to trigger an explicit flush), and other things, but to no avail.
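
The TCP_NODELAY toggle mentioned above amounts to something like this
sketch of a helper wrapping each write ("fd" is the connected socket;
re-enabling the option is what is expected to trigger the explicit
flush):

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Write one chunk, then toggle TCP_NODELAY off and back on. */
static ssize_t write_and_flush(int fd, const void *buf, size_t len)
{
    int off = 0, on = 1;
    ssize_t n = write(fd, buf, len);

    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &off, sizeof(off));
    setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
    return n;
}
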
>
> Modifying our code, we can see that after NF_ACCEPT'ing the first
> segments, we can retrieve the remaining segments from netfilter queue.
> In Wireshark we see that this seems to be triggered by the incoming ACK
> segment from the server.
>
> Notably, we can intercept all segments at once when testing this on
> localhost or on a LAN. However, on long-distance / higher-latency
> connections, we can only intercept 2 (sometimes 3) segments.
>
> Testing on a LAN connection from an old laptop to a fast PC, we delayed
> packets on the latter one with variants of:
> tc qdisc add dev eth0 root netem delay 15ms
> We got the following mappings of delay / rtt to number of segments
> intercepted:
> below 15ms -> all (up to 4) segments intercepted
> 15-16ms -> 2-3 segments
> 16-17ms -> 2 (sometimes 3) segments
> over 20ms -> 2 segments (tested 20ms, 200ms, 500ms)
> Testing in the other direction, from fast PC to old laptop (which now
> has the qdisc delay), we get similar results, just with lower round trip
> times (15ms becomes more like 8-9ms).
>
> We would very much appreciate it if someone could help us on the
> following questions:
> - Why are the remaining segments not sent out immediately, despite
> TCP_NODELAY?
> - Is there a way to change this?
> - If not, do you have better workarounds than injecting a fake ACK
> pretending to come "from the server" via a raw socket?
>    Actually, we haven't tried this yet, but probably will soon.
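
For what it's worth, a hypothetical sketch of that last idea: a bare ACK
spoofed "from the server" via a raw socket. Everything below (addresses,
ports, sequence numbers) is a placeholder; in practice those values would
have to be copied from the intercepted segments, and none of this has
been tested. Needs CAP_NET_RAW.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/tcp.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* RFC 1071 Internet checksum over a byte buffer. */
static uint16_t csum16(const uint8_t *p, size_t len)
{
    uint32_t sum = 0;

    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)p[i] << 8 | p[i + 1];
    if (len & 1)
        sum += (uint32_t)p[len - 1] << 8;
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return htons((uint16_t)~sum);
}

/* Send a bare ACK that appears to come from srv:sport towards cli:dport. */
static int send_fake_ack(const char *srv, uint16_t sport,
                         const char *cli, uint16_t dport,
                         uint32_t seq, uint32_t ack)
{
    struct iphdr ip;
    struct tcphdr tcp;
    uint8_t pseudo[12 + sizeof(tcp)];
    uint8_t pkt[sizeof(ip) + sizeof(tcp)];
    struct sockaddr_in dst = { .sin_family = AF_INET };
    int fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW); /* IP_HDRINCL implied */

    if (fd < 0) {
        perror("socket");
        return -1;
    }
    memset(&ip, 0, sizeof(ip));
    memset(&tcp, 0, sizeof(tcp));
    inet_pton(AF_INET, srv, &ip.saddr);      /* spoofed "server" source */
    inet_pton(AF_INET, cli, &ip.daddr);
    dst.sin_addr.s_addr = ip.daddr;

    ip.version = 4;
    ip.ihl = 5;
    ip.ttl = 64;
    ip.protocol = IPPROTO_TCP;
    ip.tot_len = htons(sizeof(pkt));         /* 20B IP + 20B TCP, no payload */
    /* ip.check stays 0: the kernel fills in the IP checksum for us */

    tcp.source = htons(sport);
    tcp.dest = htons(dport);
    tcp.seq = htonl(seq);                    /* server's current sequence number */
    tcp.ack_seq = htonl(ack);                /* next byte expected from the client */
    tcp.doff = 5;
    tcp.ack = 1;
    tcp.window = htons(65535);

    /* TCP checksum over the IPv4 pseudo-header plus the TCP header. */
    memcpy(pseudo, &ip.saddr, 4);
    memcpy(pseudo + 4, &ip.daddr, 4);
    pseudo[8] = 0;
    pseudo[9] = IPPROTO_TCP;
    pseudo[10] = 0;
    pseudo[11] = sizeof(tcp);
    memcpy(pseudo + 12, &tcp, sizeof(tcp));
    tcp.check = csum16(pseudo, sizeof(pseudo));

    memcpy(pkt, &ip, sizeof(ip));
    memcpy(pkt + sizeof(ip), &tcp, sizeof(tcp));
    if (sendto(fd, pkt, sizeof(pkt), 0,
               (struct sockaddr *)&dst, sizeof(dst)) < 0)
        perror("sendto");
    close(fd);
    return 0;
}

int main(void)
{
    /* Placeholder 4-tuple and sequence numbers. */
    return send_fake_ack("192.0.2.10", 5555, "192.0.2.20", 43210,
                         0x11111111, 0x22222222);
}
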

Sounds like you are probably seeing the effects of TCP Small Queues
(TSQ) limiting the number of skbs queued in various layers of the
sending machine. See tcp_small_queue_check() for details.

Probably with shorter RTTs the incoming ACKs clear skbs from the rtx
queue, and thus the tcp_small_queue_check() call to
tcp_rtx_queue_empty_or_single_skb(sk) returns true and
tcp_small_queue_check() returns false, enabling transmissions.
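
To make that concrete, here is a toy userland model of that decision. The
struct fields and numbers are stand-ins, not the kernel's data structures;
the real logic lives in tcp_small_queue_check() in net/ipv4/tcp_output.c
and differs in detail across the kernel versions mentioned above.

#include <stdbool.h>
#include <stdio.h>

struct toy_sock {
    unsigned long wmem_alloc;    /* bytes of skbs sitting below TCP
                                  * (qdisc/driver, or held in an nfqueue) */
    unsigned long pacing_rate;   /* bytes per second */
    int rtx_queue_len;           /* skbs still waiting for an ACK */
};

/* Roughly the shape of the TSQ gate: returns true when the next skb
 * would be deferred.  (The real code also caps the budget with the
 * net.ipv4.tcp_limit_output_bytes sysctl.) */
static bool toy_small_queue_check(const struct toy_sock *sk,
                                  unsigned long skb_truesize)
{
    unsigned long limit = 2 * skb_truesize;
    unsigned long pacing_budget = sk->pacing_rate >> 10;  /* ~1ms of data */

    if (pacing_budget > limit)
        limit = pacing_budget;

    if (sk->wmem_alloc <= limit)
        return false;            /* under budget: transmit now */

    /* Over budget, but an (almost) empty rtx queue still allows sending.
     * With a short RTT the ACKs drain the rtx queue quickly, so all four
     * segments get out; with a long RTT the first skbs are still unacked
     * and the remaining segments are deferred until an ACK arrives. */
    if (sk->rtx_queue_len <= 1)
        return false;

    return true;                 /* defer until TX completion / ACK */
}

int main(void)
{
    /* Two made-up snapshots: same amount of data charged to the socket
     * below TCP, different rtx-queue occupancy. */
    struct toy_sock short_rtt = { .wmem_alloc = 6000, .pacing_rate = 120000,
                                  .rtx_queue_len = 0 };
    struct toy_sock long_rtt  = { .wmem_alloc = 6000, .pacing_rate = 120000,
                                  .rtx_queue_len = 2 };

    printf("short RTT: defer=%d\n", toy_small_queue_check(&short_rtt, 800));
    printf("long RTT:  defer=%d\n", toy_small_queue_check(&long_rtt, 800));
    return 0;
}
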

What is it that you are trying to accomplish with this nftables approach?

neal
