Message-ID: <aXaHEk_eRJyhYfyM@gandalf.schnuecks.de>
Date: Sun, 25 Jan 2026 22:11:46 +0100
From: Simon Baatz <gmbnomis@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: "David S . Miller" <davem@...emloft.net>,
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
	Neal Cardwell <ncardwell@...gle.com>,
	Simon Horman <horms@...nel.org>,
	Kuniyuki Iwashima <kuniyu@...gle.com>,
	Willem de Bruijn <willemb@...gle.com>, netdev@...r.kernel.org,
	eric.dumazet@...il.com, c.ebner@...xmox.com
Subject: Re: [regression] [PATCH net-next 7/8] tcp: stronger sk_rcvbuf checks

Hi,

I am seeing a regression in the Valkey test suite with kernels >=
6.17. A bisection points to 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf
checks"), but my impression is that this change mostly makes an
underlying issue surface earlier. Additionally, 9ca48d616e ("tcp: do
not accept packets beyond window") seems to make the problem even
worse.

Valkey test scenario:

Test client opens a connection and sends a few MB worth of Valkey
commands (one write per command). The client does not perform any
reads until all commands have been sent.

Valkey server accepts the connection, enables TCP_NODELAY, and uses
non-blocking I/O to read/write. It writes a response for each command
and buffers data internally when required.
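The traffic pattern above can be sketched as follows (a hypothetical minimal sketch, not the actual reproducer — that is in the gist linked below; volumes are kept tiny here so the exchange completes, whereas the real test pushes a few MB through loopback, which is what gets stuck on affected kernels):

```python
import socket
import threading

CMD = b"PING"        # stand-in for a Valkey command
RESP = b"x" * 127    # server response per command (127 < 2**wscale)
NCMDS = 50           # real scenario: a few MB worth of commands

def server(srv):
    conn, _ = srv.accept()
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    received = answered = 0
    while received < NCMDS * len(CMD):
        received += len(conn.recv(65536))
        # one 127-byte response per complete command received so far
        while answered < received // len(CMD):
            conn.sendall(RESP)
            answered += 1
    conn.close()

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)
threading.Thread(target=server, args=(srv,), daemon=True).start()

cli = socket.create_connection(srv.getsockname())
for _ in range(NCMDS):                # write-only phase: no reads until all sent
    cli.sendall(CMD)
got = b""
while len(got) < NCMDS * len(RESP):   # only now drain the responses
    got += cli.recv(65536)
print(len(got))                       # 6350
```

(The sketch uses blocking I/O for brevity; the real server uses non-blocking reads/writes and buffers internally as described above.)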

The connection runs over the loopback interface with MTU 65536. In
most cases this connection ends up stuck. The system is otherwise idle
and has plenty of free memory.

I have a small reproducer in which the client continuously sends
commands and never reads. The server sends 127 bytes after reading
data (any payload < 2^wscale works; 127 fills the buffer fastest).
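The point about payloads below 2^wscale follows from how window scaling works: the 16-bit header field carries the window right-shifted by wscale, so with wscale=7 (WS=128 in the trace below) the advertised value has a granularity of 128 bytes. A quick illustration (plain Python arithmetic, not kernel code):

```python
WSCALE = 7  # as negotiated in the trace below (WS=128)

def on_wire(window_bytes):
    # value actually placed in the 16-bit TCP header window field
    return window_bytes >> WSCALE

print(on_wire(65535))        # 511
print(on_wire(65535 - 127))  # 511: one 127-byte payload need not change the field
print(on_wire(65535 - 128))  # 510
```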

Trigger warning: reproducer code generated using AI:
https://gist.github.com/gmbnomis/0b75b2b88f49dc33d6c38ac23120b1e3

Here is a run on 6.19.0-rc6 using virtme-ng (server on port 7000, wscale is 7):

    1   0.000000    127.0.0.1 → 127.0.0.1    TCP 74 37532 → 7000 [SYN] Seq=0 Win=65495 Len=0 MSS=65495 SACK_PERM TSval=4155167414 TSecr=0 WS=128
    2   0.000080    127.0.0.1 → 127.0.0.1    TCP 74 7000 → 37532 [SYN, ACK] Seq=0 Ack=1 Win=65483 Len=0 MSS=65495 SACK_PERM TSval=4155167415 TSecr=4155167414 WS=128

[...]

At this point the client is still advertising a large receive window
even though it has run out of receive buffer space. This happens
because window scaling is used and we must not shrink the window (see
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/).
Before 9ca48d616e ("tcp: do not accept packets beyond window"), the
stack would accept significantly more (up to max rmem; on 6.16 this
happens after approx. 5 MB with standard rmem buffer settings):
 1510   0.021964    127.0.0.1 → 127.0.0.1    TCP 226 37532 → 7000 [PSH, ACK] Seq=152225 Ack=95505 Win=65536 Len=160 TSval=4155167436 TSecr=4155167436
 1511   0.021977    127.0.0.1 → 127.0.0.1    TCP 193 7000 → 37532 [PSH, ACK] Seq=95505 Ack=152385 Win=191104 Len=127 TSval=4155167436 TSecr=4155167436
 
Out of memory: packet #1511 is dropped. Since e2142825c120 ("net: tcp:
send zero-window ACK when no memory"), the advertised window is set to
zero. Since 8c670bdfa58e ("tcp: correct handling of extreme memory
squeeze"), the right edge of the window actually moves to the left (to
95505):
 1512   0.021987    127.0.0.1 → 127.0.0.1    TCP 226 [TCP ZeroWindow] 37532 → 7000 [PSH, ACK] Seq=152385 Ack=95505 Win=0 Len=160 TSval=4155167436 TSecr=4155167436
 1513   0.029462    127.0.0.1 → 127.0.0.1    TCP 65549 [TCP ZeroWindow] 37532 → 7000 [ACK] Seq=152545 Ack=95505 Win=0 Len=65483 TSval=4155167444 TSecr=4155167436
 
The server has already sent data up to 95632, and continues with that
Seq, but from the client’s point of view this is now beyond the
receive window:
 1514   0.029847    127.0.0.1 → 127.0.0.1    TCP 66 7000 → 37532 [ACK] Seq=95632 Ack=218028 Win=191104 Len=0 TSval=4155167444 TSecr=4155167436
 
The client sends an ACK in reply to #1514 because it is considered outside of the window:
 1515   0.029856    127.0.0.1 → 127.0.0.1    TCP 66 [TCP ZeroWindow] 37532 → 7000 [ACK] Seq=218028 Ack=95505 Win=0 Len=0 TSval=4155167444 TSecr=4155167436
 1516   0.033116    127.0.0.1 → 127.0.0.1    TCP 42615 [TCP ZeroWindow] 37532 → 7000 [PSH, ACK] Seq=218028 Ack=95505 Win=0 Len=42549 TSval=4155167448 TSecr=4155167436
 1517   0.037074    127.0.0.1 → 127.0.0.1    TCP 65549 [TCP ZeroWindow] 37532 → 7000 [ACK] Seq=260577 Ack=95505 Win=0 Len=65483 TSval=4155167451 TSecr=4155167436
 1518   0.037093    127.0.0.1 → 127.0.0.1    TCP 66 7000 → 37532 [ACK] Seq=95632 Ack=326060 Win=218240 Len=0 TSval=4155167452 TSecr=4155167448
 
All acks since #1514 are dropped. Thus, we see a retransmission of #1512.
 1519   0.229190    127.0.0.1 → 127.0.0.1    TCP 226 [TCP ZeroWindow] [TCP Spurious Retransmission] 37532 → 7000 [PSH, ACK] Seq=152385 Ack=95505 Win=0 Len=160 TSval=4155167644 TSecr=4155167436
 
The server retransmits its last unacked segment. Since 9ca48d616e
("tcp: do not accept packets beyond window"), this segment is dropped
as it extends beyond the window. (Before, this packet passed the
sequence number check, and the client sent a window of fresh data on
each retransmission attempt, and only then.)
 1520   0.229201    127.0.0.1 → 127.0.0.1    TCP 193 [TCP Retransmission] 7000 → 37532 [PSH, ACK] Seq=95505 Ack=326060 Win=240896 Len=127 TSval=4155167644 TSecr=4155167448
 1521   0.229206    127.0.0.1 → 127.0.0.1    TCP 78 [TCP Dup ACK 1518#1] 7000 → 37532 [ACK] Seq=95632 Ack=326060 Win=240896 Len=0 TSval=4155167644 TSecr=4155167448 SLE=152385 SRE=152545
 
The connection is effectively stuck: neither acks nor retransmissions
from the server are even being looked at.

I tried reverting 8c670bdfa58e ("tcp: correct handling of extreme
memory squeeze") on top of 6.19-rc6. The connection does not hang, but
it is broken from a protocol perspective:

0.018805    127.0.0.1 → 127.0.0.1    TCP 226 [TCP ZeroWindow] 34358 → 7000 [PSH, ACK] Seq=151425 Ack=95505 Win=0 Len=160 TSval=602409183 TSecr=602409183
0.024775    127.0.0.1 → 127.0.0.1    TCP 65549 34358 → 7000 [ACK] Seq=151585 Ack=95505 Win=65536 Len=65483 TSval=602409189 TSecr=602409183
0.024800    127.0.0.1 → 127.0.0.1    TCP 193 7000 → 34358 [PSH, ACK] Seq=95632 Ack=217068 Win=97408 Len=127 TSval=602409189 TSecr=602409183

When "net.ipv4.tcp_shrink_window=1" is set, packets that cause the
window to shrink (to zero) are accepted instead of being dropped.
This helps in this particular scenario, since there is only one
packet in flight. However, when there are still packets in flight at
the moment the window is closed, those packets are beyond the window
once they arrive (which is correct), but all further packets
sent by the server are regarded as beyond the window as well.

I am not sure what to make of all of this. It seems that we cannot
always avoid shrinking the receive window (if window scaling is used).
Do we need to track the maximum advertised right edge for
sequence number validation?
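To make that question concrete, here is a rough sketch of the idea (illustrative Python, not kernel code; the class, field names, and the simplified non-wrapping sequence check are all made up — the real validation lives in the kernel's incoming-segment checks). It remembers the largest right edge ever advertised and validates sequence numbers against that, so a segment the peer legitimately sent under an older, larger window is not rejected after the window shrank:

```python
class RcvWindow:
    """Hypothetical receiver-side window state (no sequence wraparound)."""

    def __init__(self, rcv_nxt, rcv_wnd):
        self.rcv_nxt = rcv_nxt
        self.rcv_wnd = rcv_wnd
        # track the maximum right edge ever advertised to the peer
        self.max_right_edge = rcv_nxt + rcv_wnd

    def advertise(self, rcv_nxt, rcv_wnd):
        self.rcv_nxt = rcv_nxt
        self.rcv_wnd = rcv_wnd
        self.max_right_edge = max(self.max_right_edge, rcv_nxt + rcv_wnd)

    def in_window(self, seq, length):
        # accept what the peer could legitimately have sent under any
        # window we ever advertised, not just the current (shrunk) one
        return seq >= self.rcv_nxt and seq + length <= self.max_right_edge

# numbers loosely mirroring the trace: server data at 95505, Len=127
w = RcvWindow(rcv_nxt=95505, rcv_wnd=65536)
w.advertise(rcv_nxt=95505, rcv_wnd=0)   # memory squeeze: window shrinks to 0
print(w.in_window(95505, 127))          # True: within the old right edge
```

With a check like this, the retransmission of packet #1520 (Seq=95505, Len=127) would pass sequence validation even after the zero-window advertisement, instead of being dropped as beyond the window.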

- Simon

-- 
Simon Baatz <gmbnomis@...il.com>
