lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAP1cthn5mm3H7Gi46yNrvzRD-mbbjzQWLUFiBeKbzo6x27uUTg@mail.gmail.com>
Date: Tue, 15 Apr 2025 15:04:04 +0200
From: Jacob Melander <jmelander@...estorage.com>
To: netdev@...r.kernel.org
Subject: [BUG] permanently hung TCP connection when using tcp_shrink_window

Hi,

I believe I have found a bug in the handling of tcp_shrink_window --
when the option is in use and it shrinks the window to a sequence
before wnd_sl1 (which can happen under packet reordering/loss where a
previous packet has updated window without being acked), this can
cause the connection to be permanently hung since it's impossible to
receive window updates.

I have narrowed down this issue by adding extra fields to tcp_probe.
These are shown in my excerpts below and correspond to their
respective fields in the tcp socket struct. The th_* fields added are
the corresponding fields from the tcp header that triggered the
tcp_probe.

My setup is a client and a server communicating with each other over
RPC/NFS (although the protocol should be entirely irrelevant). The
client is using a 4.18-based kernel with patches (although from
everything I can see the patches do not affect any of this code). The
server is using a 6.6.9-based kernel with patches (once again, the
patches should not affect any of this code). One thing worth
mentioning is that the server side application does not infinitely
receive - if the peer does not consume enough data then the
application stops receiving from the connection (in order to prevent
infinite memory usage). The server is also using keepalives on the
connection - these do not help resolving the hang/deadlock.

So let's walk through it, starting a few minutes into a quite busy
connection (~300 MB/s).

Packet 1 - stream is functioning normally but we have sent out
zero-window due to our buffer filling up (server application received
too slow most likely).
The incoming packet should be dealt with through SLOW_PATH since we
have sent out a zero-window.
The packet itself is a pure ACK - no data
tcp_validate_incoming -> tcp_sequence passes since th_seq == rcv_wup
and th_seq == rcv_nxt.
snd_wnd is updated accordingly to 35968 (th_window * scaling factor).

<idle>-0       [190] ..s8. 35789.976850: tcp_probe: family=AF_INET6
src=[::ffff:192.168.8.74]:2049 dest=[::ffff:192.168.8.4]:33960
mark=0x0 data_len=0 snd_nxt=0xa7288b3f snd_una=0xa727e7bf snd_cwnd=44
ssthresh=33 snd_wnd=41856 srtt=96 rcv_wnd=0 sock_cookie=31703
th_window=8992 th_seq=0x8dfe11af th_ack_seq=0xa7280ccb
rcv_nxt=0x8dfe11af rcv_wup=0x8dfe11af snd_wl1=0x8dfe11af


packet 2 - stream is still functioning and we are still in zero-window.
The incoming packet should be dealt with through SLOW_PATH since we
have sent out a zero-window.
The packet itself is a pure ACK - no data
tcp_validate_incoming -> tcp_sequence passes since th_seq == rcv_wup
and th_seq == rcv_nxt.
The only thing that has changed here compared to packet 1 is that the
client acked more data
snd_wnd is updated accordingly to 17664.

<idle>-0       [190] ..s8. 35789.976939: tcp_probe: family=AF_INET6
src=[::ffff:192.168.8.74]:2049 dest=[::ffff:192.168.8.4]:33960
mark=0x0 data_len=0 snd_nxt=0xa7288b3f snd_una=0xa7280ccb snd_cwnd=44
ssthresh=33 snd_wnd=35968 srtt=94 rcv_wnd=0 sock_cookie=31703
th_window=4416 th_seq=0x8dfe11af th_ack_seq=0xa7288b3f
rcv_nxt=0x8dfe11af rcv_wup=0x8dfe11af snd_wl1=0x8dfe11af


packet 3 - stream is still functioning and we are no longer in zero-window
The incoming packet should be dealt with through SLOW_PATH since
th_seq != rcv_nxt
tcp_validate_incoming -> tcp_sequence passes since th_seq > rcv_wup
and th_seq > (rcv_nxt + rcv_wnd (0x8e05792f)).
snd_wnd is updated accordingly to 0 (the peer is now advertising zero-window!).

<idle>-0       [190] ..s8. 35789.977019: tcp_probe: family=AF_INET6
src=[::ffff:192.168.8.74]:2049 dest=[::ffff:192.168.8.4]:33960
mark=0x0 data_len=3808 snd_nxt=0xa728d03f snd_una=0xa7288b3f
snd_cwnd=44 ssthresh=33 snd_wnd=17664 srtt=97 rcv_wnd=485248
sock_cookie=31703 th_window=0 th_seq=0x8dfe7a8b th_ack_seq=0xa728d03f
rcv_nxt=0x8dfe11af rcv_wup=0x8dfe11af snd_wl1=0x8dfe11af


packet 4 - stream is still functioning, peer is now in zero-window, we
are not in zero-window
The incoming packet should be dealt with through SLOW_PATH since
th_seq != rcv_nxt.
tcp_validate_incoming -> tcp_sequence passes since th_seq > rcv_wup
and th_seq > (rcv_nxt + rcv_wnd (0x8e0567af)).
window advertised by peer is still 0.
note that we have now revoked part of the window we previously
advertised - i.e. the last valid seq is now lower.

<idle>-0       [190] ..s8. 35789.977125: tcp_probe: family=AF_INET6
src=[::ffff:192.168.8.74]:2049 dest=[::ffff:192.168.8.4]:33960
mark=0x0 data_len=33824 snd_nxt=0xa728d03f snd_una=0xa728d03f
snd_cwnd=44 ssthresh=33 snd_wnd=0 srtt=95 rcv_wnd=480768
sock_cookie=31703 th_window=0 th_seq=0x8dfe896b th_ack_seq=0xa728d03f
rcv_nxt=0x8dfe11af rcv_wup=0x8dfe11af snd_wl1=0x8dfe7a8b

packet 5-8 - This is where the stream starts breaking down. We have
now entered zero-window (server side application stopped receiving
because the peer wasn't receiving responses most likely).
The connection stays this way for the rest of its lifetime.
The incoming packet should be dealt with through SLOW_PATH since we
have sent out a zero-window (and th_seq != rcv_nxt).
tcp_validate_incoming -> tcp_sequence now fails since th_seq > rcv_nxt
+ rcv_wnd (rcv_wnd is 0), this packet will be discarded.
window advertised by peer is still 0 (but we can't note this down
anyways since we discarded the packet).

<idle>-0       [190] ..s8. 35789.977127: tcp_probe: family=AF_INET6
src=[::ffff:192.168.8.74]:2049 dest=[::ffff:192.168.8.4]:33960
mark=0x0 data_len=42736 snd_nxt=0xa728d03f snd_una=0xa728d03f
snd_cwnd=44 ssthresh=33 snd_wnd=0 srtt=95 rcv_wnd=0 sock_cookie=31703
th_window=0 th_seq=0x8dff0d8b th_ack_seq=0xa728d03f rcv_nxt=0x8dfe11af
rcv_wup=0x8dfe11af snd_wl1=0x8dfe896b
<idle>-0       [190] ..s8. 35789.977139: tcp_probe: family=AF_INET6
src=[::ffff:192.168.8.74]:2049 dest=[::ffff:192.168.8.4]:33960
mark=0x0 data_len=37488 snd_nxt=0xa728d03f snd_una=0xa728d03f
snd_cwnd=44 ssthresh=33 snd_wnd=0 srtt=95 rcv_wnd=0 sock_cookie=31703
th_window=0 th_seq=0x8dffb47b th_ack_seq=0xa728d03f rcv_nxt=0x8dfe11af
rcv_wup=0x8dfe11af snd_wl1=0x8dfe896b
<idle>-0       [190] ..s8. 35789.977141: tcp_probe: family=AF_INET6
src=[::ffff:192.168.8.74]:2049 dest=[::ffff:192.168.8.4]:33960
mark=0x0 data_len=26928 snd_nxt=0xa728d03f snd_una=0xa728d03f
snd_cwnd=44 ssthresh=33 snd_wnd=0 srtt=95 rcv_wnd=0 sock_cookie=31703
th_window=0 th_seq=0x8e0046eb th_ack_seq=0xa728d03f rcv_nxt=0x8dfe11af
rcv_wup=0x8dfe11af snd_wl1=0x8dfe896b
<idle>-0       [190] ..s8. 35789.977149: tcp_probe: family=AF_INET6
src=[::ffff:192.168.8.74]:2049 dest=[::ffff:192.168.8.4]:33960
mark=0x0 data_len=37488 snd_nxt=0xa728d03f snd_una=0xa728d03f
snd_cwnd=44 ssthresh=33 snd_wnd=0 srtt=95 rcv_wnd=0 sock_cookie=31703
th_window=0 th_seq=0x8e00b01b th_ack_seq=0xa728d03f rcv_nxt=0x8dfe11af
rcv_wup=0x8dfe11af snd_wl1=0x8dfe896b


packet 9 - peer starts advertising a non-zero window (most likely
because the client application caught up and started reading the data
sent to it).
The incoming packet should be dealt with through SLOW_PATH since we
have sent out a zero-window (and th_seq != rcv_nxt).
As with previous 3 packets, we discard it since th_seq > rcv_nxt +
rcv_wnd (rcv_wnd is 0).
As such, we don't recognize that the window has opened up.

<idle>-0       [190] ..s8. 35789.977152: tcp_probe: family=AF_INET6
src=[::ffff:192.168.8.74]:2049 dest=[::ffff:192.168.8.4]:33960
mark=0x0 data_len=0 snd_nxt=0xa728d03f snd_una=0xa728d03f snd_cwnd=44
ssthresh=33 snd_wnd=0 srtt=95 rcv_wnd=0 sock_cookie=31703
th_window=5024 th_seq=0x8e01428b th_ack_seq=0xa728d03f
rcv_nxt=0x8dfe11af rcv_wup=0x8dfe11af snd_wl1=0x8dfe896b


packet 10 - peer now retries sending actual data (even though we are
telling it the window is 0). I'm not sure how legal this is from a TCP
standpoint - I assume we cannot actually rely on peers doing this?
The incoming packet should be dealt with through SLOW_PATH since we
have sent out a zero-window
This packet is actually not discarded for once - th_seq == rcv_next,
and it will go on to be processed as far as I can tell.
Assuming this packet gets to tcp_ack_update_window (I have not fully
analyzed the path in between - maybe it doesn't
even get there), then in tcp_ack_update_window we will call
tcp_may_update_window.

If you are looking at the source here - note the following definitions
for ack_seq and ack respectively (they correspond to th_seq and
th_ack_seq -- naming is hard):
    u32 ack_seq = TCP_SKB_CB(skb)->seq;
    u32 ack = TCP_SKB_CB(skb)->ack_seq;

ack == th_ack_seq == snd_una
ack_seq == th_seq < snd_wl1
As such, tcp_may_update_window will discard the update since no new
data is acked and the peer has previously sent a
zero-window in a later ack_seq.

And the actual data in the packet is discarded since we are in
zero-window and we have no room to store the data.

<idle>-0       [190] ..s8. 35789.977152: tcp_probe: family=AF_INET6
src=[::ffff:192.168.8.74]:2049 dest=[::ffff:192.168.8.4]:33960
mark=0x0 data_len=8948 snd_nxt=0xa728d03f snd_una=0xa728d03f
snd_cwnd=44 ssthresh=33 snd_wnd=0 srtt=95 rcv_wnd=0 sock_cookie=31703
th_window=5024 th_seq=0x8dfe11af th_ack_seq=0xa728d03f
rcv_nxt=0x8dfe11af rcv_wup=0x8dfe11af snd_wl1=0x8dfe896b


>From here on out it's duplicates of the same pattern seen above. As
far as I can tell there is not any packet the peer could send us to
make us recognize it has opened up its window - either the packet seq
exceeds the valid window and gets discarded, or the packet seq comes
before snd_wl1 and is therefore not respected as a window update.

I did look through the code and tried to come up with some potential
fixes for the issue -- please have respect for me being a novice in
this area, so this may be mumbo jumbo, just thought it's hopefully at
least somewhat useful:
- Looking at the code, I wonder if ack seq could be used together with
seq for tracking the last window update, instead of merely relying on
seq. As I understand it - the reasoning behind snd_wl1 is to order
window updates with respect to each other - so that an earlier window
update is not received out of order with a later one causing an
unnecessarily pessimistic understanding of the window. In this regard
it should also hold true that a packet with an ack seq higher than a
previous packet is ordered strictly after that packet as sent from the
peer, right? Since I assume it is not legal to un-ack something which
has been acked. Assuming this, then perhaps the window could also be
updated when receiving a packet with an ack seq >= the latest ack seq?
Possibly also checking whether the new window is greater than the
previous one, so that only window increases are allowed using this
mechanism? Without updating packet discarding logic, I guess this does
rely on the peer doing a resend that exceeds the valid window though
(is that how all tcp peers behave)?
- Another idea would be that if it's possible to not early-discard
packets that are within the previously accepted window (before it was
shrunk), then it looks to me like the connection could eventually
recover via keepalive acks. This solution does feel unsatisfying
because there's still a stall of the connection until the keepalive
ack happens, but I think this would resolve the permanently hung
connection. I'm not sure how feasible it is though since it sounds
complicated to ensure all the paths can handle a packet which
should've been discarded in "normal conditions"?
- Yet another solution might be to never allow shrinking the window
past snd_wl1? I'm not sure about this one though because it relies on
the peer possibly resending data while we are advertising zero-window.
If it only does keepalive acks then those will also be outside of the
valid window and discarded - for this condition the peer would have to
send a packet at the magic seq of snd_wl1 to update the window (but at
least we won't discard the magic packet when the peer sends it).

Please let me know if any additional information is required.
I am able to test any patches as necessary, ideally mergeable on 6.6.9
if at all possible (that makes my life a lot easier). If not, I can
take a stab at merging them to 6.6.9 anyhow.

Thank you,
Jacob

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ