netdev - Re: [EXT] Re: tcp: socket stuck with zero receive window after SACK

Message-ID: <CADVnQyn=MXohOf1vskJcm9VTOeP31Y5AqCPu7B=zZuTB8nh8Eg@mail.gmail.com>
Date: Tue, 20 May 2025 23:04:40 -0400
From: Neal Cardwell <ncardwell@...gle.com>
To: Simon Campion <simon.campion@...pl.com>
Cc: netdev@...r.kernel.org, Eric Dumazet <edumazet@...gle.com>, 
	Yuchung Cheng <ycheng@...gle.com>, Kevin Yang <yyd@...gle.com>, Jon Maloy <jmaloy@...hat.com>
Subject: Re: [EXT] Re: tcp: socket stuck with zero receive window after SACK

cc += Jon Maloy <jmaloy@...hat.com>

On Mon, May 19, 2025 at 11:03 AM Simon Campion <simon.campion@...pl.com> wrote:
>
> Gladly! I attached the output of nstat -az. I ran it twice, right
> before a 602 byte retransmit was received and dropped, and right
> after, in case looking at the diff is helpful.

Thanks, Simon, for the data!

Skimming the data and the code for your kernel (6.6.83), I have a theory:

In your nstat data, we see TcpExtTCPZeroWindowDrop is incremented by 1
when the 602 byte retransmit was received and dropped:

> < TcpExtTCPZeroWindowDrop         485489             0.0
> ---
> > TcpExtTCPZeroWindowDrop         485490             0.0

That SNMP stat (corresponding to the SKB_DROP_REASON_TCP_ZEROWINDOW
drop reason Simon mentioned earlier) is incremented by
tcp_data_queue() when an in-order packet arrives and
tcp_receive_window(tp) == 0, and the packet is dropped.

But, critically, tcp_data_queue() in that code path does not call
tcp_try_rmem_schedule() to try to free up memory.

Why is tcp_receive_window(tp) == 0 in this case? A conjecture:

(a) I bet the machine was under memory pressure earlier,
triggering ICSK_ACK_NOMEM

(b) We can see your kernel 6.6.83 has a backport of the recent bug fix
patch that sets tp->rcv_wnd = 0 upon ICSK_ACK_NOMEM events:

commit b01e7ceb35dcb7ffad413da657b78c3340a09039
Author: Jon Maloy <jmaloy@...hat.com>
Date:   Mon Jan 27 18:13:04 2025 -0500

    tcp: correct handling of extreme memory squeeze

    [ Upstream commit 8c670bdfa58e48abad1d5b6ca1ee843ca91f7303 ]

...

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cfddc94508f0b..3771ed22c2f56 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -263,11 +263,14 @@ static u16 tcp_select_window(struct sock *sk)
        u32 cur_win, new_win;

        /* Make the window 0 if we failed to queue the data because we
-        * are out of memory. The window is temporary, so we don't store
-        * it on the socket.
+        * are out of memory.
         */
-       if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM))
+       if (unlikely(inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM)) {
+               tp->pred_flags = 0;
+               tp->rcv_wnd = 0;
+               tp->rcv_wup = tp->rcv_nxt;
                return 0;
+       }

---
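
To illustrate why these assignments leave tcp_receive_window(tp) == 0,
here is a minimal user-space sketch in C. The toy_ names and the struct
are hypothetical stand-ins, not kernel symbols; only the window
arithmetic (rcv_wup + rcv_wnd - rcv_nxt, clamped at zero) is meant to
mirror the kernel helper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the three tcp_sock fields involved. */
struct toy_tp {
	uint32_t rcv_nxt;	/* next sequence number expected */
	uint32_t rcv_wup;	/* rcv_nxt when the last window update was sent */
	uint32_t rcv_wnd;	/* window advertised in that update */
};

/* Right edge of the advertised window minus rcv_nxt, clamped at zero. */
uint32_t toy_receive_window(const struct toy_tp *tp)
{
	int32_t win = (int32_t)(tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt);

	return win < 0 ? 0 : (uint32_t)win;
}

/*
 * The window-related assignments the patch makes on ICSK_ACK_NOMEM.
 * pred_flags is omitted because it does not affect the arithmetic.
 */
void toy_nomem_window_update(struct toy_tp *tp)
{
	tp->rcv_wnd = 0;
	tp->rcv_wup = tp->rcv_nxt;

	/* Postcondition: the computed receive window is now zero. */
	assert(toy_receive_window(tp) == 0);
}
```

For example, with rcv_nxt = 1000, rcv_wup = 900 and rcv_wnd = 500 the
model computes a window of 400 before the update and 0 after it.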

Putting this all together, a conjecture about what happened:

+ the machine was under memory pressure, which triggered ICSK_ACK_NOMEM

+ this caused the new "tcp: correct handling of extreme memory
squeeze" patch to set tp->rcv_wnd = 0

+ as a result, when the in-order packet arrived, tcp_data_queue() saw
tcp_receive_window(tp) == 0 and dropped the packet, incrementing
TcpExtTCPZeroWindowDrop

+ tcp_data_queue() in that code path does not call
tcp_try_rmem_schedule() to try to free up memory

+ so even if more memory had become available by this point,
tcp_try_rmem_schedule() was never called, because the window had been
zeroed by the new "tcp: correct handling of extreme memory squeeze" patch

I suppose one possible fix would be to change tcp_data_queue() in that
(tcp_receive_window(tp) == 0) case, to make sure it calls
tcp_try_rmem_schedule() to try to free up memory.
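
A rough user-space sketch of that idea, in C. Everything here is a toy
model with hypothetical names (toy_sock, toy_try_rmem_schedule(), the
reclaim_on_zero_window flag); it is not a kernel patch. What should
happen after a successful reclaim (here: the segment is accepted) is an
assumption of the sketch, not something settled in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the receive-side state; not kernel structures. */
struct toy_sock {
	uint32_t rcv_window;		/* stands in for tcp_receive_window(tp) */
	uint32_t rmem_free;		/* memory a reclaim attempt could obtain */
	uint32_t rcv_nxt;		/* next sequence number expected */
	unsigned long zero_window_drops; /* stands in for TcpExtTCPZeroWindowDrop */
};

/* Toy stand-in for tcp_try_rmem_schedule(): true if len bytes were found. */
bool toy_try_rmem_schedule(struct toy_sock *sk, uint32_t len)
{
	if (sk->rmem_free < len)
		return false;
	sk->rmem_free -= len;
	return true;
}

/*
 * Toy model of the in-order branch. Returns true if the segment was queued.
 * reclaim_on_zero_window == false models the behaviour described above;
 * true models the suggested change.
 */
bool toy_data_queue_in_order(struct toy_sock *sk, uint32_t len,
			     bool reclaim_on_zero_window)
{
	assert(len > 0);

	if (sk->rcv_window == 0) {
		/* Without the change, drop before any reclaim attempt. */
		if (!reclaim_on_zero_window ||
		    !toy_try_rmem_schedule(sk, len)) {
			sk->zero_window_drops++;
			return false;
		}
		/* ASSUMPTION: a successful reclaim lets the segment in. */
		sk->rcv_nxt += len;
		return true;
	}

	/* Non-zero window: memory is scheduled before queueing. */
	if (!toy_try_rmem_schedule(sk, len))
		return false;
	sk->rcv_nxt += len;
	sk->rcv_window = sk->rcv_window > len ? sk->rcv_window - len : 0;
	return true;
}
```

Starting from rcv_window = 0 with memory available, a 602 byte segment
is dropped and counted when the flag is false, and queued when the flag
is true.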

Eric and Jon, WDYT?

It's a bit past my bedtime here in NYC so I may not be thinking straight.... :-)

thanks,
neal