Message-ID: <CADVnQynDkHVmTdnMZ+ZvDtwF9EVcOOphDbr+eLUMBijbc+2nQw@mail.gmail.com>
Date: Thu, 22 May 2025 09:02:36 -0400
From: Neal Cardwell <ncardwell@...gle.com>
To: Simon Campion <simon.campion@...pl.com>
Cc: netdev@...r.kernel.org, Eric Dumazet <edumazet@...gle.com>, 
	Yuchung Cheng <ycheng@...gle.com>, Kevin Yang <yyd@...gle.com>, Jon Maloy <jmaloy@...hat.com>
Subject: Re: Re: [EXT] Re: tcp: socket stuck with zero receive window after SACK

On Thu, May 22, 2025 at 6:34 AM Simon Campion <simon.campion@...pl.com> wrote:
>
> On Wed, 21 May 2025 at 17:56, Neal Cardwell <ncardwell@...gle.com> wrote:
> > For my education, why do you set net.ipv4.tcp_shrink_window=1?
>
> We enabled it mainly as an attempt to decrease the frequency of a
> different issue in which jumbo frames were dropped indefinitely on a
> host, presumably after memory pressure, discussed in [1]. The jumbo
> frame issue is most likely triggered by system-wide memory pressure
> rather than hitting net.ipv4.tcp_mem. So,
> net.ipv4.tcp_shrink_window=1, which, as far as we understand, makes
> hitting net.ipv4.tcp_mem less likely, probably didn't help with
> decreasing the frequency of the jumbo frame issue. But the issue had a
> sufficiently serious impact, and we were sufficiently unsure about the
> root cause, that we deemed net.ipv4.tcp_shrink_window=1 worth a try.
> (Also, the rationale behind net.ipv4.tcp_shrink_window=1 laid out in
> [2] and [3] sounded reasonable.)
>
> But yes, it's feasible for us to revert to the default
> net.ipv4.tcp_shrink_window=0, in particular because there's another
> workaround for the jumbo frame issue: reduce the MTU. We set
> net.ipv4.tcp_shrink_window=0 yesterday and haven't seen the issue
> since. So:
>
> 6.6.74 + net.ipv4.tcp_shrink_window=1: issue occurs
> 6.6.83 + net.ipv4.tcp_shrink_window=1: issue occurs
> 6.6.74 + net.ipv4.tcp_shrink_window=0: no issue so far
> 6.6.83 + net.ipv4.tcp_shrink_window=0: no issue so far
>
> Since the issue occurred sporadically, it's too soon to be fully
> confident that it's gone with net.ipv4.tcp_shrink_window=0. We'll
> write again in a week or so to confirm.

Thanks for the data points and testing!

I agree it will take a while to gather more confidence that the issue
is gone for your workload with net.ipv4.tcp_shrink_window=0.

Based on your data, my current sense is that for your workload the
buggy behavior was triggered by net.ipv4.tcp_shrink_window=1.

However, AFAICT, with the current code there could be similar problems
with the default net.ipv4.tcp_shrink_window=0 setting if the socket
suffers a memory pressure event while there is only a tiny amount of
free receive buffer space.
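
Roughly, the difference between the two settings plays out like the toy
userspace model below. This is not the kernel's actual
__tcp_select_window() code; the function name and the numbers are made
up purely to illustrate the point:

  /* Toy model of the receiver's advertised-window choice.  NOT the
   * kernel's real __tcp_select_window(); names and numbers are invented
   * to illustrate the discussion above. */
  #include <stdio.h>

  static unsigned int toy_select_window(unsigned int free_space,
                                        unsigned int cur_win,
                                        unsigned int mss,
                                        int shrink_window)
  {
      if (!shrink_window && free_space < cur_win) {
          /* Legacy behavior: never offer less than what was already
           * advertised, even though free_space has collapsed. */
          return cur_win;
      }
      if (free_space < mss)
          return 0;                           /* nothing useful fits */
      return free_space - free_space % mss;   /* whole-mss units     */
  }

  int main(void)
  {
      /* Receive buffer almost full of unread (e.g. SACKed) data. */
      unsigned int free_space = 700, cur_win = 65535, mss = 1434;

      printf("shrink_window=0 -> advertise %u\n",
             toy_select_window(free_space, cur_win, mss, 0));
      printf("shrink_window=1 -> advertise %u\n",
             toy_select_window(free_space, cur_win, mss, 1));
      return 0;
  }

In this toy model, shrink_window=0 keeps reporting the stale 65535-byte
window, while shrink_window=1 reports 0, which is the zero window you
observed.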

> If net.ipv4.tcp_shrink_window=1 seems to have caused this issue, we'd
> still be curious to understand why it leads to TCP connections being
> stuck indefinitely even though the recv-q (as reported by ss) is 0.
> Assuming the recv-q was indeed correctly reported as 0, the issue
> might be that receive buffers can fill up in such a way that the only
> way for data to leave the receive buffer is the receipt of further
> data.
> In particular, the application can't read data out of the receive
> buffer and empty it that way. Maybe filling up buffers with data
> received out-of-order (whether we SACK it or not) satisfies this
> condition. This would explain why we saw this issue only in the
> presence of SACK flags before we disabled SACK. With
> net.ipv4.tcp_shrink_window=1, a full receive buffer leads to a zero
> window being advertised (see [2]), and if the buffer filled up in such
> a way that no data can leave until further data is received, we are
> stuck forever because the kernel drops incoming data due to the zero
> window. In contrast, with net.ipv4.tcp_shrink_window=0, we keep
> advertising a non-zero window, so incoming data isn't dropped and data
> can leave the receive buffer.

Yes, this matches my theory of the case as well.

Except I would add that, with net.ipv4.tcp_shrink_window=0, AFAICT a
receiver on recent kernels can still get into a similar situation: a
memory pressure event while there is only a tiny amount of free receive
buffer can cause the receiver to set tp->rcv_wnd to 0, after which the
receiver keeps advertising a zero window and dropping incoming data
without pruning SACKed skbs to make room in the receive buffer.

(It sounds like in your case net.ipv4.tcp_shrink_window=1 is
triggering the problem, rather than the memory pressure issue.)
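
Either way, the end state is the same kind of deadlock. A rough sketch
of that state, with made-up byte counts and no real kernel code:

  /* Rough sketch of the stuck state: the receive buffer is occupied by
   * SACKed out-of-order data, the advertised window is 0, and the one
   * retransmitted segment that would let the application drain the
   * buffer keeps being dropped.  All numbers are invented. */
  #include <stdio.h>

  int main(void)
  {
      unsigned int ooo_bytes  = 64934; /* SACKed out-of-order data held  */
      unsigned int hole_bytes = 602;   /* missing in-order segment       */
      unsigned int rcv_wnd    = 0;     /* window we advertised           */
      unsigned int readable   = 0;     /* in-order data the app can read */

      for (int rtx = 1; rtx <= 5; rtx++) {
          if (hole_bytes > rcv_wnd) {
              /* Zero window: the retransmitted hole is dropped, and the
               * SACKed skbs are not pruned to make room for it. */
              printf("retransmit %d: dropped (rcv_wnd=%u, ooo=%u)\n",
                     rtx, rcv_wnd, ooo_bytes);
              continue;
          }
          /* Only reachable if the window opened: the hole is filled,
           * everything becomes in-order, and the app can finally read. */
          readable = ooo_bytes + hole_bytes;
          ooo_bytes = 0;
          printf("retransmit %d: accepted, app can read %u bytes\n",
                 rtx, readable);
          break;
      }
      return 0;
  }

Nothing changes between iterations, which is the "stuck forever"
behavior: the app sees recv-q 0, the peer sees a zero window, and the
SACKed data just sits there.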

> ... I'm speculating here; once we confirm that
> the issue seems to have been triggered by
> net.ipv4.tcp_shrink_window=1, I'd be keen to hear other thoughts as to
> why the setting may have this effect in certain environments.

I suspect the environmental factors that make your workload
susceptible to these issues include the following:

+ the amount of space used by the NIC driver on the receiver to hold
incoming packets may be large relative to the rcvmss

+ the variation in incoming packet sizes (the hole was 602 bytes while
the rcvmss is larger, at 1434 bytes) may be causing challenges

+ the packet loss is definitely causing challenges for the algorithm,
since the SACKed out-of-order data can eat up most of the space needed
to buffer the packet that fills the hole, which is what would let the
app read data out of the receive buffer and free up more space (see
the rough numbers below)
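
To put rough numbers on that last point: the buffer sizes below are
invented, and the whole-rcvmss rounding is a simplification rather than
the exact kernel logic; only the 602-byte hole and the 1434-byte rcvmss
come from your report.

  /* Back-of-the-envelope arithmetic for the last bullet above.  Buffer
   * sizes are invented; only the 602-byte hole and the 1434-byte rcvmss
   * come from the reported trace. */
  #include <stdio.h>

  int main(void)
  {
      unsigned int rcv_buf = 131072;  /* receive buffer budget, bytes */
      unsigned int sacked  = 130000;  /* SACKed out-of-order bytes    */
      unsigned int rcvmss  = 1434;    /* receiver-estimated MSS       */
      unsigned int hole    = 602;     /* bytes missing in sequence    */
      unsigned int free_sp = rcv_buf - sacked;

      printf("free space:        %u bytes\n", free_sp);
      printf("hole to fill:      %u bytes\n", hole);
      printf("advertised window: %u bytes\n",
             free_sp >= rcvmss ? free_sp - free_sp % rcvmss : 0);
      /* 1072 bytes of free space would be enough for the 602-byte hole,
       * but it is less than one rcvmss, so a window computed in
       * whole-rcvmss units comes out as 0 and the retransmission that
       * would fill the hole is never accepted. */
      return 0;
  }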

Thanks,
neal


> [1] https://marc.info/?l=linux-netdev&m=174600337131981&w=2
> [2] https://github.com/torvalds/linux/commit/b650d953cd391595e536153ce30b4aab385643ac
> [3] https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
