netdev - Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iLwpjs7-1qZ+wvFsav_Th9_PJvHvgfWPhz3wxUJwRx70Q@mail.gmail.com>
Date: Mon, 21 Jul 2025 08:21:38 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Paolo Abeni <pabeni@...hat.com>
Cc: netdev@...r.kernel.org, Neal Cardwell <ncardwell@...gle.com>, 
	Kuniyuki Iwashima <kuniyu@...gle.com>, "David S. Miller" <davem@...emloft.net>, 
	David Ahern <dsahern@...nel.org>, Jakub Kicinski <kuba@...nel.org>, Simon Horman <horms@...nel.org>, 
	Matthieu Baerts <matttbe@...nel.org>
Subject: Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer

On Mon, Jul 21, 2025 at 7:56 AM Paolo Abeni <pabeni@...hat.com> wrote:
>
> On 7/21/25 3:52 PM, Eric Dumazet wrote:
> > On Mon, Jul 21, 2025 at 6:32 AM Paolo Abeni <pabeni@...hat.com> wrote:
> >> On 7/21/25 2:30 PM, Eric Dumazet wrote:
> >>> On Mon, Jul 21, 2025 at 3:50 AM Paolo Abeni <pabeni@...hat.com> wrote:
> >>>> On 7/21/25 10:04 AM, Eric Dumazet wrote:
> >>>>> On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@...hat.com> wrote:
> >>>>>>
> >>>>>> The nipa CI is reporting frequent failures in the mptcp_connect
> >>>>>> self-tests.
> >>>>>>
> >>>>>> In the failing scenarios (TCP -> MPTCP) the involved sockets are
> >>>>>> actually plain TCP ones, as fallback for passive socket at 2whs
> >>>>>> time cause the MPTCP listener to actually create a TCP socket.
> >>>>>>
> >>>>>> The transfer is stuck due to the receiver buffer being zero.
> >>>>>> With the stronger check in place, tcp_clamp_window() can be invoked
> >>>>>> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer
> >>>>>> will be zeroed, too.
> >>>>>>
> >>>>>> Pass to tcp_clamp_window() even the current skb truesize, so that
> >>>>>> such helper could compute and use the actual limit enforced by
> >>>>>> the stack.
> >>>>>>
> >>>>>> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
> >>>>>> Signed-off-by: Paolo Abeni <pabeni@...hat.com>
> >>>>>> ---
> >>>>>>  net/ipv4/tcp_input.c | 12 ++++++------
> >>>>>>  1 file changed, 6 insertions(+), 6 deletions(-)
> >>>>>>
> >>>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >>>>>> index 672cbfbdcec1..c98de02a3c57 100644
> >>>>>> --- a/net/ipv4/tcp_input.c
> >>>>>> +++ b/net/ipv4/tcp_input.c
> >>>>>> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk)
> >>>>>>  }
> >>>>>>
> >>>>>>  /* 4. Recalculate window clamp after socket hit its memory bounds. */
> >>>>>> -static void tcp_clamp_window(struct sock *sk)
> >>>>>> +static void tcp_clamp_window(struct sock *sk, int truesize)
> >>>>>
> >>>>>
> >>>>> I am unsure about this one. truesize can be 1MB here, do we want that
> >>>>> in general ?
> >>>>
> >>>> I'm unsure either. But I can't think of a different approach?!? If the
> >>>> incoming truesize is 1M the socket should allow for at least 1M rcvbuf
> >>>> size to accept it, right?
> >>>
> >>> What I meant was :
> >>>
> >>> This is the generic point, accepting skb->truesize as additional input
> >>> here would make us more vulnerable, or we could risk other
> >>> regressions.
> >>
> >> Understood, thanks for the clarification.
> >>
> >>> The question is : why does MPTCP end up here in the first place.
> >>> Perhaps an older issue with an incorrectly sized sk_rcvbuf ?
> >>
> >> I collected a few more data. The issue happens even with plain TCP
> >> sockets[1].
> >>
> >> The relevant transfer is on top of the loopback device. The scaling_rate
> >> rapidly grows to 254 - that is `truesize` and `len` are very near.
> >>
> >> The stall happens when the received get in a packet with a slightly less
> >> 'efficient' layout (in the experiment I have handy len is 71424,
> >> truesize 72320) (almost) filling the receiver window.
> >>
> >> On such input, tcp_clamp_window() shrinks the receiver buffer to the
> >> current rmem usage. The same happens on retransmissions until rcvbuf
> >> becomes 0.
> >>
> >> I *think* that catching only the !sk_rmem_alloc case would avoid the
> >> stall, but I think it's a bit 'late'.
> >
> > A packetdrill test here would help understanding your concern.
>
> I fear like a complete working script would take a lot of time, let me
> try to sketch just the relevant part:
>
> # receiver state is:
> # rmem=110592 rcvbuf=174650 scaling_ratio=253 rwin=63232
> # no OoO data, no memory pressure,
>
> # the incoming packet is in sequence
> +0 > P. 109297:172528(63232) ack 1
>
> With just the 0 rmem check in tcp_prune_queue(), such function will
> still invoke tcp_clamp_window() that will shrink the receive buffer to
> 110592.

As long as an ACK is sent back with a smaller RWIN, I think this would
be reasonable in this case.

> tcp_collapse() can't make enough room and the incoming packet will be
> dropped. I think we should instead accept such packet.

Only if not completely off-the-limits...

packetdrill test :

cat eric.pkt
// Test the calculation of receive window values by a bulk data receiver.

--mss=1000

// Set up config.
// Need to set tcp_rmem[1] to what tcp_fixup_rcvbuf() would have set to
// make sure the same kernel behavior after removing tcp_fixup_rcvbuf()
`../common/defaults.sh
../common/set_sysctls.py /proc/sys/net/ipv4/tcp_rmem="4096 131072 15728640"
`

// Create a socket.
    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0

// Verify that the receive buffer is the tcp_rmem default.
   +0 getsockopt(3, SOL_SOCKET, SO_RCVBUF, [131072], [4]) = 0

   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

// Establish a connection.
 +.01 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
   +0 > S. 0:0(0) ack 1 win 64240 <mss 1460,nop,nop,sackOK,nop,wscale 8>
 +.01 < . 1:1(0) ack 1 win 457
   +0 accept(3, ..., ...) = 4

// Verify that the receive buffer is the tcp_rmem default.
   +0 getsockopt(3, SOL_SOCKET, SO_RCVBUF, [131072], [4]) = 0

// Check first outgoing window after SYN
 +.01 write(4, ..., 1000) = 1000
   +0 > P. 1:1001(1000) ack 1 win 251

// Phase 1: Data arrives but app doesn't read from the socket buffer.

 +.01 < . 1:60001(60000) ack 1001 win 457
   +0 > . 1001:1001(0) ack 60001 win 263


 +.01 < . 60001:120001(60000) ack 1001 win 457
+0~+.04 > . 1001:1001(0) ack 120001 win 29

// Incoming packet  has a too big skb->truesize, lets send a lower RWIN
 +.01 < . 120001:127425(7424) ack 1001 win 457
+0~+.04 > . 1001:1001(0) ack 120001 win 0

// Reset to sysctls defaults
`/tmp/sysctl_restore_${PPID}.sh`