Message-ID: <CANn89iLwpjs7-1qZ+wvFsav_Th9_PJvHvgfWPhz3wxUJwRx70Q@mail.gmail.com>
Date: Mon, 21 Jul 2025 08:21:38 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Paolo Abeni <pabeni@...hat.com>
Cc: netdev@...r.kernel.org, Neal Cardwell <ncardwell@...gle.com>,
Kuniyuki Iwashima <kuniyu@...gle.com>, "David S. Miller" <davem@...emloft.net>,
David Ahern <dsahern@...nel.org>, Jakub Kicinski <kuba@...nel.org>, Simon Horman <horms@...nel.org>,
Matthieu Baerts <matttbe@...nel.org>
Subject: Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer

On Mon, Jul 21, 2025 at 7:56 AM Paolo Abeni <pabeni@...hat.com> wrote:
>
> On 7/21/25 3:52 PM, Eric Dumazet wrote:
> > On Mon, Jul 21, 2025 at 6:32 AM Paolo Abeni <pabeni@...hat.com> wrote:
> >> On 7/21/25 2:30 PM, Eric Dumazet wrote:
> >>> On Mon, Jul 21, 2025 at 3:50 AM Paolo Abeni <pabeni@...hat.com> wrote:
> >>>> On 7/21/25 10:04 AM, Eric Dumazet wrote:
> >>>>> On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@...hat.com> wrote:
> >>>>>>
> >>>>>> The nipa CI is reporting frequent failures in the mptcp_connect
> >>>>>> self-tests.
> >>>>>>
> >>>>>> In the failing scenarios (TCP -> MPTCP) the involved sockets are
> >>>>>> actually plain TCP ones, as the fallback for passive sockets at 2whs
> >>>>>> time causes the MPTCP listener to actually create a TCP socket.
> >>>>>>
> >>>>>> The transfer is stuck due to the receive buffer being zero.
> >>>>>> With the stronger check in place, tcp_clamp_window() can be invoked
> >>>>>> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer
> >>>>>> will be zeroed, too.
> >>>>>>
> >>>>>> Also pass the current skb truesize to tcp_clamp_window(), so that
> >>>>>> the helper can compute and use the actual limit enforced by
> >>>>>> the stack.
> >>>>>>
> >>>>>> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
> >>>>>> Signed-off-by: Paolo Abeni <pabeni@...hat.com>
> >>>>>> ---
> >>>>>> net/ipv4/tcp_input.c | 12 ++++++------
> >>>>>> 1 file changed, 6 insertions(+), 6 deletions(-)
> >>>>>>
> >>>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >>>>>> index 672cbfbdcec1..c98de02a3c57 100644
> >>>>>> --- a/net/ipv4/tcp_input.c
> >>>>>> +++ b/net/ipv4/tcp_input.c
> >>>>>> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk)
> >>>>>> }
> >>>>>>
> >>>>>> /* 4. Recalculate window clamp after socket hit its memory bounds. */
> >>>>>> -static void tcp_clamp_window(struct sock *sk)
> >>>>>> +static void tcp_clamp_window(struct sock *sk, int truesize)
> >>>>>
> >>>>>
> >>>>> I am unsure about this one. truesize can be 1MB here, do we want that
> >>>>> in general ?
> >>>>
> >>>> I'm unsure too. But I can't think of a different approach?!? If the
> >>>> incoming truesize is 1M the socket should allow for at least 1M rcvbuf
> >>>> size to accept it, right?
> >>>
> >>> What I meant was :
> >>>
> >>> This is the generic point, accepting skb->truesize as additional input
> >>> here would make us more vulnerable, or we could risk other
> >>> regressions.
> >>
> >> Understood, thanks for the clarification.
> >>
> >>> The question is : why does MPTCP end up here in the first place.
> >>> Perhaps an older issue with an incorrectly sized sk_rcvbuf ?
> >>
> >> I collected some more data. The issue happens even with plain TCP
> >> sockets[1].
> >>
> >> The relevant transfer is on top of the loopback device. The scaling_ratio
> >> rapidly grows to 254 - that is, `truesize` and `len` are very close.
> >>
> >> The stall happens when the receiver gets a packet with a slightly less
> >> 'efficient' layout (in the experiment I have handy len is 71424,
> >> truesize 72320) that (almost) fills the receive window.
> >>
> >> On such input, tcp_clamp_window() shrinks the receiver buffer to the
> >> current rmem usage. The same happens on retransmissions until rcvbuf
> >> becomes 0.
> >>
> >> I *think* that catching only the !sk_rmem_alloc case would avoid the
> >> stall, but I think it's a bit 'late'.
> >
> > A packetdrill test here would help understanding your concern.
>
> I fear a complete working script would take a lot of time; let me
> try to sketch just the relevant part:
>
> # receiver state is:
> # rmem=110592 rcvbuf=174650 scaling_ratio=253 rwin=63232
> # no OoO data, no memory pressure,
>
> # the incoming packet is in sequence
> +0 > P. 109297:172528(63232) ack 1
>
> With just the 0 rmem check in tcp_prune_queue(), that function will
> still invoke tcp_clamp_window(), which will shrink the receive buffer to
> 110592.

As long as an ACK is sent back with a smaller RWIN, I think this would
be reasonable in this case.

> tcp_collapse() can't make enough room and the incoming packet will be
> dropped. I think we should instead accept such packet.

Only if not completely off-the-limits...

packetdrill test :

cat eric.pkt
// Test the calculation of receive window values by a bulk data receiver.
--mss=1000
// Set up config.
// Need to set tcp_rmem[1] to what tcp_fixup_rcvbuf() would have set to
// make sure the same kernel behavior after removing tcp_fixup_rcvbuf()
`../common/defaults.sh
../common/set_sysctls.py /proc/sys/net/ipv4/tcp_rmem="4096 131072 15728640"
`
// Create a socket.
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
// Verify that the receive buffer is the tcp_rmem default.
+0 getsockopt(3, SOL_SOCKET, SO_RCVBUF, [131072], [4]) = 0
+0 bind(3, ..., ...) = 0
+0 listen(3, 1) = 0
// Establish a connection.
+.01 < S 0:0(0) win 65535 <mss 1000,nop,nop,sackOK,nop,wscale 6>
+0 > S. 0:0(0) ack 1 win 64240 <mss 1460,nop,nop,sackOK,nop,wscale 8>
+.01 < . 1:1(0) ack 1 win 457
+0 accept(3, ..., ...) = 4
// Verify that the receive buffer is the tcp_rmem default.
+0 getsockopt(3, SOL_SOCKET, SO_RCVBUF, [131072], [4]) = 0
// Check first outgoing window after SYN
+.01 write(4, ..., 1000) = 1000
+0 > P. 1:1001(1000) ack 1 win 251
// Phase 1: Data arrives but app doesn't read from the socket buffer.
+.01 < . 1:60001(60000) ack 1001 win 457
+0 > . 1001:1001(0) ack 60001 win 263
+.01 < . 60001:120001(60000) ack 1001 win 457
+0~+.04 > . 1001:1001(0) ack 120001 win 29
// Incoming packet has a too big skb->truesize, let's send a lower RWIN
+.01 < . 120001:127425(7424) ack 1001 win 457
+0~+.04 > . 1001:1001(0) ack 120001 win 0
// Reset to sysctls defaults
`/tmp/sysctl_restore_${PPID}.sh`