netdev - Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer

Message-ID: <CANn89iJeXXJV-D5g3+hqStM1sH0UZ3jDeZmOu9mM_E_i9ZYaeA@mail.gmail.com>
Date: Mon, 21 Jul 2025 06:52:30 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: Paolo Abeni <pabeni@...hat.com>
Cc: netdev@...r.kernel.org, Neal Cardwell <ncardwell@...gle.com>, 
	Kuniyuki Iwashima <kuniyu@...gle.com>, "David S. Miller" <davem@...emloft.net>, 
	David Ahern <dsahern@...nel.org>, Jakub Kicinski <kuba@...nel.org>, Simon Horman <horms@...nel.org>, 
	Matthieu Baerts <matttbe@...nel.org>
Subject: Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer

On Mon, Jul 21, 2025 at 6:32 AM Paolo Abeni <pabeni@...hat.com> wrote:
>
> On 7/21/25 2:30 PM, Eric Dumazet wrote:
> > On Mon, Jul 21, 2025 at 3:50 AM Paolo Abeni <pabeni@...hat.com> wrote:
> >> On 7/21/25 10:04 AM, Eric Dumazet wrote:
> >>> On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@...hat.com> wrote:
> >>>>
> >>>> The nipa CI is reporting frequent failures in the mptcp_connect
> >>>> self-tests.
> >>>>
> >>>> In the failing scenarios (TCP -> MPTCP) the involved sockets are
> >>>> actually plain TCP ones, as the fallback for a passive socket at 2whs
> >>>> time causes the MPTCP listener to actually create a TCP socket.
> >>>>
> >>>> The transfer is stuck due to the receiver buffer being zero.
> >>>> With the stronger check in place, tcp_clamp_window() can be invoked
> >>>> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer
> >>>> will be zeroed, too.
> >>>>
> >>>> Also pass the current skb truesize to tcp_clamp_window(), so that
> >>>> the helper can compute and use the actual limit enforced by
> >>>> the stack.
> >>>>
> >>>> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
> >>>> Signed-off-by: Paolo Abeni <pabeni@...hat.com>
> >>>> ---
> >>>>  net/ipv4/tcp_input.c | 12 ++++++------
> >>>>  1 file changed, 6 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >>>> index 672cbfbdcec1..c98de02a3c57 100644
> >>>> --- a/net/ipv4/tcp_input.c
> >>>> +++ b/net/ipv4/tcp_input.c
> >>>> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk)
> >>>>  }
> >>>>
> >>>>  /* 4. Recalculate window clamp after socket hit its memory bounds. */
> >>>> -static void tcp_clamp_window(struct sock *sk)
> >>>> +static void tcp_clamp_window(struct sock *sk, int truesize)
> >>>
> >>>
> >>> I am unsure about this one. truesize can be 1MB here, do we want that
> >>> in general ?
> >>
> >> I'm unsure too. But I can't think of a different approach?!? If the
> >> incoming truesize is 1M the socket should allow for at least 1M rcvbuf
> >> size to accept it, right?
> >
> > What I meant was :
> >
> > This is the generic point, accepting skb->truesize as additional input
> > here would make us more vulnerable, or we could risk other
> > regressions.
>
> Understood, thanks for the clarification.
>
> > The question is : why does MPTCP end up here in the first place.
> > Perhaps an older issue with an incorrectly sized sk_rcvbuf ?
>
> I collected some more data. The issue happens even with plain TCP
> sockets[1].
>
> The relevant transfer is on top of the loopback device. The scaling_ratio
> rapidly grows to 254 - that is, `truesize` and `len` are very close.
>
> The stall happens when the receiver gets a packet with a slightly less
> 'efficient' layout (in the experiment I have handy len is 71424,
> truesize 72320) that (almost) fills the receiver window.
>
> On such input, tcp_clamp_window() shrinks the receiver buffer to the
> current rmem usage. The same happens on retransmissions until rcvbuf
> becomes 0.
>
> I *think* that catching only the !sk_rmem_alloc case would avoid the
> stall, but I think it's a bit 'late'.

A packetdrill test here would help to understand your concern.
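
For readers following along, the ratio arithmetic in the quoted report can be
sketched in userspace C. This is a simplified model, not the kernel code: it
assumes the ratio is stored as a fraction of 256 (an 8-bit shift), and the
helper names, the RMEM_TO_WIN_SCALE constant and the clamp to 255 are
illustrative assumptions.

```c
#include <stdint.h>

/* Assumption: the ratio is expressed as a fraction of 256. */
#define RMEM_TO_WIN_SCALE 8

/* Hypothetical helper: payload/truesize ratio scaled to 0..255. */
static uint8_t scaling_ratio(uint32_t len, uint32_t truesize)
{
	uint64_t val = ((uint64_t)len << RMEM_TO_WIN_SCALE) / truesize;

	if (val > 255)		/* clamp is an assumption of this sketch */
		val = 255;
	return val ? (uint8_t)val : 1;
}

/* Hypothetical helper: window that a given buffer space can back. */
static int64_t win_from_space(uint8_t ratio, int64_t space)
{
	return (space * ratio) >> RMEM_TO_WIN_SCALE;
}
```

With the numbers from the report, scaling_ratio(71424, 72320) gives 252, below
the 254 that the socket had already converged to. win_from_space(254, 72320)
gives 71755, so a window sized with the optimistic ratio admits the 71424-byte
payload while its 72320-byte truesize consumes the whole backing space.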

> I'm unsure if we could
> prevent/forbid 'too high' values of scaling_ratio? (Also, I'm
> unsure where exactly to draw the line.)

Indeed, we need to account for a possible variation (i.e. reduction) of
the skb->len/skb->truesize ratio in future packets.

Note that whatever conservative change we make, it will always be
possible to feed packets until RWIN 0 is sent back,
eventually after a bump caused by a prune operation.

Imagine wscale is 8, and a prior RWIN 1 was sent.

Normally we should be able to receive a packet with 256 bytes of
payload, but a typical skb->truesize will be 256+512+4096 (for a NIC
driver using 4K pages).

This is one of the reasons we force wscale to 12 in Google DC :)
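
The arithmetic behind this example can be checked with a small userspace C
sketch. The helper names are hypothetical, and reusing the same 4864-byte
truesize for the wscale 12 case is an assumption made purely for comparison.

```c
#include <stdint.h>

/* Truesize quoted in the example above: 256 + 512 + 4096 bytes. */
#define EXAMPLE_TRUESIZE (256u + 512u + 4096u)

/* Bytes advertised by a raw RWIN field under a given window scale. */
static uint32_t advertised_bytes(uint16_t rwin, unsigned int wscale)
{
	return (uint32_t)rwin << wscale;
}

/* Memory charged per advertised byte, in percent (integer division). */
static uint32_t truesize_percent_of_window(uint32_t truesize, uint32_t window)
{
	return (uint32_t)(((uint64_t)truesize * 100u) / window);
}
```

With wscale 8, RWIN 1 advertises 256 bytes while the skb is charged 4864
bytes, i.e. 1900% of the advertised window. With wscale 12, RWIN 1 advertises
4096 bytes, and the same 4864-byte charge is about 118% of the window.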

>
> Cheers,
>
> Paolo
>
>
> [1] You can run the relevant test by adding '-t' to the mptcp_connect.sh
> command line, but it will take a lot of time to run the 10-20 iterations
> I need to observe the issue. To make it faster I manually trimmed the
> irrelevant test cases.
>