netdev - Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer

Message-ID: <cc1cb5c2-9652-4b01-9008-22965685b73b@redhat.com>
Date: Mon, 21 Jul 2025 18:17:29 +0200
From: Paolo Abeni <pabeni@...hat.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: netdev@...r.kernel.org, Neal Cardwell <ncardwell@...gle.com>,
 Kuniyuki Iwashima <kuniyu@...gle.com>, "David S. Miller"
 <davem@...emloft.net>, David Ahern <dsahern@...nel.org>,
 Jakub Kicinski <kuba@...nel.org>, Simon Horman <horms@...nel.org>,
 Matthieu Baerts <matttbe@...nel.org>
Subject: Re: [PATCH net-next 1/2] tcp: do not set a zero size receive buffer

On 7/21/25 5:21 PM, Eric Dumazet wrote:
> On Mon, Jul 21, 2025 at 7:56 AM Paolo Abeni <pabeni@...hat.com> wrote:
>> On 7/21/25 3:52 PM, Eric Dumazet wrote:
>>> On Mon, Jul 21, 2025 at 6:32 AM Paolo Abeni <pabeni@...hat.com> wrote:
>>>> On 7/21/25 2:30 PM, Eric Dumazet wrote:
>>>>> On Mon, Jul 21, 2025 at 3:50 AM Paolo Abeni <pabeni@...hat.com> wrote:
>>>>>> On 7/21/25 10:04 AM, Eric Dumazet wrote:
>>>>>>> On Fri, Jul 18, 2025 at 10:25 AM Paolo Abeni <pabeni@...hat.com> wrote:
>>>>>>>>
>>>>>>>> The nipa CI is reporting frequent failures in the mptcp_connect
>>>>>>>> self-tests.
>>>>>>>>
>>>>>>>> In the failing scenarios (TCP -> MPTCP) the involved sockets are
>>>>>>>> actually plain TCP ones, as the fallback for the passive socket at 2whs
>>>>>>>> time causes the MPTCP listener to actually create a TCP socket.
>>>>>>>>
>>>>>>>> The transfer is stuck due to the receiver buffer being zero.
>>>>>>>> With the stronger check in place, tcp_clamp_window() can be invoked
>>>>>>>> while the TCP socket has sk_rmem_alloc == 0, and the receive buffer
>>>>>>>> will be zeroed, too.
>>>>>>>>
>>>>>>>> Also pass the current skb truesize to tcp_clamp_window(), so that
>>>>>>>> the helper can compute and use the actual limit enforced by
>>>>>>>> the stack.
>>>>>>>>
>>>>>>>> Fixes: 1d2fbaad7cd8 ("tcp: stronger sk_rcvbuf checks")
>>>>>>>> Signed-off-by: Paolo Abeni <pabeni@...hat.com>
>>>>>>>> ---
>>>>>>>>  net/ipv4/tcp_input.c | 12 ++++++------
>>>>>>>>  1 file changed, 6 insertions(+), 6 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>>>>>>> index 672cbfbdcec1..c98de02a3c57 100644
>>>>>>>> --- a/net/ipv4/tcp_input.c
>>>>>>>> +++ b/net/ipv4/tcp_input.c
>>>>>>>> @@ -610,24 +610,24 @@ static void tcp_init_buffer_space(struct sock *sk)
>>>>>>>>  }
>>>>>>>>
>>>>>>>>  /* 4. Recalculate window clamp after socket hit its memory bounds. */
>>>>>>>> -static void tcp_clamp_window(struct sock *sk)
>>>>>>>> +static void tcp_clamp_window(struct sock *sk, int truesize)
>>>>>>>
>>>>>>>
>>>>>>> I am unsure about this one. truesize can be 1MB here, do we want that
>>>>>>> in general ?
>>>>>>
>>>>>> I'm unsure too, but I can't think of a different approach. If the
>>>>>> incoming truesize is 1M, the socket should allow for at least 1M rcvbuf
>>>>>> size to accept it, right?
>>>>>
>>>>> What I meant was :
>>>>>
>>>>> This is the generic point, accepting skb->truesize as additional input
>>>>> here would make us more vulnerable, or we could risk other
>>>>> regressions.
>>>>
>>>> Understood, thanks for the clarification.
>>>>
>>>>> The question is : why does MPTCP end up here in the first place.
>>>>> Perhaps an older issue with an incorrectly sized sk_rcvbuf ?
>>>>
>>>> I collected some more data. The issue happens even with plain TCP
>>>> sockets[1].
>>>>
>>>> The relevant transfer is on top of the loopback device. The scaling_ratio
>>>> rapidly grows to 254 - that is, `truesize` and `len` are very close.
>>>>
>>>> The stall happens when the receiver gets a packet with a slightly less
>>>> 'efficient' layout (in the experiment I have handy, len is 71424 and
>>>> truesize is 72320) that (almost) fills the receiver window.
>>>>
>>>> On such input, tcp_clamp_window() shrinks the receiver buffer to the
>>>> current rmem usage. The same happens on retransmissions until rcvbuf
>>>> becomes 0.
>>>>
>>>> I *think* that catching only the !sk_rmem_alloc case would avoid the
>>>> stall, but I think it's a bit 'late'.
>>>
>>> A packetdrill test here would help understanding your concern.
>>
>> I fear a complete working script would take a lot of time; let me
>> try to sketch just the relevant part:
>>
>> # receiver state is:
>> # rmem=110592 rcvbuf=174650 scaling_ratio=253 rwin=63232
>> # no OoO data, no memory pressure,
>>
>> # the incoming packet is in sequence
>> +0 > P. 109297:172528(63232) ack 1
>>
>> With just the 0 rmem check in tcp_prune_queue(), that function will
>> still invoke tcp_clamp_window(), which will shrink the receive buffer to
>> 110592.
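
A minimal userspace sketch of the shrinking described in the quoted scenario, under stated assumptions: `struct rx_model` and the `rx_*` helpers are hypothetical names, not kernel code; the admission check (rmem + truesize must not exceed rcvbuf) and the clamp (rcvbuf is set to the current rmem usage) are simplified from the behaviour discussed in this thread; locks, memory pressure, and rcv_ssthresh are ignored.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the receiver state; the fields mirror the
 * quantities quoted in the thread, not the real struct sock layout. */
struct rx_model {
	int rmem_alloc; /* bytes currently charged to the receive queue */
	int rcvbuf;     /* current receive buffer limit */
	int rmem_max;   /* stand-in for the upper bound on rcvbuf */
};

/* Simplified clamp: shrink rcvbuf to the current memory usage. */
static void rx_clamp(struct rx_model *rx)
{
	if (rx->rcvbuf < rx->rmem_max)
		rx->rcvbuf = rx->rmem_alloc < rx->rmem_max ?
			     rx->rmem_alloc : rx->rmem_max;
}

/* Try to queue a packet of the given truesize. Returns true if it is
 * queued, false if it is dropped (after clamping the buffer). */
static bool rx_receive(struct rx_model *rx, int truesize)
{
	if (rx->rmem_alloc + truesize > rx->rcvbuf) {
		rx_clamp(rx);
		return false;
	}
	rx->rmem_alloc += truesize;
	return true;
}

/* The application reads some bytes, releasing receive memory. */
static void rx_drain(struct rx_model *rx, int bytes)
{
	assert(bytes >= 0 && bytes <= rx->rmem_alloc);
	rx->rmem_alloc -= bytes;
}
```

Starting from rmem_alloc=110592 and rcvbuf=174650, an arrival with an assumed truesize of 64512 (the thread does not give the truesize of that packet) is dropped and rcvbuf becomes 110592. If the reader then drains in steps and each retransmission still does not fit, rcvbuf follows rmem down and reaches 0 once the queue is empty, after which nothing can be accepted.
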
> 
> As long as an ACK is sent back with a smaller RWIN, I think this would
> be reasonable in this case.

I fear some possible regression, as the sender will see some unexpected
drops, even on loopback and while not misbehaving.

But I tested your proposed code here and AFAICS it solves the issue, and I
could not spot anything suspicious so far.

So I'll send a v2 using that.

Thanks!

Paolo
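
For reference, the len/truesize figures quoted earlier in the thread can be turned into a scaling ratio with a small sketch. The assumptions here are mine: the ratio is len/truesize expressed in 1/256 units and clamped to 1..255, and the advertised window is derived from free buffer space by multiplying with that ratio. The helper names are hypothetical and the code is a userspace illustration, not the kernel implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the ratio is kept in 1/256 units (an 8-bit shift). */
#define RMEM_TO_WIN_SCALE 8

/* Ratio of payload bytes to memory actually charged for the packet. */
static uint8_t scaling_ratio(uint32_t len, uint32_t truesize)
{
	uint64_t r;

	assert(truesize > 0);
	r = ((uint64_t)len << RMEM_TO_WIN_SCALE) / truesize;
	if (r > 255)
		r = 255;
	if (r < 1)
		r = 1;
	return (uint8_t)r;
}

/* Convert free buffer space into a window estimate using the ratio. */
static uint32_t win_from_space(uint32_t space, uint8_t ratio)
{
	return (uint32_t)(((uint64_t)space * ratio) >> RMEM_TO_WIN_SCALE);
}
```

With len 71424 and truesize 72320 this gives a ratio of 252, slightly below the 253-254 mentioned in the thread, which is the "less efficient" layout. With 174650 - 110592 = 64058 bytes of free space and a ratio of 253, the window estimate is 63307, in line with the quoted rwin=63232 if the window is rounded down to a multiple of 128 bytes (an assumption about the window scale in that experiment).
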