Message-ID: <CAL+tcoDSf9AQv5h-EB9mSdLqBP7W63Eejh+4PYOSiVATGa-Mgg@mail.gmail.com>
Date: Fri, 7 Mar 2025 08:31:57 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: "Matthieu Baerts (NGI0)" <matttbe@...nel.org>, mptcp@...ts.linux.dev,
Neal Cardwell <ncardwell@...gle.com>, Kuniyuki Iwashima <kuniyu@...zon.com>,
"David S. Miller" <davem@...emloft.net>, David Ahern <dsahern@...nel.org>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Simon Horman <horms@...nel.org>,
netdev@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH net-next] tcp: clamp window like before the cleanup
On Thu, Mar 6, 2025 at 7:18 PM Jason Xing <kerneljasonxing@...il.com> wrote:
>
> On Thu, Mar 6, 2025 at 5:45 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Thu, Mar 6, 2025 at 6:22 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > >
> > > On Wed, Mar 5, 2025 at 10:49 PM Matthieu Baerts (NGI0)
> > > <matttbe@...nel.org> wrote:
> > > >
> > > > A recent cleanup changed the behaviour of tcp_set_window_clamp(). This
> > > > looks unintentional, and affects MPTCP selftests, e.g. some tests
> > > > re-establishing a connection after a disconnect are now unstable.
> > > >
> > > > Before the cleanup, this operation was done:
> > > >
> > > > new_rcv_ssthresh = min(tp->rcv_wnd, new_window_clamp);
> > > > tp->rcv_ssthresh = max(new_rcv_ssthresh, tp->rcv_ssthresh);
> > > >
> > > > The cleanup used the 'clamp' macro which takes 3 arguments -- value,
> > > > lowest, and highest -- and returns a value between the lowest and the
> > > > highest allowable values. This then assumes ...
> > > >
> > > > lowest (rcv_ssthresh) <= highest (rcv_wnd)
> > > >
> > > > ... which doesn't always seem to be the case here according to the MPTCP
> > > > selftests, even when running them without MPTCP, with TCP only.
> > > >
> > > > For example, when we have ...
> > > >
> > > > rcv_wnd < rcv_ssthresh < new_rcv_ssthresh
> > > >
> > > > ... before the cleanup, the rcv_ssthresh was not changed, while after
> > > > the cleanup, it is lowered down to rcv_wnd (highest).
> > > >
> > > > During a simple test with TCP, here are the values I observed:
> > > >
> > > > new_window_clamp (val) rcv_ssthresh (lo) rcv_wnd (hi)
> > > > 117760 (out) 65495 < 65536
> > > > 128512 (out) 109595 > 80256 => lo > hi
> > > > 1184975 (out) 328987 < 329088
> > > >
> > > > 113664 (out) 65483 < 65536
> > > > 117760 (out) 110968 < 110976
> > > > 129024 (out) 116527 > 109696 => lo > hi
> > > >
> > > > Here, we can see that it is not that rare to have rcv_ssthresh (lo)
> > > > higher than rcv_wnd (hi), which results in a different behaviour when
> > > > the clamp() macro is used, even without MPTCP.
> > > >
> > > > Note: new_window_clamp is always out of range (rcv_ssthresh < rcv_wnd)
> > > > here, which seems to be generally the case in my tests with small
> > > > connections.
> > > >
> > > > I then suggest reverting this part, so as not to change the behaviour.
> > > >
> > > > Fixes: 863a952eb79a ("tcp: tcp_set_window_clamp() cleanup")
> > > > Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/551
> > > > Signed-off-by: Matthieu Baerts (NGI0) <matttbe@...nel.org>
> > >
> > > Tested-by: Jason Xing <kerneljasonxing@...il.com>
> > >
> > > Thanks for catching this. I should have done more tests :(
> > >
> > > Now I use netperf with TCP_CRR to test the loopback and can easily see
> > > the case where tp->rcv_ssthresh is larger than tp->rcv_wnd, which means
> > > tp->rcv_wnd is not the upper bound, as you said.
> > >
> > > Thanks,
> > > Jason
> > >
> >
> > Patch looks fine to me but all our tests are passing with the current kernel,
> > and I was not able to trigger the condition.
> >
> > Can you share what precise test you did ?
> >
> > Thanks !
>
> I did the test [1] on the virtual machine running the kernel [2]. After
> seeing your reply, I checked out a clean branch, compiled the kernel
> again with the patch reverted, and rebooted. The case can still be
> reliably reproduced on my machine. Here are some outputs from the BPF
> program [3]:
> sudo bpftrace tcp_cap.bt.2
> Attaching 1 probe...
> 4327813, 4326775, 4310912
> netperf
> tcp_set_window_clamp+1
> tcp_data_queue+1744
> tcp_rcv_established+501
> tcp_v4_do_rcv+369
> tcp_v4_rcv+4800
> ip_protocol_deliver_rcu+65
>
> 4327813, 4326827, 4310912
> netperf
> tcp_set_window_clamp+1
> tcp_data_queue+1744
> tcp_rcv_established+501
> tcp_v4_do_rcv+369
> tcp_v4_rcv+4800
> ip_protocol_deliver_rcu+65
>
> 418081, 417052, 417024
> swapper/11
> tcp_set_window_clamp+1
> tcp_data_queue+1744
> tcp_rcv_established+501
> tcp_v4_do_rcv+369
> tcp_v4_rcv+4800
> ip_protocol_deliver_rcu+65
>
Hi Eric, Matthieu
I did a quick analysis of this case. It turned out that
__release_sock() was processing the skbs in sk_backlog one by one, and then:
1) one skb went into tcp_grow_window(). At the beginning, rcv_ssthresh
was equal to rcv_wnd, but rcv_ssthresh was then increased by 'incr',
which means the updated rcv_ssthresh became larger than rcv_wnd.
2) another skb went into tcp_set_window_clamp(), and, as I saw yesterday,
the issue happened: rcv_ssthresh > rcv_wnd (see the small sketch below).
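To make the difference concrete, here is a minimal user-space sketch
(not the kernel code itself) that replays the arithmetic with the
'lo > hi' values from Matthieu's table, assuming clamp(val, lo, hi)
behaves like min(max(val, lo), hi), as I believe it does in
include/linux/minmax.h:

#include <stdio.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))
#define MAX(a, b) ((a) > (b) ? (a) : (b))
#define CLAMP(val, lo, hi) MIN(MAX(val, lo), hi)

int main(void)
{
	unsigned int new_window_clamp = 128512; /* val, from the table */
	unsigned int rcv_ssthresh = 109595;     /* lo */
	unsigned int rcv_wnd = 80256;           /* hi, note lo > hi here */

	/* before the cleanup: rcv_ssthresh can only grow */
	unsigned int before = MAX(MIN(rcv_wnd, new_window_clamp), rcv_ssthresh);
	/* after the cleanup: clamp() pulls it down to rcv_wnd */
	unsigned int after = CLAMP(new_window_clamp, rcv_ssthresh, rcv_wnd);

	printf("before: %u, after: %u\n", before, after);
	/* prints "before: 109595, after: 80256" */
	return 0;
}

So with the old min()/max() pair rcv_ssthresh stays at 109595, while
clamp() lowers it to 80256 (rcv_wnd), which is exactly the behaviour
change Matthieu described.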
As to rcv_wnd, in synchronised states it can only be changed/adjusted
to new_win in tcp_select_window(). Thus, between the above 1) and 2)
the sk didn't have a chance to update its rcv_wnd value, so rcv_wnd was
still below rcv_ssthresh when 2) ran.
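Below is a toy user-space model of that sequence. It is only a sketch
(the grow_window() body is a simplified assumption, not the real
tcp_grow_window() code), with illustrative numbers in the same ballpark
as the first call trace below:

#include <stdio.h>

/* Toy model: two skbs are drained from sk_backlog back to back, and no
 * ACK is sent in between, so tcp_select_window() never runs and rcv_wnd
 * is not refreshed. All values are illustrative. */
static unsigned int rcv_wnd = 65483;
static unsigned int rcv_ssthresh = 65483;   /* starts equal to rcv_wnd */
static unsigned int window_clamp = 130966;

/* step 1), modelled after tcp_grow_window(): rcv_ssthresh grows by
 * 'incr' and is only capped by window_clamp here, not by rcv_wnd */
static void grow_window(unsigned int incr)
{
	unsigned int grown = rcv_ssthresh + incr;

	rcv_ssthresh = grown < window_clamp ? grown : window_clamp;
}

int main(void)
{
	grow_window(25751);   /* first skb from the backlog */

	/* step 2): the next skb reaches tcp_set_window_clamp(), whose
	 * clamp() now sees lo (rcv_ssthresh) > hi (rcv_wnd) */
	printf("rcv_ssthresh=%u rcv_wnd=%u\n", rcv_ssthresh, rcv_wnd);
	/* prints "rcv_ssthresh=91234 rcv_wnd=65483" */
	return 0;
}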
I attached two consecutive call traces here:
[Fri Mar 7 08:00:52 2025] netserver, 1, 91234, 65483, (25751,130966)
[Fri Mar 7 08:00:52 2025] CPU: 0 UID: 266980 PID: 17465 Comm:
netserver Kdump: loaded Not tainted 6.14.0-rc4+ #415
[Fri Mar 7 08:00:52 2025] Hardware name: Tencent Cloud CVM, BIOS
seabios-1.9.1-qemu-project.org 04/01/2014
[Fri Mar 7 08:00:52 2025] Call Trace:
[Fri Mar 7 08:00:52 2025] <TASK>
[Fri Mar 7 08:00:52 2025] dump_stack_lvl+0x5b/0x70
[Fri Mar 7 08:00:52 2025] dump_stack+0x10/0x20
[Fri Mar 7 08:00:52 2025] tcp_grow_window+0x297/0x320
[Fri Mar 7 08:00:52 2025] tcp_event_data_recv+0x265/0x400
[Fri Mar 7 08:00:52 2025] tcp_data_queue+0x6d0/0xc70
[Fri Mar 7 08:00:52 2025] tcp_rcv_established+0x1f5/0x760
[Fri Mar 7 08:00:52 2025] ? schedule_timeout+0xe5/0x100
[Fri Mar 7 08:00:52 2025] tcp_v4_do_rcv+0x171/0x2d0
[Fri Mar 7 08:00:52 2025] __release_sock+0xd1/0xe0
[Fri Mar 7 08:00:52 2025] release_sock+0x30/0xa0
[Fri Mar 7 08:00:52 2025] inet_accept+0x5c/0x80
[Fri Mar 7 08:00:52 2025] do_accept+0xf1/0x180
.....
[Fri Mar 7 08:00:52 2025] netserver, 0, 91234, 65483
[Fri Mar 7 08:00:52 2025] CPU: 0 UID: 266980 PID: 17465 Comm:
netserver Kdump: loaded Not tainted 6.14.0-rc4+ #415
[Fri Mar 7 08:00:52 2025] Hardware name: Tencent Cloud CVM, BIOS
seabios-1.9.1-qemu-project.org 04/01/2014
[Fri Mar 7 08:00:52 2025] Call Trace:
[Fri Mar 7 08:00:52 2025] <TASK>
[Fri Mar 7 08:00:52 2025] dump_stack_lvl+0x5b/0x70
[Fri Mar 7 08:00:52 2025] dump_stack+0x10/0x20
[Fri Mar 7 08:00:52 2025] tcp_set_window_clamp+0xc3/0x1f0
[Fri Mar 7 08:00:52 2025] tcp_event_data_recv+0x35b/0x400
[Fri Mar 7 08:00:52 2025] tcp_data_queue+0x6d0/0xc70
[Fri Mar 7 08:00:52 2025] tcp_rcv_established+0x1f5/0x760
[Fri Mar 7 08:00:52 2025] ? schedule_timeout+0xe5/0x100
[Fri Mar 7 08:00:52 2025] tcp_v4_do_rcv+0x171/0x2d0
[Fri Mar 7 08:00:52 2025] __release_sock+0xd1/0xe0
[Fri Mar 7 08:00:52 2025] release_sock+0x30/0xa0
[Fri Mar 7 08:00:52 2025] inet_accept+0x5c/0x80
[Fri Mar 7 08:00:52 2025] do_accept+0xf1/0x180
The test is simple: on my virtual machine I just run "netperf -H
127.0.0.1", which uses TCP_STREAM by default.
Thanks,
Jason