Message-ID: <CAL+tcoCsTSW=cMReNEciFcJMCsFq9DxXoduOGFzmGTfrdR60Hw@mail.gmail.com>
Date: Tue, 21 May 2024 08:58:17 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: dsahern@...nel.org, kuba@...nel.org, pabeni@...hat.com,
davem@...emloft.net, ncardwell@...gle.com, netdev@...r.kernel.org,
Jason Xing <kernelxing@...cent.com>
Subject: Re: [RFC PATCH v2 net-next] tcp: remove 64 KByte limit for initial
tp->rcv_wnd value
On Tue, May 21, 2024 at 8:36 AM Jason Xing <kerneljasonxing@...il.com> wrote:
>
> Hello Eric,
>
> On Tue, May 21, 2024 at 12:51 AM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Sat, May 18, 2024 at 4:50 AM Jason Xing <kerneljasonxing@...il.com> wrote:
> > >
> > > From: Jason Xing <kernelxing@...cent.com>
> > >
> > > In 2018, commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin
> > > to around 64KB") limited the initial value of tp->rcv_wnd to 65535. Most
> > > CDN teams would not benefit from that change because they cannot open a
> > > large window to receive big bursts of data, which slows transfers down,
> > > especially on long-RTT paths.
> > >
> > > RFC 7323 says:
> > > "The maximum receive window, and therefore the scale factor, is
> > > determined by the maximum receive buffer space."
> >
> > This seems not relevant ? wscale factor is not changed in this patch ?
> > tp->rcv_wnd is also not the maximum receive window.
>
> Thanks for your review.
>
> I can remove this part. I was trying to claim I do not break RFC.
>
> >
> > >
> > > So we can get rid of this 64k limitation and let the window be tunable if
> > > the user wants to do so, within the limits of the buffer space. Then many
> > > companies, I believe, can have the same behaviour as in the old days.
> >
> > Not sure this has ever worked, see below.
> >
> > Also, the "many companies ..." mention has nothing to do in a changelog.
>
> Oh, I just copied/translated that from my early notes on rcv_wnd, put
> together while reading several papers.
>
> I can also remove this sentence.
>
> >
> >
> > > Besides,
> > > several papers describe experiments involving this window and report
> > > good results in some cases, e.g. paper [1] on the Yahoo! CDN.
> >
> > I think this changelog is trying hard to sell something, but in
> > reality TCP 3WHS nature
> > makes your claims wrong.
> >
> > Instead, you should clearly document that this problem can _not_ be
> > solved for both
> > active _and_ passive connections.
> >
> > In the first RTT, a client (active connection) can not send more than
> > 64KB, if TCP specs
> > are properly applied.
>
> Having a larger rcv_wnd, when the user tweaks this knob, can help
> transfer data more rapidly. I'm not referring to the first RTT.
For the first RTT, it is surely limited to 64 KB at most, as you said.
For the whole transfer, though, the change can accelerate the sending
side and save some RTTs.
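
A rough, illustrative example of what I mean (assuming rcv_wscale = 7
and a long-RTT path; the exact numbers depend on the MSS and buffer
settings): the SYN and SYN+ACK are still capped at 65535 bytes on the
wire, so the first RTT is unchanged. From the ACK that completes the
handshake onwards, though, the receiver can advertise e.g.
1369 << 7 = 175232 bytes instead of 512 << 7 = 65536, so a sender whose
cwnd allows it can have roughly 175 KB in flight per RTT instead of
64 KB while DRS is still ramping up.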

How did I notice this? We upgraded some servers to the latest kernel
and a user-side latency metric showed worse results than before.
Because of this, I spent some time digging into this part.

After applying this patch, the metric is back to the normal/good
values we saw before, which shows the change works.

The CDN team is very sensitive to the latency of sending large amounts
of data over long-RTT paths.
Thanks,
Jason
>
> >
> > >
> > > To avoid future confusion, the current change doesn't affect the initial
> > > receive window advertised on the wire in SYN or SYN+ACK packets, which
> > > stays within 65535 bytes as required by RFC 7323, due to this limit in
> > > __tcp_transmit_skb():
> > >
> > > th->window = htons(min(tp->rcv_wnd, 65535U));
> > >
> > > In short, __tcp_transmit_skb() already ensures that constraint is
> > > respected, no matter how large tp->rcv_wnd is.
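
For readers less familiar with this path, a rough sketch of the logic
(simplified and paraphrased from net/ipv4/tcp_output.c, not the exact
code): only SYN/SYN+ACK use the unscaled, capped value; every later
segment goes through tcp_select_window(), which already returns the
window right-shifted by rcv_wscale.

	/* simplified sketch of __tcp_transmit_skb(), illustration only */
	if (tcb->tcp_flags & TCPHDR_SYN)
		/* handshake packets: window scaling not in effect yet */
		th->window = htons(min(tp->rcv_wnd, 65535U));
	else
		/* established: value already shifted by rcv_wscale */
		th->window = htons(tcp_select_window(sk));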
> > >
> > > Let me give one example, with and without the patch:
> > > Before:
> > > client --- SYN: rwindow=65535 ---> server
> > > client <--- SYN+ACK: rwindow=65535 ---- server
> > > client --- ACK: rwindow=65536 ---> server
> > > Note: for the last ACK, the calculation is 512 << 7.
> > >
> > > After:
> > > client --- SYN: rwindow=65535 ---> server
> > > client <--- SYN+ACK: rwindow=65535 ---- server
> > > client --- ACK: rwindow=175232 ---> server
> > > Note: I use the following command to make it work:
> > > ip route change default via [ip] dev eth0 metric 100 initrwnd 120
> > > For the last ACK, the calculation is 1369 << 7.
> > >
> > > Pay attention to the last ACK of the 3-way handshake: with the patch
> > > applied, the window can exceed 64 KByte.
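
To make those numbers concrete (assuming an MSS of 1460, which is what
my test setup used, and the negotiated rcv_wscale of 7):

	before:  rcv_wnd capped at 65535          ->  512 << 7 =  65536
	after:   initrwnd 120 * 1460 = 175200,
	         rounded to wscale granularity    -> 1369 << 7 = 175232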
> >
> > You carefully avoided mentioning the asymmetry.
> > I do not think this is needed in the changelog, because this is adding
> > confusion.
>
> The case I've met in production is only about whether we are able to
> send more data at once at the very beginning, so right now I care much
> more about the sending side.
>
> >
> > >
> > > [1]: https://conferences.sigcomm.org/imc/2011/docs/p569.pdf
> > >
> > > Signed-off-by: Jason Xing <kernelxing@...cent.com>
> > > ---
> > > v2
> > > Link: https://lore.kernel.org/all/20240517085031.18896-1-kerneljasonxing@gmail.com/
> > > 1. revise the title and body messages (Neal)
> > > ---
> > > net/ipv4/tcp_output.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index 95caf8aaa8be..95618d0e78e4 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -232,7 +232,7 @@ void tcp_select_initial_window(const struct sock *sk, int __space, __u32 mss,
> > > if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_workaround_signed_windows))
> > > (*rcv_wnd) = min(space, MAX_TCP_WINDOW);
> > > else
> > > - (*rcv_wnd) = min_t(u32, space, U16_MAX);
> > > + (*rcv_wnd) = space;
> >
> > This is probably breaking some packetdrill tests, but your change
> > might [1] be good,
>
> I'll do some packetdrill tests and get back some information soon.
>
> > especially because it allows DRS behavior to be consistent for large
> > MTU (eg MTU 9000) and bigger tcp_rmem[1],
> > even without playing with initrwnd attribute.
> >
> > "ss -temoi " would display after connection setup rcv_space:89600
> > instead of a capped value.
> >
> > [1] This is hard to say, DRS is full of surprises.
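
Right, and if I read this correctly, the 89600 here would be the DRS
initial rcv_space of TCP_INIT_CWND (10) segments with an MSS of about
8960 for MTU 9000, i.e. 10 * 8960 = 89600 bytes, which the old 64 KByte
cap on tp->rcv_wnd kept us from reaching (rcvq_space.space is clamped
by rcv_wnd at init time, if I remember the code correctly).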
>
> To avoid confusion, I will remove this link and relevant statements.
>
> Here are my conclusions:
> 1) this change doesn't violate any of the relevant RFCs.
> 2) this change restores the behaviour we had before the 2018 commit in
> this case.
> 3) this change helps in certain cases, especially for the CDN team.
>
> I'll refine the changelog as much as I can, hoping it will not confuse
> the readers.
>
> Thanks,
> Jason