[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CANn89i++56L396Mhr1LxL3UN6D9uPMGsp5yaDTY5n4bhuir_BQ@mail.gmail.com>
Date: Wed, 27 Jul 2022 14:54:59 +0200
From: Eric Dumazet <edumazet@...gle.com>
To: Marek Majkowski <marek@...udflare.com>
Cc: Lawrence Brakmo <brakmo@...com>, netdev <netdev@...r.kernel.org>,
bpf <bpf@...r.kernel.org>,
kernel-team <kernel-team@...udflare.com>,
Ivan Babrou <ivan@...udflare.com>,
David Miller <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>
Subject: Re: [PATCH net-next 1/2] RTAX_INITRWND should be able to bring the
rcv_ssthresh above 64KiB
On Wed, Jul 27, 2022 at 1:19 PM Marek Majkowski <marek@...udflare.com> wrote:
>
> On Fri, Jul 22, 2022 at 11:23 AM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Thu, Jul 21, 2022 at 5:10 PM Marek Majkowski <marek@...udflare.com> wrote:
> > >
> > > We already support RTAX_INITRWND / initrwnd path attribute:
> > >
> > > $ ip route change local 127.0.0.0/8 dev lo initrwnd 1024
> > >
> > > However normally, the initial advertised receive window is limited to
> > > 64KiB by rcv_ssthresh, regardless of initrwnd. This patch changes
> > > that, bumping up rcv_ssthresh to value derived from initrwnd. This
> > > allows for larger initial advertised receive windows, which is useful
> > > for specific types of TCP flows: big BDP ones, where there is a lot of
> > > data to send immediately after the flow is established.
> > >
> > > There are three places where we initialize sockets:
> > > - tcp_output:tcp_connect_init
> > > - tcp_minisocks:tcp_openreq_init_rwin
> > > - syncookies
> > >
> > > In the first two we already have a call to `tcp_rwnd_init_bpf` and
> > > `dst_metric(RTAX_INITRWND)` which retrieve the bpf/path initrwnd
> > > attribute. We use this value to bring `rcv_ssthresh` up, potentially
> > > above the traditional 64KiB.
> > >
> > > With higher initial `rcv_ssthresh` the receiver will open the receive
> > > window more aggresively, which can improve large BDP flows - large
> > > throughput and latency.
> > >
> > > This patch does not cover the syncookies case.
> > >
> > > Signed-off-by: Marek Majkowski <marek@...udflare.com>
> > > ---
> > > include/net/inet_sock.h | 1 +
> > > net/ipv4/tcp_minisocks.c | 8 ++++++--
> > > net/ipv4/tcp_output.c | 10 ++++++++--
> > > 3 files changed, 15 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
> > > index daead5fb389a..bc68c9b70942 100644
> > > --- a/include/net/inet_sock.h
> > > +++ b/include/net/inet_sock.h
> > > @@ -89,6 +89,7 @@ struct inet_request_sock {
> > > no_srccheck: 1,
> > > smc_ok : 1;
> > > u32 ir_mark;
> > > + u32 rcv_ssthresh;
Please move this in struct tcp_request_sock
> >
> > Why do we need to store this value in the request_sock ?
> >
> > It is derived from a route attribute and MSS, all this should be
> > available when the full blown socket is created.
> >
> > It would also work even with syncookies.
>
> Eric,
>
> Thanks for the feedback. For some context, I published a blog post
> explaining this work in detail [1].
>
> https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/
>
> I understand the suggestion is to move tcp_rwnd_init_bpf +
> RTAX_INITRWND lookup from `tcp_openreq_init_rwin` into
> `tcp_create_openreq_child`.
>
> I gave it a try (patch: [2]), but I don't think this will work under
> all circumstances. The issue is that we need to advertise *some*
> window in the SYNACK packet, before creating the full blown socket.
>
> With RTAX_INITRWND it is possible to move the advertised window up, or
> down.
>
> In the latter case, of reducing the window, at the SYNACK moment we
> must know if the window is reduced under 64KiB. This is what happens
> right now, we can _reduce_ window with RTAX_INITRWND to small values,
> I guess down to 1 MSS. This smaller window is then advertised in the
> SYNACK.
>
> If we move RTAX_INITRWND lookup into the later
> `tcp_create_openreq_child` then it will be too late, we won't know the
> correct window size on SYNACK stage. We will likely end up sending
> large window on SYNACK and then a small window on subsequent ACK,
> violating TCP.
>
> There are two approaches here. First, keep the semantics and allow
> RTAX_INITRWND to _reduce_ the initial window.
>
> In this case there are four ways out of this:
>
> 1) Keep it as proposed, that indeed requires some new value in
> request_sock. (perhaps maybe it could be it smaller than u32)
>
> 2) Send the SYNACK with small/zero window, since we haven't done the
> initrwnd lookup at this stage, but that would be at least
> controversial, and also adds one more RTT to the common case. I don't
> think this is acceptable.
>
> 3) Do two initrwnd lookups. One in the `tcp_openreq_init_rwin` to
> figure out if the window is smaller than 64KiB, second one in
> `tcp_create_openreq_child` to figure out if the suggested window is
> larger than 64KiB.
I think syncookies can be handled, if you look at cookie_v6_check() &
cookie_v4_check()
after their calls to cookie_tcp_reqsk_alloc()
>
> 4) Abort the whole approach and recycle Ivan's
> bpf_setsockopt(TCP_BPF_RCV_SSTHRESH) approach [3]. But I prefer the route
> attribute approach, seems easier to use and more flexible.
>
> But, thinking about it, I don't think we could ever support reducing
> initial receive window in the syncookie case. Only (3) - two initrwnd
> lookups - could be made to work, but even that is controversial.
>
> However the intention of RTAX_INITRWND as far as I understand was to
> _increase_ rcv_ssthresh, back in the days when it started from 10MSS
> (so I was told).
That was before we fixed DRS and that we made initial RWIN 65535, the
max allowed value in a SYN , SYNACK packet.
But yes...
>
> So, we could change the semantics of RTAX_INITRWND to allow only
> *increasing* the window - and disallow reducing it. With such an
> approach indeed we could make the code as you suggested, and move the
> route attribute lookup away from minisocks into `tcp_create_openreq_child`.
>
> Marek
>
> [1] https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/
> [2] https://gist.github.com/majek/13848c050a3dc218ed295364ee717879
> [3] https://lore.kernel.org/bpf/20220111192952.49040-1-ivan@cloudflare.com/t/
Powered by blists - more mailing lists