lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Wed, 27 Jul 2022 13:19:08 +0200
From:   Marek Majkowski <marek@...udflare.com>
To:     Eric Dumazet <edumazet@...gle.com>, brakmo@...com
Cc:     netdev <netdev@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
        kernel-team <kernel-team@...udflare.com>,
        Ivan Babrou <ivan@...udflare.com>,
        David Miller <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Paolo Abeni <pabeni@...hat.com>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>
Subject: Re: [PATCH net-next 1/2] RTAX_INITRWND should be able to bring the
 rcv_ssthresh above 64KiB

On Fri, Jul 22, 2022 at 11:23 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Thu, Jul 21, 2022 at 5:10 PM Marek Majkowski <marek@...udflare.com> wrote:
> >
> > We already support RTAX_INITRWND / initrwnd path attribute:
> >
> >  $ ip route change local 127.0.0.0/8 dev lo initrwnd 1024
> >
> > However normally, the initial advertised receive window is limited to
> > 64KiB by rcv_ssthresh, regardless of initrwnd. This patch changes
> > that, bumping up rcv_ssthresh to value derived from initrwnd. This
> > allows for larger initial advertised receive windows, which is useful
> > for specific types of TCP flows: big BDP ones, where there is a lot of
> > data to send immediately after the flow is established.
> >
> > There are three places where we initialize sockets:
> >  - tcp_output:tcp_connect_init
> >  - tcp_minisocks:tcp_openreq_init_rwin
> >  - syncookies
> >
> > In the first two we already have a call to `tcp_rwnd_init_bpf` and
> > `dst_metric(RTAX_INITRWND)` which retrieve the bpf/path initrwnd
> > attribute. We use this value to bring `rcv_ssthresh` up, potentially
> > above the traditional 64KiB.
> >
> > With higher initial `rcv_ssthresh` the receiver will open the receive
> > window more aggresively, which can improve large BDP flows - large
> > throughput and latency.
> >
> > This patch does not cover the syncookies case.
> >
> > Signed-off-by: Marek Majkowski <marek@...udflare.com>
> > ---
> >  include/net/inet_sock.h  |  1 +
> >  net/ipv4/tcp_minisocks.c |  8 ++++++--
> >  net/ipv4/tcp_output.c    | 10 ++++++++--
> >  3 files changed, 15 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
> > index daead5fb389a..bc68c9b70942 100644
> > --- a/include/net/inet_sock.h
> > +++ b/include/net/inet_sock.h
> > @@ -89,6 +89,7 @@ struct inet_request_sock {
> >                                 no_srccheck: 1,
> >                                 smc_ok     : 1;
> >         u32                     ir_mark;
> > +       u32                     rcv_ssthresh;
>
> Why do we need to store this value in the request_sock ?
>
> It is derived from a route attribute and MSS, all this should be
> available when the full blown socket is created.
>
> It would also work even with syncookies.

Eric,

Thanks for the feedback. For some context, I published a blog post
explaining this work in detail [1].

https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/

I understand the suggestion is to move tcp_rwnd_init_bpf +
RTAX_INITRWND lookup from `tcp_openreq_init_rwin` into
`tcp_create_openreq_child`.

I gave it a try (patch: [2]), but I don't think this will work under
all circumstances. The issue is that we need to advertise *some*
window in the SYNACK packet, before creating the full blown socket.

With RTAX_INITRWND it is possible to move the advertised window up, or
down.

In the latter case, of reducing the window, at the SYNACK moment we
must know if the window is reduced under 64KiB. This is what happens
right now, we can _reduce_ window with RTAX_INITRWND to small values,
I guess down to 1 MSS. This smaller window is then advertised in the
SYNACK.

If we move RTAX_INITRWND lookup into the later
`tcp_create_openreq_child` then it will be too late, we won't know the
correct window size on SYNACK stage. We will likely end up sending
large window on SYNACK and then a small window on subsequent ACK,
violating TCP.

There are two approaches here. First, keep the semantics and allow
RTAX_INITRWND to _reduce_ the initial window.

In this case there are four ways out of this:

1) Keep it as proposed, that indeed requires some new value in
request_sock. (perhaps maybe it could be it smaller than u32)

2) Send the SYNACK with small/zero window, since we haven't done the
initrwnd lookup at this stage, but that would be at least
controversial, and also adds one more RTT to the common case. I don't
think this is acceptable.

3) Do two initrwnd lookups. One in the `tcp_openreq_init_rwin` to
figure out if the window is smaller than 64KiB, second one in
`tcp_create_openreq_child` to figure out if the suggested window is
larger than 64KiB.

4) Abort the whole approach and recycle Ivan's
bpf_setsockopt(TCP_BPF_RCV_SSTHRESH) approach [3]. But I prefer the route
attribute approach, seems easier to use and more flexible.

But, thinking about it, I don't think we could ever support reducing
initial receive window in the syncookie case. Only (3) - two initrwnd
lookups - could be made to work, but even that is controversial.

However the intention of RTAX_INITRWND as far as I understand was to
_increase_ rcv_ssthresh, back in the days when it started from 10MSS
(so I was told).

So, we could change the semantics of RTAX_INITRWND to allow only
*increasing* the window - and disallow reducing it. With such an
approach indeed we could make the code as you suggested, and move the
route attribute lookup away from minisocks into `tcp_create_openreq_child`.

Marek

[1] https://blog.cloudflare.com/when-the-window-is-not-fully-open-your-tcp-stack-is-doing-more-than-you-think/
[2] https://gist.github.com/majek/13848c050a3dc218ed295364ee717879
[3] https://lore.kernel.org/bpf/20220111192952.49040-1-ivan@cloudflare.com/t/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ