lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CADVnQynUV+gqdH4gKhYKRqW_MpQ5gvMo5n=HHCa08uQ8wBbF_A@mail.gmail.com>
Date: Mon, 29 Jul 2024 11:32:40 -0400
From: Neal Cardwell <ncardwell@...gle.com>
To: Eric Dumazet <edumazet@...gle.com>
Cc: Subash Abhinov Kasiviswanathan <quic_subashab@...cinc.com>, soheil@...gle.com, yyd@...gle.com, 
	ycheng@...gle.com, davem@...emloft.net, kuba@...nel.org, 
	netdev@...r.kernel.org, dsahern@...nel.org, pabeni@...hat.com, 
	Sean Tranchetti <quic_stranche@...cinc.com>
Subject: Re: [PATCH net v2] tcp: Adjust clamping window for applications
 specifying SO_RCVBUF

On Mon, Jul 29, 2024 at 11:19 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Mon, Jul 29, 2024 at 4:51 PM Neal Cardwell <ncardwell@...gle.com> wrote:
> >
> > On Fri, Jul 26, 2024 at 4:41 PM Subash Abhinov Kasiviswanathan
> > <quic_subashab@...cinc.com> wrote:
> > >
> > > tp->scaling_ratio is not updated based on skb->len/skb->truesize once
> > > SO_RCVBUF is set leading to the maximum window scaling to be 25% of
> > > rcvbuf after
> > > commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
> > > and 50% of rcvbuf after
> > > commit 697a6c8cec03 ("tcp: increase the default TCP scaling ratio").
> > > 50% tries to emulate the behavior of older kernels using
> > > sysctl_tcp_adv_win_scale with default value.
> > >
> > > Systems which were using a different values of sysctl_tcp_adv_win_scale
> > > in older kernels ended up seeing reduced download speeds in certain
> > > cases as covered in https://lists.openwall.net/netdev/2024/05/15/13
> > > While the sysctl scheme is no longer acceptable, the value of 50% is
> > > a bit conservative when the skb->len/skb->truesize ratio is later
> > > determined to be ~0.66.
> > >
> > > Applications not specifying SO_RCVBUF update the window scaling and
> > > the receiver buffer every time data is copied to userspace. This
> > > computation is now used for applications setting SO_RCVBUF to update
> > > the maximum window scaling while ensuring that the receive buffer
> > > is within the application specified limit.
> > >
> > > Fixes: dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale")
> > > Signed-off-by: Sean Tranchetti <quic_stranche@...cinc.com>
> > > Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@...cinc.com>
> > > ---
> > > v1 -> v2
> > >   Update the condition for SO_RCVBUF window_clamp updates to always
> > >   monitor the current rcvbuf value as suggested by Eric.
> > >
> > >  net/ipv4/tcp_input.c | 23 ++++++++++++++++-------
> > >  1 file changed, 16 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > index 454362e359da..e2b9583ed96a 100644
> > > --- a/net/ipv4/tcp_input.c
> > > +++ b/net/ipv4/tcp_input.c
> > > @@ -754,8 +754,7 @@ void tcp_rcv_space_adjust(struct sock *sk)
> > >          * <prev RTT . ><current RTT .. ><next RTT .... >
> > >          */
> > >
> > > -       if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf) &&
> > > -           !(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) {
> > > +       if (READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf)) {
> > >                 u64 rcvwin, grow;
> > >                 int rcvbuf;
> > >
> > > @@ -771,12 +770,22 @@ void tcp_rcv_space_adjust(struct sock *sk)
> > >
> > >                 rcvbuf = min_t(u64, tcp_space_from_win(sk, rcvwin),
> > >                                READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_rmem[2]));
> > > -               if (rcvbuf > sk->sk_rcvbuf) {
> > > -                       WRITE_ONCE(sk->sk_rcvbuf, rcvbuf);
> > > +               if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) {
> > > +                       if (rcvbuf > sk->sk_rcvbuf) {
> > > +                               WRITE_ONCE(sk->sk_rcvbuf, rcvbuf);
> > >
> > > -                       /* Make the window clamp follow along.  */
> > > -                       WRITE_ONCE(tp->window_clamp,
> > > -                                  tcp_win_from_space(sk, rcvbuf));
> > > +                               /* Make the window clamp follow along.  */
> > > +                               WRITE_ONCE(tp->window_clamp,
> > > +                                          tcp_win_from_space(sk, rcvbuf));
> > > +                       }
> > > +               } else {
> > > +                       /* Make the window clamp follow along while being bounded
> > > +                        * by SO_RCVBUF.
> > > +                        */
> > > +                       int clamp = tcp_win_from_space(sk, min(rcvbuf, sk->sk_rcvbuf));
> > > +
> > > +                       if (clamp > tp->window_clamp)
> > > +                               WRITE_ONCE(tp->window_clamp, clamp);
> > >                 }
> > >         }
> > >         tp->rcvq_space.space = copied;
> > > --
> >
> > Is this the correct place to put this new code to update
> > tp->window_clamp? AFAICT it's not the correct place.
> >
> > If a system administrator has disabled receive buffer autotuning by
> > setting `sysctl net.ipv4.tcp_moderate_rcvbuf=0`, or if (copied <=
> > tp->rcvq_space.space), then TCP connections will not reach this new
> > code, and the window_clamp will not be adjusted, and the receive
> > window will still be too low.
> >
> > Even if a system administrator has disabled receive buffer autotuning
> > by setting `sysctl net.ipv4.tcp_moderate_rcvbuf=0`, or even if (copied
> > <= tp->rcvq_space.space), AFAICT we still want the correct receive
> > window value for whatever sk->sk_rcvbuf we have, based on the correct
> > tp->scaling_ratio.
> >
> > So AFAICT the correct place to put this kind of logic is in
> > tcp_measure_rcv_mss(). If we compute a new scaling_ratio and it's
> > different than tp->scaling_ratio, then it seems we should compute a
> > new window_clamp value using sk->sk_rcvbuf, and if the new
> > window_clamp value is different then we should WRITE_ONCE that value
> > into tp->window_clamp.
> >
> > That way we can have the correct tp->window_clamp, no matter the value
> > of net.ipv4.tcp_moderate_rcvbuf, and even if (copied <=
> > tp->rcvq_space.space).
> >
> > How does that sound?
>
> Can this be done without adding new code in the fast path ?

I was imagining that the code would not really be in the fast path,
because it would move to the spot in tcp_measure_rcv_mss() where the
segment length "len" is greater than icsk->icsk_ack.rcv_mss. I imagine
that should be rare, and we are already doing a somewhat expensive
do_div() call there, so I was imagining that the additional cost would
be relatively rare and small?

> Otherwise, I feel that we send a wrong signal to 'administrators' :
> "We will maintain code to make sure that wrong sysctls settings were
> not so wrong."
>
> Are you aware of anyone changing net.ipv4.tcp_moderate_rcvbuf for any
> valid reason ?

No, I'm not aware of any valid reason to disable
net.ipv4.tcp_moderate_rcvbuf. :-)

Even if tcp_moderate_rcvbuf is enabled, AFAICT there is still the
issue that if (copied <= tp->rcvq_space.space) we will not get to this
code...

neal

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ