netdev - Re: [PATCH] net: do not set SOCK_RCVBUF_LOCK if sk

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CANn89iL8vOUOH9bZaiA-cKcms+PotuKCxv7LpVx3RF0dDDSnmg@mail.gmail.com>
Date:   Tue, 15 Feb 2022 07:25:31 -0800
From:   Eric Dumazet <edumazet@...gle.com>
To:     Jason Xing <kerneljasonxing@...il.com>
Cc:     David Miller <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>,
        Alexei Starovoitov <ast@...nel.org>,
        Daniel Borkmann <daniel@...earbox.net>,
        Andrii Nakryiko <andrii@...nel.org>,
        Martin KaFai Lau <kafai@...com>,
        Song Liu <songliubraving@...com>, Yonghong Song <yhs@...com>,
        John Fastabend <john.fastabend@...il.com>,
        KP Singh <kpsingh@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
        Wei Wang <weiwan@...gle.com>,
        Alexander Aring <aahringo@...hat.com>,
        Yangbo Lu <yangbo.lu@....com>, Florian Westphal <fw@...len.de>,
        Tonghao Zhang <xiangxia.m.yue@...il.com>,
        Thomas Gleixner <tglx@...utronix.de>,
        netdev <netdev@...r.kernel.org>,
        LKML <linux-kernel@...r.kernel.org>, bpf <bpf@...r.kernel.org>,
        Jason Xing <xingwanli@...ishou.com>
Subject: Re: [PATCH] net: do not set SOCK_RCVBUF_LOCK if sk_rcvbuf isn't reduced

On Tue, Feb 15, 2022 at 2:37 AM <kerneljasonxing@...il.com> wrote:
>
> From: Jason Xing <xingwanli@...ishou.com>
>
> Normally, user doesn't care the logic behind the kernel if they're
> trying to set receive buffer via setsockopt. However, if the new value
> of the receive buffer is not smaller than the initial value which is
> sysctl_tcp_rmem[1] implemented in tcp_rcv_space_adjust(), the server's
> wscale will shrink and then lead to the bad bandwidth. I think it is
> not appropriate.

Then do not use SO_RCVBUF ?

It is working as intended really.

>
> Here are some numbers:
> $ sysctl -a | grep rmem
> net.core.rmem_default = 212992
> net.core.rmem_max = 40880000
> net.ipv4.tcp_rmem = 4096        425984  40880000
>
> Case 1
> on the server side
>     # iperf -s -p 5201
> on the client side
>     # iperf -c [client ip] -p 5201
> It turns out that the bandwidth is 9.34 Gbits/sec while the wscale of
> server side is 10. It's good.
>
> Case 2
> on the server side
>     #iperf -s -p 5201 -w 425984
> on the client side
>     # iperf -c [client ip] -p 5201
> It turns out that the bandwidth is reduced to 2.73 Gbits/sec while the
> wcale is 2, even though the receive buffer is not changed at all at the
> very beginning.

Great, you discovered auto tuning is working as intended.

>
> Therefore, I added one condition where only user is trying to set a
> smaller rx buffer. After this patch is applied, the bandwidth of case 2
> is recovered to 9.34 Gbits/sec.
>
> Fixes: e88c64f0a425 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt")

This commit has nothing to do with your patch or feature.

> Signed-off-by: Jason Xing <xingwanli@...ishou.com>
> ---
>  net/core/filter.c | 7 ++++---
>  net/core/sock.c   | 8 +++++---
>  2 files changed, 9 insertions(+), 6 deletions(-)
>
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 4603b7c..99f5d9c 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -4795,9 +4795,10 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
>                 case SO_RCVBUF:
>                         val = min_t(u32, val, sysctl_rmem_max);
>                         val = min_t(int, val, INT_MAX / 2);
> -                       sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> -                       WRITE_ONCE(sk->sk_rcvbuf,
> -                                  max_t(int, val * 2, SOCK_MIN_RCVBUF));
> +                       val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> +                       if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
> +                               sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> +                       WRITE_ONCE(sk->sk_rcvbuf, val);
>                         break;
>                 case SO_SNDBUF:
>                         val = min_t(u32, val, sysctl_wmem_max);
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 4ff806d..e5e9cb0 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -923,8 +923,6 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
>          * as a negative value.
>          */
>         val = min_t(int, val, INT_MAX / 2);
> -       sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> -
>         /* We double it on the way in to account for "struct sk_buff" etc.
>          * overhead.   Applications assume that the SO_RCVBUF setting they make
>          * will allow that much actual data to be received on that socket.
> @@ -935,7 +933,11 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
>          * And after considering the possible alternatives, returning the value
>          * we actually used in getsockopt is the most desirable behavior.
>          */
> -       WRITE_ONCE(sk->sk_rcvbuf, max_t(int, val * 2, SOCK_MIN_RCVBUF));
> +       val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
> +       if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
> +               sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
> +
> +       WRITE_ONCE(sk->sk_rcvbuf, val);
>  }
>
>  void sock_set_rcvbuf(struct sock *sk, int val)

You are breaking applications that want to set sk->sk_rcvbuf  to a fixed value,
to control memory usage on millions of active sockets in a host.

I think that you want new functionality, with new SO_ socket options,
targeting net-next tree (No spurious FIxes: tag)

For instance letting an application set  or unset  SOCK_RCVBUF_LOCK
would be more useful, and would not break applications.