Message-Id: <20220216050320.3222-1-kerneljasonxing@gmail.com>
Date:   Wed, 16 Feb 2022 13:03:20 +0800
From:   kerneljasonxing@...il.com
To:     davem@...emloft.net, kuba@...nel.org, ast@...nel.org,
        daniel@...earbox.net, andrii@...nel.org, kafai@...com,
        songliubraving@...com, yhs@...com, john.fastabend@...il.com,
        kpsingh@...nel.org, edumazet@...gle.com, pabeni@...hat.com,
        weiwan@...gle.com, aahringo@...hat.com, yangbo.lu@....com,
        fw@...len.de, xiangxia.m.yue@...il.com, tglx@...utronix.de
Cc:     netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
        bpf@...r.kernel.org, kerneljasonxing@...il.com,
        Jason Xing <xingwanli@...ishou.com>
Subject: [PATCH v2 net-next] net: introduce SO_RCVBUFAUTO to let the rcv_buf tune automatically

From: Jason Xing <xingwanli@...ishou.com>

Normally, users do not care about the kernel's internal logic when
they set the receive buffer via setsockopt. However, once a new
receive buffer value is set, even if it is not smaller than the
initial value (sysctl_tcp_rmem[1], as used in
tcp_rcv_space_adjust()), the server's wscale shrinks, which in turn
leads to poor bandwidth.

Introduce a new socket option that lets the receive buffer grow
automatically no matter what the new value is. This solves the
bandwidth issue without breaking applications that already set the
SO_RCVBUF option.
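
From the application side, using the proposed option might look like the
sketch below. It assumes a kernel with this patch applied. The fallback
define of 74 is copied from the uapi hunk of this patch, and
set_rcvbuf_prefer_auto() is a hypothetical helper name. On a kernel
without the patch, option number 74 may be unassigned or may belong to a
different option, so the fallback to SO_RCVBUF is only a rough convenience.

```c
#include <sys/socket.h>

/* Assumption: this value comes from the patch under review and is only
 * meaningful on a kernel that carries it.
 */
#ifndef SO_RCVBUFAUTO
#define SO_RCVBUFAUTO 74
#endif

/* Hypothetical helper: ask for an initial receive buffer of `bytes`
 * without locking it, so autotuning can still grow it later.
 *
 * Returns 0 if SO_RCVBUFAUTO was accepted, 1 if the call fell back to
 * plain SO_RCVBUF (which sets SOCK_RCVBUF_LOCK), and -1 if both failed
 * (errno is set by the last setsockopt call).
 */
static int set_rcvbuf_prefer_auto(int fd, int bytes)
{
	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUFAUTO,
		       &bytes, sizeof(bytes)) == 0)
		return 0;

	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
		       &bytes, sizeof(bytes)) == 0)
		return 1;

	return -1;
}
```

A server would call set_rcvbuf_prefer_auto(listen_fd, 425984) before
listen(), in the same place where it calls setsockopt(SO_RCVBUF) today.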

Here are some numbers:
$ sysctl -a | grep rmem
net.core.rmem_default = 212992
net.core.rmem_max = 40880000
net.ipv4.tcp_rmem = 4096	425984	40880000

Case 1
on the server side
    # iperf -s -p 5201
on the client side
    # iperf -c [server ip] -p 5201
The bandwidth is 9.34 Gbits/sec and the wscale of the server side is
10, which is good.

Case 2
on the server side
    # iperf -s -p 5201 -w 425984
on the client side
    # iperf -c [server ip] -p 5201
The bandwidth drops to 2.73 Gbits/sec and the wscale is 2, even though
the requested value equals tcp_rmem[1], so the initial receive buffer
is not changed at all.

After this patch is applied, the bandwidth in case 2 recovers to
9.34 Gbits/sec as expected, at the cost of consuming more memory per
socket.
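
The wscale of 10 in case 1 can be reproduced with a small calculation.
The sketch below assumes the scale is chosen once at connection setup as
floor(log2(space)) - 15, clamped to the range 0..14, which is how recent
kernels appear to compute it in tcp_select_initial_window(). The helper
names are hypothetical and the model ignores everything except the clamp
on `space`.

```c
#include <stdint.h>

#define TCP_MAX_WSCALE 14	/* upper bound from RFC 7323 */

/* floor(log2(v)) for v > 0; returns 0 for v == 0 or v == 1. */
static int ilog2_u32(uint32_t v)
{
	int n = 0;

	while (v >>= 1)
		n++;
	return n;
}

/* Hypothetical model: `space` is the largest receive window, in bytes,
 * that the socket is allowed to advertise over its lifetime.
 */
static int rcv_wscale_for_space(uint32_t space)
{
	int ws = ilog2_u32(space) - 15;

	if (ws < 0)
		ws = 0;
	if (ws > TCP_MAX_WSCALE)
		ws = TCP_MAX_WSCALE;
	return ws;
}
```

With space = 40880000 (the rmem_max and tcp_rmem[2] values shown above),
rcv_wscale_for_space() returns 10, matching case 1. When the buffer is
locked, `space` is instead limited by a window derived from the fixed
buffer size, so the resulting scale is much smaller.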

Signed-off-by: Jason Xing <xingwanli@...ishou.com>
--
v2: suggested by Eric
- introduce new socket option instead of breaking the logic in SO_RCVBUF
- Adjust the title and description of this patch
link: https://lore.kernel.org/lkml/CANn89iL8vOUOH9bZaiA-cKcms+PotuKCxv7LpVx3RF0dDDSnmg@mail.gmail.com/
---
 include/uapi/asm-generic/socket.h |  1 +
 net/core/sock.c                   | 18 +++++++++++++-----
 2 files changed, 14 insertions(+), 5 deletions(-)

diff --git a/include/uapi/asm-generic/socket.h b/include/uapi/asm-generic/socket.h
index c77a131..f4ce79b 100644
--- a/include/uapi/asm-generic/socket.h
+++ b/include/uapi/asm-generic/socket.h
@@ -18,6 +18,7 @@
 #define SO_RCVBUF	8
 #define SO_SNDBUFFORCE	32
 #define SO_RCVBUFFORCE	33
+#define SO_RCVBUFAUTO	74
 #define SO_KEEPALIVE	9
 #define SO_OOBINLINE	10
 #define SO_NO_CHECK	11
diff --git a/net/core/sock.c b/net/core/sock.c
index 4ff806d..8565684 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -917,13 +917,14 @@ void sock_set_keepalive(struct sock *sk)
 }
 EXPORT_SYMBOL(sock_set_keepalive);
 
-static void __sock_set_rcvbuf(struct sock *sk, int val)
+static void __sock_set_rcvbuf(struct sock *sk, int val, bool is_set)
 {
 	/* Ensure val * 2 fits into an int, to prevent max_t() from treating it
 	 * as a negative value.
 	 */
 	val = min_t(int, val, INT_MAX / 2);
-	sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
+	if (is_set)
+		sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
 
 	/* We double it on the way in to account for "struct sk_buff" etc.
 	 * overhead.   Applications assume that the SO_RCVBUF setting they make
@@ -941,7 +942,7 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
 void sock_set_rcvbuf(struct sock *sk, int val)
 {
 	lock_sock(sk);
-	__sock_set_rcvbuf(sk, val);
+	__sock_set_rcvbuf(sk, val, true);
 	release_sock(sk);
 }
 EXPORT_SYMBOL(sock_set_rcvbuf);
@@ -1106,7 +1107,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 		 * play 'guess the biggest size' games. RCVBUF/SNDBUF
 		 * are treated in BSD as hints
 		 */
-		__sock_set_rcvbuf(sk, min_t(u32, val, sysctl_rmem_max));
+		__sock_set_rcvbuf(sk, min_t(u32, val, sysctl_rmem_max), true);
 		break;
 
 	case SO_RCVBUFFORCE:
@@ -1118,7 +1119,14 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 		/* No negative values (to prevent underflow, as val will be
 		 * multiplied by 2).
 		 */
-		__sock_set_rcvbuf(sk, max(val, 0));
+		__sock_set_rcvbuf(sk, max(val, 0), true);
+		break;
+
+	case SO_RCVBUFAUTO:
+		/* Though similar to SO_RCVBUF, we do not use userlocks in
+		 * order to let the receive buffer tune automatically.
+		 */
+		__sock_set_rcvbuf(sk, min_t(u32, val, sysctl_rmem_max), false);
 		break;
 
 	case SO_KEEPALIVE:
-- 
1.8.3.1