linux-kernel - [PATCH] net: do not set SOCK_RCVBUF_LOCK if sk

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-Id: <20220215103639.11739-1-kerneljasonxing@gmail.com>
Date:   Tue, 15 Feb 2022 18:36:39 +0800
From:   kerneljasonxing@...il.com
To:     davem@...emloft.net, kuba@...nel.org, ast@...nel.org,
        daniel@...earbox.net, andrii@...nel.org, kafai@...com,
        songliubraving@...com, yhs@...com, john.fastabend@...il.com,
        kpsingh@...nel.org, edumazet@...gle.com, pabeni@...hat.com,
        weiwan@...gle.com, aahringo@...hat.com, yangbo.lu@....com,
        fw@...len.de, xiangxia.m.yue@...il.com, tglx@...utronix.de
Cc:     netdev@...r.kernel.org, linux-kernel@...r.kernel.org,
        bpf@...r.kernel.org, kerneljasonxing@...il.com,
        Jason Xing <xingwanli@...ishou.com>
Subject: [PATCH] net: do not set SOCK_RCVBUF_LOCK if sk_rcvbuf isn't reduced

From: Jason Xing <xingwanli@...ishou.com>

Normally, user doesn't care the logic behind the kernel if they're
trying to set receive buffer via setsockopt. However, if the new value
of the receive buffer is not smaller than the initial value which is
sysctl_tcp_rmem[1] implemented in tcp_rcv_space_adjust(), the server's
wscale will shrink and then lead to the bad bandwidth. I think it is
not appropriate.

Here are some numbers:
$ sysctl -a | grep rmem
net.core.rmem_default = 212992
net.core.rmem_max = 40880000
net.ipv4.tcp_rmem = 4096	425984	40880000

Case 1
on the server side
    # iperf -s -p 5201
on the client side
    # iperf -c [client ip] -p 5201
It turns out that the bandwidth is 9.34 Gbits/sec while the wscale of
server side is 10. It's good.

Case 2
on the server side
    #iperf -s -p 5201 -w 425984
on the client side
    # iperf -c [client ip] -p 5201
It turns out that the bandwidth is reduced to 2.73 Gbits/sec while the
wcale is 2, even though the receive buffer is not changed at all at the
very beginning.

Therefore, I added one condition where only user is trying to set a
smaller rx buffer. After this patch is applied, the bandwidth of case 2
is recovered to 9.34 Gbits/sec.

Fixes: e88c64f0a425 ("tcp: allow effective reduction of TCP's rcv-buffer via setsockopt")
Signed-off-by: Jason Xing <xingwanli@...ishou.com>
---
 net/core/filter.c | 7 ++++---
 net/core/sock.c   | 8 +++++---
 2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 4603b7c..99f5d9c 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4795,9 +4795,10 @@ static int _bpf_setsockopt(struct sock *sk, int level, int optname,
 		case SO_RCVBUF:
 			val = min_t(u32, val, sysctl_rmem_max);
 			val = min_t(int, val, INT_MAX / 2);
-			sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
-			WRITE_ONCE(sk->sk_rcvbuf,
-				   max_t(int, val * 2, SOCK_MIN_RCVBUF));
+			val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
+			if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
+				sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
+			WRITE_ONCE(sk->sk_rcvbuf, val);
 			break;
 		case SO_SNDBUF:
 			val = min_t(u32, val, sysctl_wmem_max);
diff --git a/net/core/sock.c b/net/core/sock.c
index 4ff806d..e5e9cb0 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -923,8 +923,6 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
 	 * as a negative value.
 	 */
 	val = min_t(int, val, INT_MAX / 2);
-	sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
-
 	/* We double it on the way in to account for "struct sk_buff" etc.
 	 * overhead.   Applications assume that the SO_RCVBUF setting they make
 	 * will allow that much actual data to be received on that socket.
@@ -935,7 +933,11 @@ static void __sock_set_rcvbuf(struct sock *sk, int val)
 	 * And after considering the possible alternatives, returning the value
 	 * we actually used in getsockopt is the most desirable behavior.
 	 */
-	WRITE_ONCE(sk->sk_rcvbuf, max_t(int, val * 2, SOCK_MIN_RCVBUF));
+	val = max_t(int, val * 2, SOCK_MIN_RCVBUF);
+	if (val < sock_net(sk)->ipv4.sysctl_tcp_rmem[1])
+		sk->sk_userlocks |= SOCK_RCVBUF_LOCK;
+
+	WRITE_ONCE(sk->sk_rcvbuf, val);
 }
 
 void sock_set_rcvbuf(struct sock *sk, int val)
-- 
1.8.3.1