Message-Id: <20250328151633.30007-3-kerneljasonxing@gmail.com>
Date: Fri, 28 Mar 2025 23:16:33 +0800
From: Jason Xing <kerneljasonxing@...il.com>
To: davem@...emloft.net,
	edumazet@...gle.com,
	kuba@...nel.org,
	pabeni@...hat.com,
	dsahern@...nel.org,
	horms@...nel.org,
	kuniyu@...zon.com,
	ncardwell@...gle.com
Cc: netdev@...r.kernel.org,
	Jason Xing <kernelxing@...cent.com>
Subject: [PATCH RFC net-next 2/2] tcp: introduce dynamic initcwnd adjustment

From: Jason Xing <kernelxing@...cent.com>

More than a decade ago, Google published a paper[1] describing how
different initcwnd values have different impacts. Three years later,
the default initcwnd was raised to 10[2] for general use. Nowadays,
however, more and more small features are being developed to target
particular use cases rather than all cases.

Some CDN teams even increase it to more than 100 for the uncontrollable
global network in order to speed up data transmission in the slow start
phase. In the data center, we need a similar change to ramp up slow
start, especially for the case where an application sends a small
amount of data, say, 50K at a time over a persistent connection. Asking
users to tune initcwnd with 'ip route' is not that practical because
1) it may affect unintended flows, and 2) a too-large global value may
cause bursts for all kinds of flows.

This patch adds a dynamic adjustment feature for initcwnd in the slow
start or slow-start-after-idle phase, so that it only accelerates the
first round trip and does not noticeably affect the bulk data transfer
case.

Use 65535 as an upper bound when calculating the proper initcwnd. This
number is derived from the case where an skb carries a 65535 window
when sending the SYN-ACK in __tcp_transmit_skb(). Without the clamp, a
passive open side that sends data may see a very large value from the
last ACK of the 3-WHS, say, 2699776, which could generate an initcwnd
of 1912 that is too big.

This patch helps accelerate the small data transfer case. I tested
transmitting 50K at one time and saw the time consumed drop from 1400us
to 80us. A 1750% delta!

The idea behind this is that I often see a small data transfer consume
more than 2 or 3 RTTs because of the limited snd_cwnd. In the data
center, we can afford the bandwidth if we choose to accelerate
transmission.

Why did I choose tp->max_window / tp->mss_cache? Because cwnd is
increased per MSS-sized packet, and max_window is the signal by which
the other side tells us the maximum capacity it can bear. As we can see
in tcp_set_skb_tso_segs(), tcp_gso_size is equal to the MSS.
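
As an illustrative calculation, using the 65535 clamp and the 1412 MSS
mentioned in the notes below (and assuming the peer's max_window is at
least 65535; TCP_INIT_CWND is 10):

	win       = min(tp->max_window, 65535)              = 65535
	init_cwnd = max(win / tp->mss_cache, TCP_INIT_CWND)
	          = max(65535 / 1412, 10)                   = 46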

[1]: https://developers.google.com/speed/protocols/tcp_initcwnd_techreport.pdf
[2]: https://datatracker.ietf.org/doc/html/rfc6928

Signed-off-by: Jason Xing <kernelxing@...cent.com>
---
I'm not sure what the upper bound of this window should be. Using 65535
as the max window generates an initcwnd of 46 with the 1412 MSS in my VM.
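
For reference, a minimal, illustrative userspace sketch of enabling the
option, assuming the TCP_IW_DYNAMIC value (48) added by this patch (the
fallback define below is only for illustration). Presumably it needs to
be set before the connection is established (e.g. before connect()) so
that tcp_init_transfer() can see it:

	#include <stdio.h>
	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <sys/socket.h>

	#ifndef TCP_IW_DYNAMIC
	#define TCP_IW_DYNAMIC 48	/* value added by this patch */
	#endif

	static int enable_dynamic_initcwnd(int fd)
	{
		int one = 1;

		/* Only 0 or 1 is accepted; other values get -EINVAL. */
		if (setsockopt(fd, IPPROTO_TCP, TCP_IW_DYNAMIC,
			       &one, sizeof(one)) < 0) {
			perror("setsockopt(TCP_IW_DYNAMIC)");
			return -1;
		}
		return 0;
	}
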
---
 include/linux/tcp.h      |  3 ++-
 include/uapi/linux/tcp.h |  1 +
 net/ipv4/tcp.c           |  8 ++++++++
 net/ipv4/tcp_input.c     | 11 +++++++++--
 4 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index aba0a1fe0e36..445db706f3cd 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -385,7 +385,8 @@ struct tcp_sock {
 		syn_fastopen:1,	/* SYN includes Fast Open option */
 		syn_fastopen_exp:1,/* SYN includes Fast Open exp. option */
 		syn_fastopen_ch:1, /* Active TFO re-enabling probe */
-		syn_data_acked:1;/* data in SYN is acked by SYN-ACK */
+		syn_data_acked:1,/* data in SYN is acked by SYN-ACK */
+		dynamic_initcwnd:1;  /* dynamic adjustment for initcwnd */
 
 	u8	keepalive_probes; /* num of allowed keep alive probes	*/
 	u32	tcp_tx_delay;	/* delay (in usec) added to TX packets */
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index acf77114efed..7c63d0d0b5e1 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -143,6 +143,7 @@ enum {
 #define TCP_RTO_MIN_US		45	/* min rto time in us */
 #define TCP_DELACK_MAX_US	46	/* max delayed ack time in us */
 #define TCP_IW			47	/* initial congestion window */
+#define TCP_IW_DYNAMIC         48      /* dynamic adjustment for initcwnd */
 
 #define TCP_REPAIR_ON		1
 #define TCP_REPAIR_OFF		0
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 9da7ece57b20..3d419a714f2d 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3868,6 +3868,11 @@ int do_tcp_setsockopt(struct sock *sk, int level, int optname,
 			return -EINVAL;
 		tp->init_cwnd = val;
 		return 0;
+	case TCP_IW_DYNAMIC:
+		if (val < 0 || val > 1)
+			return -EINVAL;
+		tp->dynamic_initcwnd = val;
+		return 0;
 	}
 
 	sockopt_lock_sock(sk);
@@ -4716,6 +4721,9 @@ int do_tcp_getsockopt(struct sock *sk, int level,
 	case TCP_IW:
 		val = tp->init_cwnd;
 		break;
+	case TCP_IW_DYNAMIC:
+		val = tp->dynamic_initcwnd;
+		break;
 	default:
 		return -ENOPROTOOPT;
 	}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 00cbe8970a1b..05dbec734aa5 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6341,10 +6341,17 @@ void tcp_init_transfer(struct sock *sk, int bpf_op, struct sk_buff *skb)
 	 * initRTO, we only reset cwnd when more than 1 SYN/SYN-ACK
 	 * retransmission has occurred.
 	 */
-	if (tp->total_retrans > 1 && tp->undo_marker)
+	if (tp->total_retrans > 1 && tp->undo_marker) {
 		tcp_snd_cwnd_set(tp, 1);
-	else
+	} else {
+		if (tp->dynamic_initcwnd) {
+			u32 win = min(tp->max_window, 65535);
+
+			tp->init_cwnd = max(win / tp->mss_cache, TCP_INIT_CWND);
+		}
+
 		tcp_snd_cwnd_set(tp, tcp_init_cwnd(tp, __sk_dst_get(sk)));
+	}
 	tp->snd_cwnd_stamp = tcp_jiffies32;
 
 	bpf_skops_established(sk, bpf_op, skb);
-- 
2.43.5

