Message-ID: <83806014-4e57-4974-b188-14c87a4cef8f@nvidia.com>
Date: Thu, 13 Feb 2025 16:45:16 +0200
From: Shahar Shitrit <shshitrit@...dia.com>
To: Eric Dumazet <edumazet@...gle.com>, "David S . Miller"
 <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
 Paolo Abeni <pabeni@...hat.com>
Cc: netdev@...r.kernel.org, Neal Cardwell <ncardwell@...gle.com>,
 Kevin Yang <yyd@...gle.com>, eric.dumazet@...il.com,
 Gal Pressman <gal@...dia.com>, Tariq Toukan <tariqt@...dia.com>
Subject: Re: [PATCH net-next 3/3] tcp: try to send bigger TSO packets

Hello,

I'm troubleshooting an issue and would appreciate your input.

The problem occurs when the SYNPROXY iptables extension is configured
on the server and net.core.rmem_max is set to 512 MB on that same
server. The combination of these two settings causes a significant
performance drop: the iperf3 bitrate falls from roughly 30 Gbps to
around 5 Gbps.

Here are some key points from my investigation:
• When either of these configurations is applied independently, there is
no noticeable impact on performance. The issue only arises when they are
used together.
• The issue persists even when TSO, GSO, and GRO are disabled on both sides.
• The issue also persists with different congestion control algorithms.
• In the pcap, I observe that the server's window size remains small (it
only increases up to 9728 bytes, compared to around 64KB in normal traffic).
• In the tcp_select_window() function, I noticed that increasing
rmem_max causes tp->rx_opt.rcv_wscale to become larger (14 instead of
the default 7). This, in turn, reduces the window value returned by the
function, because it is shifted right by tp->rx_opt.rcv_wscale (see the
small illustration after this list). Additionally, sk->sk_rcvbuf stays
stuck at its initial value (tcp_rmem[1]), whereas with normal traffic it
grows throughout the test. Similarly, sk->sk_backlog.len and
sk->sk_rmem_alloc do not increase and remain at 0 for most of the traffic.
• It appears that there may be an issue with the server’s ability to
receive the skbs, which could explain why sk->sk_rmem_alloc doesn’t grow.
• Based on the iptables counters, SYNPROXY does not appear to be
processing more packets than expected.
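
To make the wscale effect concrete, here is a small illustration (my own
sketch, not kernel code; it only assumes that the header window field is
the free space shifted right by rcv_wscale, as tcp_select_window() does
before returning):

	/* Illustration only: granularity of the advertised window. */
	unsigned int free_space = 60000;
	unsigned int hdr7  = free_space >> 7;	/* 468 -> peer sees 468 << 7  = 59904 bytes */
	unsigned int hdr14 = free_space >> 14;	/*   3 -> peer sees   3 << 14 = 49152 bytes */

So with rcv_wscale = 14 the advertised window only moves in 16 KB steps,
and with sk_rcvbuf stuck at tcp_rmem[1] the free space stays small
enough that the offered window remains tiny.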

Additionally, with a kernel version containing the commit below, the
traffic performance worsens even further, dropping to 95 Kbps. As
observed in the pcap, the server's window size remains at 512 bytes
until it sends a RST. Moreover, from a certain point there is a 4 ms
delay in the server's ACKs that persists until the RST. No
retransmissions are observed.
One indicator of the issue is that the TSO counters don't increment and
remain at 0, which is how we initially identified the problem.
I'm still not sure how the described issue is connected to this commit.


I would appreciate any insights you might have on this issue, as well as
suggestions for further investigation.

Steps to reproduce:

# server:
ifconfig eth2 1.1.1.1

sysctl -w net.netfilter.nf_conntrack_tcp_loose=0
iptables -t raw -I PREROUTING -i eth2 -w 2 -p tcp -m tcp --syn -j CT
--notrack
iptables -A INPUT -i eth2 -w 2 -p tcp -m tcp -m state --state
INVALID,UNTRACKED -j SYNPROXY --sack-perm --timestamp --wscale 7 --mss 1460

echo '536870912' > /proc/sys/net/core/rmem_max

iperf3 -B 1.1.1.1 -s

# client:
ifconfig eth2 1.1.1.2

iperf3 -B 1.1.1.2 -c 1.1.1.1
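
As a side note on the rmem_max value above, here is my rough sketch of
why 512 MB ends up with rcv_wscale = 14 (it approximates the scale
selection loop in tcp_select_initial_window(); it is not the exact
kernel code and ignores the window clamping details):

	/* Pick the smallest scale so the buffer fits the 16-bit
	 * window field, capped at TCP_MAX_WSCALE (14).
	 */
	int wscale = 0;
	unsigned int space = 536870912;	/* rmem_max = 512 MB */

	while (space > 65535U && wscale < 14) {
		space >>= 1;
		wscale++;
	}
	/* 512 MB -> wscale = 14; ~6 MB (default tcp_rmem[2]) -> wscale = 7 */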


If needed, I will send the pcaps.

Thank you,
Shahar Shitrit

On 19/04/2024 0:46, Eric Dumazet wrote:
> While investigating TCP performance, I found that TCP would
> sometimes send big skbs followed by a single MSS skb,
> in a 'locked' pattern.
> 
> For instance, BIG TCP is enabled, MSS is set to have 4096 bytes
> of payload per segment. gso_max_size is set to 181000.
> 
> This means that an optimal TCP packet size should contain
> 44 * 4096 = 180224 bytes of payload,
> 
> However, I was seeing packet sizes interleaved in this pattern:
> 
> 172032, 8192, 172032, 8192, 172032, 8192, <repeat>
> 
> tcp_tso_should_defer() heuristic is defeated, because after a split of
> a packet in write queue for whatever reason (this might be a too small
> CWND or a small enough pacing_rate),
> the leftover packet in the queue is smaller than the optimal size.
> 
> It is time to try to make 'leftover packets' bigger so that
> tcp_tso_should_defer() can give its full potential.
> 
> After this patch, we can see the following output:
> 
> 14:13:34.009273 IP6 sender > receiver: Flags [P.], seq 4048380:4098360, ack 1, win 256, options [nop,nop,TS val 3425678144 ecr 1561784500], length 49980
> 14:13:34.010272 IP6 sender > receiver: Flags [P.], seq 4098360:4148340, ack 1, win 256, options [nop,nop,TS val 3425678145 ecr 1561784501], length 49980
> 14:13:34.011271 IP6 sender > receiver: Flags [P.], seq 4148340:4198320, ack 1, win 256, options [nop,nop,TS val 3425678146 ecr 1561784502], length 49980
> 14:13:34.012271 IP6 sender > receiver: Flags [P.], seq 4198320:4248300, ack 1, win 256, options [nop,nop,TS val 3425678147 ecr 1561784503], length 49980
> 14:13:34.013272 IP6 sender > receiver: Flags [P.], seq 4248300:4298280, ack 1, win 256, options [nop,nop,TS val 3425678148 ecr 1561784504], length 49980
> 14:13:34.014271 IP6 sender > receiver: Flags [P.], seq 4298280:4348260, ack 1, win 256, options [nop,nop,TS val 3425678149 ecr 1561784505], length 49980
> 14:13:34.015272 IP6 sender > receiver: Flags [P.], seq 4348260:4398240, ack 1, win 256, options [nop,nop,TS val 3425678150 ecr 1561784506], length 49980
> 14:13:34.016270 IP6 sender > receiver: Flags [P.], seq 4398240:4448220, ack 1, win 256, options [nop,nop,TS val 3425678151 ecr 1561784507], length 49980
> 14:13:34.017269 IP6 sender > receiver: Flags [P.], seq 4448220:4498200, ack 1, win 256, options [nop,nop,TS val 3425678152 ecr 1561784508], length 49980
> 14:13:34.018276 IP6 sender > receiver: Flags [P.], seq 4498200:4548180, ack 1, win 256, options [nop,nop,TS val 3425678153 ecr 1561784509], length 49980
> 14:13:34.019259 IP6 sender > receiver: Flags [P.], seq 4548180:4598160, ack 1, win 256, options [nop,nop,TS val 3425678154 ecr 1561784510], length 49980
> 
> With 200 concurrent flows on a 100Gbit NIC, we can see a reduction
> of TSO packets (and ACK packets) of about 30 %.
> 
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> ---
>  net/ipv4/tcp_output.c | 38 ++++++++++++++++++++++++++++++++++++--
>  1 file changed, 36 insertions(+), 2 deletions(-)
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 5e8665241f9345f38ce56afffe473948aef66786..99a1d88f7f47b9ef0334efe62f8fd34c0d693ced 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2683,6 +2683,36 @@ void tcp_chrono_stop(struct sock *sk, const enum tcp_chrono type)
>  		tcp_chrono_set(tp, TCP_CHRONO_BUSY);
>  }
>  
> +/* First skb in the write queue is smaller than ideal packet size.
> + * Check if we can move payload from the second skb in the queue.
> + */
> +static void tcp_grow_skb(struct sock *sk, struct sk_buff *skb, int amount)
> +{
> +	struct sk_buff *next_skb = skb->next;
> +	unsigned int nlen;
> +
> +	if (tcp_skb_is_last(sk, skb))
> +		return;
> +
> +	if (!tcp_skb_can_collapse(skb, next_skb))
> +		return;
> +
> +	nlen = min_t(u32, amount, next_skb->len);
> +	if (!nlen || !skb_shift(skb, next_skb, nlen))
> +		return;
> +
> +	TCP_SKB_CB(skb)->end_seq += nlen;
> +	TCP_SKB_CB(next_skb)->seq += nlen;
> +
> +	if (!next_skb->len) {
> +		TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq;
> +		TCP_SKB_CB(skb)->eor = TCP_SKB_CB(next_skb)->eor;
> +		TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(next_skb)->tcp_flags;
> +		tcp_unlink_write_queue(next_skb, sk);
> +		tcp_wmem_free_skb(sk, next_skb);
> +	}
> +}
> +
>  /* This routine writes packets to the network.  It advances the
>   * send_head.  This happens as incoming acks open up the remote
>   * window for us.
> @@ -2723,6 +2753,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>  	max_segs = tcp_tso_segs(sk, mss_now);
>  	while ((skb = tcp_send_head(sk))) {
>  		unsigned int limit;
> +		int missing_bytes;
>  
>  		if (unlikely(tp->repair) && tp->repair_queue == TCP_SEND_QUEUE) {
>  			/* "skb_mstamp_ns" is used as a start point for the retransmit timer */
> @@ -2744,6 +2775,10 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>  			else
>  				break;
>  		}
> +		cwnd_quota = min(cwnd_quota, max_segs);
> +		missing_bytes = cwnd_quota * mss_now - skb->len;
> +		if (missing_bytes > 0)
> +			tcp_grow_skb(sk, skb, missing_bytes);
>  
>  		tso_segs = tcp_set_skb_tso_segs(skb, mss_now);
>  
> @@ -2767,8 +2802,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>  		limit = mss_now;
>  		if (tso_segs > 1 && !tcp_urg_mode(tp))
>  			limit = tcp_mss_split_point(sk, skb, mss_now,
> -						    min(cwnd_quota,
> -							max_segs),
> +						    cwnd_quota,
>  						    nonagle);
>  
>  		if (skb->len > limit &&

