Message-ID: <CANn89iJUBujG2AOBYsr0V7qyC5WTgzx0GucO=2ES69tTDJRziw@mail.gmail.com>
Date: Mon, 16 Oct 2023 12:35:40 +0200
From: Eric Dumazet <edumazet@...gle.com>
To: Stefan Wahren <wahrenst@....net>
Cc: Neal Cardwell <ncardwell@...gle.com>, Jakub Kicinski <kuba@...nel.org>, 
	Fabio Estevam <festevam@...il.com>, linux-imx@....com, 
	Stefan Wahren <stefan.wahren@...rgebyte.com>, Michael Heimpold <mhei@...mpold.de>, netdev@...r.kernel.org, 
	Yuchung Cheng <ycheng@...gle.com>
Subject: Re: iperf performance regression since Linux 5.18

On Mon, Oct 16, 2023 at 11:49 AM Eric Dumazet <edumazet@...gle.com> wrote:
>
> On Sun, Oct 15, 2023 at 12:23 PM Stefan Wahren <wahrenst@....net> wrote:
> >
> > Hi,
> >
> > > On 15.10.23 at 01:26, Stefan Wahren wrote:
> > > Hi Eric,
> > >
> > > On 15.10.23 at 00:51, Eric Dumazet wrote:
> > >> On Sat, Oct 14, 2023 at 9:40 PM Neal Cardwell <ncardwell@...gle.com>
> > >> wrote:
> > ...
> > >> Hmm, we receive ~3200 acks per second, I am not sure the
> > >> tcp_tso_should_defer() logic would hurt?
> > >>
> > >> Also the ss binary on the client seems very old, or perhaps its
> > >> output has been mangled?
> > > This binary is from Yocto kirkstone:
> > >
> > > # ss --version
> > > ss utility, iproute2-5.17.0
> > >
> > > This shouldn't be too old. Maybe some missing kernel settings?
> > >
> > I think I was able to fix the issue by enabling the proper kernel
> > settings. I reran the initial bad and good cases and overwrote the log
> > files:
> >
> > https://github.com/lategoodbye/tcp_tso_rtt_log_regress/commit/93615c94ba1bf36bd47cc2b91dd44a3f58c601bc
>
> Excellent, thanks.
>
> I see your kernel uses HZ=100, have you tried HZ=1000 by any chance ?
>
> CONFIG_HZ_1000=y
> CONFIG_HZ=1000
>
> I see that the bad run seems to be stuck for a while with cwnd=66, but
> a smaller number of packets in flight (26 in the following ss extract):
>
> ESTAB 0 315664 192.168.1.12:60542 192.168.1.129:5001
> timer:(on,030ms,0) ino:13011 sk:2 <->
> skmem:(r0,rb131072,t48488,tb295680,f3696,w319888,o0,bl0,d0) ts sack
> cubic wscale:7,6 rto:210 rtt:3.418/1.117 mss:1448 pmtu:1500 rcvmss:536
> advmss:1448 cwnd:66 ssthresh:20 bytes_sent:43874400
> bytes_acked:43836753 segs_out:30302 segs_in:14110 data_segs_out:30300
> send 223681685bps lastsnd:10 lastrcv:4310 pacing_rate 268408200bps
> delivery_rate 46336000bps delivered:30275 busy:4310ms unacked:26
> rcv_space:14480 rcv_ssthresh:64088 notsent:278016 minrtt:0.744
>
> I wonder if the fec pseudo-tso code is adding some kind of artifact,
> maybe in combination with the TCP small queue logic.
> (TX completion might be delayed too much on the fec driver side.)
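
Back-of-the-envelope, just to make the gap in the ss extract above
concrete: with mss 1448, cwnd 66 allows roughly 95 KB in flight, while
unacked 26 is only about 37 KB, so the flow is not cwnd-limited at that
point. A trivial, self-contained sketch of that arithmetic (the numbers
are copied from the ss output, nothing else is assumed):

    /* Illustrative only: values taken from the ss extract above. */
    #include <stdio.h>

    int main(void)
    {
            const unsigned int mss     = 1448; /* bytes per segment (mss:1448) */
            const unsigned int cwnd    = 66;   /* congestion window, in segments */
            const unsigned int unacked = 26;   /* segments currently in flight */

            printf("cwnd allows : %u bytes\n", cwnd * mss);    /* ~95 KB */
            printf("in flight   : %u bytes\n", unacked * mss); /* ~37 KB */
            printf("cwnd used   : %u%%\n", unacked * 100 / cwnd);
            return 0;
    }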

Speaking of TSQ, it seems an old change (commit 75eefc6c59fd "tcp:
tsq: add a shortcut in tcp_small_queue_check()") was accidentally
removed in 2017 by commit 75c119afe14f ("tcp: implement rb-tree based
retransmit queue").

Could you try this fix:

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 9c8c42c280b7638f0f4d94d68cd2c73e3c6c2bcc..e61a3a381d51b554ec8440928e22a290712f0b6b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2542,6 +2542,18 @@ static bool tcp_pacing_check(struct sock *sk)
        return true;
 }

+static bool tcp_rtx_queue_empty_or_single_skb(const struct sock *sk)
+{
+       const struct rb_node *node = sk->tcp_rtx_queue.rb_node;
+
+       /* No skb in the rtx queue. */
+       if (!node)
+               return true;
+
+       /* Only one skb in rtx queue. */
+       return !node->rb_left && !node->rb_right;
+}
+
 /* TCP Small Queues :
  * Control number of packets in qdisc/devices to two packets / or ~1 ms.
  * (These limits are doubled for retransmits)
@@ -2579,12 +2591,12 @@ static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
                limit += extra_bytes;
        }
        if (refcount_read(&sk->sk_wmem_alloc) > limit) {
-               /* Always send skb if rtx queue is empty.
+               /* Always send skb if rtx queue is empty or has one skb.
                 * No need to wait for TX completion to call us back,
                 * after softirq/tasklet schedule.
                 * This helps when TX completions are delayed too much.
                 */
-               if (tcp_rtx_queue_empty(sk))
+               if (tcp_rtx_queue_empty_or_single_skb(sk))
                        return false;

                set_bit(TSQ_THROTTLED, &sk->sk_tsq_flags);
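
For reference, the new helper only has to look at the root of the rtx
rb-tree: an empty tree has no root node, and a tree whose root has
neither a left nor a right child contains exactly one skb. The idea is
that waiting for a (possibly very late) TX completion is only worth it
when there is a real retransmit backlog. A standalone sketch of that
check, using toy types rather than the kernel's struct rb_node/struct
sock:

    /* Toy illustration of the "empty or single node" rb-tree test above.
     * The struct below is a simplified stand-in, not the kernel definition.
     */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    struct toy_rb_node {
            struct toy_rb_node *rb_left;
            struct toy_rb_node *rb_right;
    };

    static bool empty_or_single(const struct toy_rb_node *root)
    {
            if (!root)              /* no node at all */
                    return true;
            /* a root without children is the only node in the tree */
            return !root->rb_left && !root->rb_right;
    }

    int main(void)
    {
            struct toy_rb_node leaf = { NULL, NULL };
            struct toy_rb_node root = { &leaf, NULL };

            printf("empty tree : %d\n", empty_or_single(NULL));  /* 1 */
            printf("one node   : %d\n", empty_or_single(&leaf)); /* 1 */
            printf("two nodes  : %d\n", empty_or_single(&root)); /* 0 */
            return 0;
    }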
