Date:   Mon, 7 Mar 2022 19:50:10 -0800
From:   Eric Dumazet <edumazet@...gle.com>
To:     Jakub Kicinski <kuba@...nel.org>
Cc:     netdev <netdev@...r.kernel.org>,
        Willem de Bruijn <willemb@...gle.com>,
        Neal Cardwell <ncardwell@...gle.com>,
        Yuchung Cheng <ycheng@...gle.com>
Subject: Re: [RFC net-next] tcp: allow larger TSO to be built under overload

On Mon, Mar 7, 2022 at 7:03 PM Jakub Kicinski <kuba@...nel.org> wrote:
>
> We observed Tx-heavy workloads causing softirq overload because
> with increased load and therefore latency the pacing rates fall,
> pushing TCP to generate smaller and smaller TSO packets.

Yes, we saw this behavior but came up with something more generic that
also helps the common case. Cooking larger TSO packets is really a
function of the radius (the distance between peers).
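
To make that concrete, here is a stand-alone user-space sketch of the
pacing-rate-only sizing that the diff below replaces (not kernel code;
it assumes the default sk_pacing_shift of 10, values rounded):

#include <stdio.h>

/* Model of the current tcp_tso_autosize(): the burst is sized from the
 * pacing rate alone, so when pacing collapses under load the TSO size
 * shrinks toward the min_tso_segs floor.
 */
static unsigned int tso_autosize_model(unsigned long pacing_rate, /* bytes/sec */
				       unsigned int mss,
				       unsigned int min_tso_segs,
				       unsigned int gso_max_size)
{
	unsigned long bytes = pacing_rate >> 10;	/* default sk_pacing_shift */

	if (bytes > gso_max_size)
		bytes = gso_max_size;
	return bytes / mss > min_tso_segs ? bytes / mss : min_tso_segs;
}

int main(void)
{
	unsigned long rates[] = { 1250000000UL, 125000000UL, 12500000UL, 1250000UL };

	for (int i = 0; i < 4; i++)
		printf("pacing %10lu B/s -> %2u segs\n",
		       rates[i], tso_autosize_model(rates[i], 1448, 2, 65536));
	return 0;
}

At a 10 Gb/s pacing rate this caps at the 64KB gso_max_size (about 45
MSS), but at 10 Mb/s it is already down to the two-segment floor, which
is exactly the softirq-heavy regime described above.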

> It seems reasonable to allow larger packets to be built when
> system is under stress. TCP already uses the
>
>   this_cpu_ksoftirqd() == current
>
> condition as an indication of overload for TSQ scheduling.
>
> Signed-off-by: Jakub Kicinski <kuba@...nel.org>
> ---
> Sending as an RFC because it seems reasonable, but really
> I haven't run any large scale testing, yet. Bumping
> tcp_min_tso_segs to prevent overloads is okay but it
> seems like we can do better since we only need coarser
> pacing once disaster strikes?
>
> The downsides are that users may have already increased
> the value to what's needed during overload, or applied
> the same logic in out-of-tree CA algo implementations
> (only BBR implements ca_ops->min_tso_segs() upstream).
>

Unfortunately this would make packetdrill flaky and thus break our tests.

Also, I would guess the pacing rate decreases because CWND is small anyway,
or because RTT increases?
Which CC are you using?
The issue I see here is that bimodal behavior will cause all kinds of
artifacts.

BBR2 has something similar: it gives an extra allowance based on min_rtt.

I think we should adopt this for all CCs, because it is not bimodal,
and it would even allow full-size TSO packets for hosts in the same rack.

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2319531267c6830b633768dea7f0b40a46633ee1..02ec5866a05ffc2920ead95e9a65cc1ba77459c7 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1956,20 +1956,34 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
 static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
                            int min_tso_segs)
 {
-       u32 bytes, segs;
+/* Use min_rtt to help adapt TSO burst size, with smaller min_rtt resulting
+ * in bigger TSO bursts. By default we cut the RTT-based allowance in half
+ * for every 2^9 usec (aka 512 us) of RTT, so that the RTT-based allowance
+ * is below 1500 bytes after 6 * ~500 usec = 3ms.
+ * Default: halve allowance per 2^9 usecs, 512us.
+ */
+       const u32 rtt_shift = 9;
+       unsigned long bytes;
+       u32 r;
+
+       bytes = sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift);
+       /* Budget a TSO/GSO burst size allowance based on min_rtt. For every
+        * K = 2^rtt_shift microseconds of min_rtt, halve the burst.
+        * The min_rtt-based burst allowance is: 64 KBytes / 2^(min_rtt/K)
+        */
+       r = tcp_min_rtt(tcp_sk(sk)) >> rtt_shift;
+       if (r < BITS_PER_TYPE(u32))
+               bytes += GSO_MAX_SIZE >> r;
+
+       bytes = min_t(unsigned long, bytes, sk->sk_gso_max_size);

-       bytes = min_t(unsigned long,
-                     sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift),
-                     sk->sk_gso_max_size);

        /* Goal is to send at least one packet per ms,
         * not one big TSO packet every 100 ms.
         * This preserves ACK clocking and is consistent
         * with tcp_tso_should_defer() heuristic.
         */
-       segs = max_t(u32, bytes / mss_now, min_tso_segs);
-
-       return segs;
+       return max_t(u32, bytes / mss_now, min_tso_segs);
 }

 /* Return the number of segments we want in the skb we are transmitting.
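
For illustration (my arithmetic, using the defaults above: rtt_shift = 9,
i.e. K = 512 usec, and GSO_MAX_SIZE = 64KB), the RTT-based term
GSO_MAX_SIZE >> r comes out to roughly:

  min_rtt <  512 usec  : r = 0   -> +65536 bytes (full-size TSO, same rack)
  min_rtt ~  1.5 msec  : r = 2   -> +16384 bytes
  min_rtt ~  3.1 msec  : r = 6   ->  +1024 bytes (below one MSS)
  min_rtt >= ~8.7 msec : r >= 17 ->     +0 bytes (pacing-rate term alone)

so flows inside the datacenter keep large bursts regardless of the pacing
rate, while long-haul flows fall back to the existing pacing-based sizing.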
