netdev - Re: [RFC net-next] tcp: allow larger TSO to be built under overload

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20220307202917.04005301@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
Date:   Mon, 7 Mar 2022 20:29:17 -0800
From:   Jakub Kicinski <kuba@...nel.org>
To:     Eric Dumazet <edumazet@...gle.com>
Cc:     netdev <netdev@...r.kernel.org>,
        Willem de Bruijn <willemb@...gle.com>,
        Neal Cardwell <ncardwell@...gle.com>,
        Yuchung Cheng <ycheng@...gle.com>
Subject: Re: [RFC net-next] tcp: allow larger TSO to be built under overload

On Mon, 7 Mar 2022 19:50:10 -0800 Eric Dumazet wrote:
> On Mon, Mar 7, 2022 at 7:03 PM Jakub Kicinski <kuba@...nel.org> wrote:
> >
> > We observed Tx-heavy workloads causing softirq overload because
> > with increased load and therefore latency the pacing rates fall,
> > pushing TCP to generate smaller and smaller TSO packets.  
> 
> Yes, we saw this behavior but came up with something more generic,
> also helping the common case. Cooking larger TSO is really a function
> of the radius (distance between peers)

Excellent, I was hoping you have a better fix :)

> > It seems reasonable to allow larger packets to be built when
> > system is under stress. TCP already uses the
> >
> >   this_cpu_ksoftirqd() == current
> >
> > condition as an indication of overload for TSQ scheduling.
> >
> > Signed-off-by: Jakub Kicinski <kuba@...nel.org>
> > ---
> > Sending as an RFC because it seems reasonable, but really
> > I haven't run any large scale testing, yet. Bumping
> > tcp_min_tso_segs to prevent overloads is okay but it
> > seems like we can do better since we only need coarser
> > pacing once disaster strikes?
> >
> > The downsides are that users may have already increased
> > the value to what's needed during overload, or applied
> > the same logic in out-of-tree CA algo implementations
> > (only BBR implements ca_ops->min_tso_segs() upstream).
> >  
> 
> Unfortunately this would make packetdrill flaky, thus break our tests.
> 
> Also, I would guess the pacing decreases because CWND is small anyway,
> or RTT increases ?

Both increase - CWND can go up to the 256-512 bucket (in a histogram)
but latency gets insane as the machine tries to pump out 2kB segments,
doing a lot of splitting and barely services the ACK from the Rx ring. 

With a Rx ring of a few thousand packets latency crosses 250ms,
in-building. I've seen srtt_us > 1M.

> What CC are you using ?

A mix of CUBIC and DCTCP for this application, primarily DCTCP.

> The issue I see here is that bi modal behavior will cause all kinds of
> artifacts.
> 
> BBR2 has something to give an extra allowance based on min_rtt.
> 
> I think we should adopt this for all CC, because it is not bi-modal,
> and even allow full size TSO packets
> for hosts in the same rack.

Using min_rtt makes perfect sense in the case I saw.

> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 2319531267c6830b633768dea7f0b40a46633ee1..02ec5866a05ffc2920ead95e9a65cc1ba77459c7
> 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1956,20 +1956,34 @@ static bool tcp_nagle_check(bool partial,
> const struct tcp_sock *tp,
>  static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
>                             int min_tso_segs)
>  {
> -       u32 bytes, segs;
> +/* Use min_rtt to help adapt TSO burst size, with smaller min_rtt resulting
> + * in bigger TSO bursts. By default we cut the RTT-based allowance in half
> + * for every 2^9 usec (aka 512 us) of RTT, so that the RTT-based allowance
> + * is below 1500 bytes after 6 * ~500 usec = 3ms.
> + * Default: halve allowance per 2^9 usecs, 512us.
> + */
> +       const u32 rtt_shift = 9;
> +       unsigned long bytes;
> +       u32 r;
> +
> +       bytes = sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift);
> +       /* Budget a TSO/GSO burst size allowance based on min_rtt. For every
> +        * K = 2^tso_rtt_shift microseconds of min_rtt, halve the burst.
> +        * The min_rtt-based burst allowance is: 64 KBytes / 2^(min_rtt/K)
> +        */
> +       r = tcp_min_rtt(tcp_sk(sk)) >> rtt_shift;
> +       if (r < BITS_PER_TYPE(u32))
> +               bytes += GSO_MAX_SIZE >> r;
> +
> +       bytes = min_t(unsigned long, bytes, sk->sk_gso_max_size);
> 
> -       bytes = min_t(unsigned long,
> -                     sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift),
> -                     sk->sk_gso_max_size);
> 
>         /* Goal is to send at least one packet per ms,
>          * not one big TSO packet every 100 ms.
>          * This preserves ACK clocking and is consistent
>          * with tcp_tso_should_defer() heuristic.
>          */
> -       segs = max_t(u32, bytes / mss_now, min_tso_segs);
> -
> -       return segs;
> +       return max_t(u32, bytes / mss_now, min_tso_segs);
>  }
> 
>  /* Return the number of segments we want in the skb we are transmitting.