[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20220307202917.04005301@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>
Date: Mon, 7 Mar 2022 20:29:17 -0800
From: Jakub Kicinski <kuba@...nel.org>
To: Eric Dumazet <edumazet@...gle.com>
Cc: netdev <netdev@...r.kernel.org>,
Willem de Bruijn <willemb@...gle.com>,
Neal Cardwell <ncardwell@...gle.com>,
Yuchung Cheng <ycheng@...gle.com>
Subject: Re: [RFC net-next] tcp: allow larger TSO to be built under overload
On Mon, 7 Mar 2022 19:50:10 -0800 Eric Dumazet wrote:
> On Mon, Mar 7, 2022 at 7:03 PM Jakub Kicinski <kuba@...nel.org> wrote:
> >
> > We observed Tx-heavy workloads causing softirq overload because
> > with increased load and therefore latency the pacing rates fall,
> > pushing TCP to generate smaller and smaller TSO packets.
>
> Yes, we saw this behavior but came up with something more generic,
> also helping the common case. Cooking larger TSO is really a function
> of the radius (distance between peers)
Excellent, I was hoping you have a better fix :)
> > It seems reasonable to allow larger packets to be built when
> > system is under stress. TCP already uses the
> >
> > this_cpu_ksoftirqd() == current
> >
> > condition as an indication of overload for TSQ scheduling.
> >
> > Signed-off-by: Jakub Kicinski <kuba@...nel.org>
> > ---
> > Sending as an RFC because it seems reasonable, but really
> > I haven't run any large scale testing, yet. Bumping
> > tcp_min_tso_segs to prevent overloads is okay but it
> > seems like we can do better since we only need coarser
> > pacing once disaster strikes?
> >
> > The downsides are that users may have already increased
> > the value to what's needed during overload, or applied
> > the same logic in out-of-tree CA algo implementations
> > (only BBR implements ca_ops->min_tso_segs() upstream).
> >
>
> Unfortunately this would make packetdrill flaky, thus break our tests.
>
> Also, I would guess the pacing decreases because CWND is small anyway,
> or RTT increases ?
Both increase - CWND can go up to the 256-512 bucket (in a histogram)
but latency gets insane as the machine tries to pump out 2kB segments,
doing a lot of splitting and barely services the ACK from the Rx ring.
With a Rx ring of a few thousand packets latency crosses 250ms,
in-building. I've seen srtt_us > 1M.
> What CC are you using ?
A mix of CUBIC and DCTCP for this application, primarily DCTCP.
> The issue I see here is that bi modal behavior will cause all kinds of
> artifacts.
>
> BBR2 has something to give an extra allowance based on min_rtt.
>
> I think we should adopt this for all CC, because it is not bi-modal,
> and even allow full size TSO packets
> for hosts in the same rack.
Using min_rtt makes perfect sense in the case I saw.
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 2319531267c6830b633768dea7f0b40a46633ee1..02ec5866a05ffc2920ead95e9a65cc1ba77459c7
> 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1956,20 +1956,34 @@ static bool tcp_nagle_check(bool partial,
> const struct tcp_sock *tp,
> static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now,
> int min_tso_segs)
> {
> - u32 bytes, segs;
> +/* Use min_rtt to help adapt TSO burst size, with smaller min_rtt resulting
> + * in bigger TSO bursts. By default we cut the RTT-based allowance in half
> + * for every 2^9 usec (aka 512 us) of RTT, so that the RTT-based allowance
> + * is below 1500 bytes after 6 * ~500 usec = 3ms.
> + * Default: halve allowance per 2^9 usecs, 512us.
> + */
> + const u32 rtt_shift = 9;
> + unsigned long bytes;
> + u32 r;
> +
> + bytes = sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift);
> + /* Budget a TSO/GSO burst size allowance based on min_rtt. For every
> + * K = 2^tso_rtt_shift microseconds of min_rtt, halve the burst.
> + * The min_rtt-based burst allowance is: 64 KBytes / 2^(min_rtt/K)
> + */
> + r = tcp_min_rtt(tcp_sk(sk)) >> rtt_shift;
> + if (r < BITS_PER_TYPE(u32))
> + bytes += GSO_MAX_SIZE >> r;
> +
> + bytes = min_t(unsigned long, bytes, sk->sk_gso_max_size);
>
> - bytes = min_t(unsigned long,
> - sk->sk_pacing_rate >> READ_ONCE(sk->sk_pacing_shift),
> - sk->sk_gso_max_size);
>
> /* Goal is to send at least one packet per ms,
> * not one big TSO packet every 100 ms.
> * This preserves ACK clocking and is consistent
> * with tcp_tso_should_defer() heuristic.
> */
> - segs = max_t(u32, bytes / mss_now, min_tso_segs);
> -
> - return segs;
> + return max_t(u32, bytes / mss_now, min_tso_segs);
> }
>
> /* Return the number of segments we want in the skb we are transmitting.
Powered by blists - more mailing lists