lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Sat, 28 Dec 2019 08:31:16 -0800
From:   Eric Dumazet <edumazet@...gle.com>
To:     Neal Cardwell <ncardwell@...gle.com>
Cc:     "David S . Miller" <davem@...emloft.net>,
        netdev <netdev@...r.kernel.org>,
        Eric Dumazet <eric.dumazet@...il.com>,
        Soheil Hassas Yeganeh <soheil@...gle.com>,
        Yuchung Cheng <ycheng@...gle.com>,
        Martin KaFai Lau <kafai@...com>
Subject: Re: [PATCH net-next v2 4/5] tcp_cubic: tweak Hystart detection for
 short RTT flows

On Fri, Dec 27, 2019 at 8:01 AM Neal Cardwell <ncardwell@...gle.com> wrote:
>
> On Mon, Dec 23, 2019 at 3:28 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > After switching ca->delay_min to usec resolution, we exit
> > slow start prematurely for very low RTT flows, setting
> > snd_ssthresh to 20.
> >
> > The reason is that delay_min is fed with RTT of small packet
> > trains. Then as cwnd is increased, TCP sends bigger TSO packets.
> >
> > LRO/GRO aggregation and/or interrupt mitigation strategies
> > on receiver tend to inflate RTT samples.
> >
> > Fix this by adding to delay_min the expected delay of
> > two TSO packets, given current pacing rate.
> ...
> > diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
> > index 068775b91fb5790e6e60a6490b49e7a266e4ed51..0e5428ed04fe4e50627e21a53c3d17f9f2dade4d 100644
> > --- a/net/ipv4/tcp_cubic.c
> > +++ b/net/ipv4/tcp_cubic.c
> > @@ -436,8 +436,27 @@ static void bictcp_acked(struct sock *sk, const struct ack_sample *sample)
> >                 delay = 1;
> >
> >         /* first time call or link delay decreases */
> > -       if (ca->delay_min == 0 || ca->delay_min > delay)
> > -               ca->delay_min = delay;
> > +       if (ca->delay_min == 0 || ca->delay_min > delay) {
> > +               unsigned long rate = READ_ONCE(sk->sk_pacing_rate);
> > +
> > +               /* Account for TSO/GRO delays.
> > +                * Otherwise short RTT flows could get too small ssthresh,
> > +                * since during slow start we begin with small TSO packets
> > +                * and could lower ca->delay_min too much.
> > +                * Ideally even with a very small RTT we would like to have
> > +                * at least one TSO packet being sent and received by GRO,
> > +                * and another one in qdisc layer.
> > +                * We apply another 100% factor because @rate is doubled at
> > +                * this point.
>
> This comment mentions "@rate is doubled at this point". But AFAICT
> this part of the code is executed on every ACK, not just in slow
> start, so the rate is not really always doubled at this point. It
> might be more clear/precise to say  "@rate can be doubled at this
> point"?
>
>
> > +                * We cap the cushion to 1ms.
> > +                */
> > +               if (rate)
> > +                       delay += min_t(u64, USEC_PER_MSEC,
> > +                                      div64_ul((u64)GSO_MAX_SIZE *
> > +                                               4 * USEC_PER_SEC, rate));
> > +               if (ca->delay_min == 0 || ca->delay_min > delay)
> > +                       ca->delay_min = delay;
> > +       }
> >
> >         /* hystart triggers when cwnd is larger than some threshold */
> >         if (!ca->found && hystart && tcp_in_slow_start(tp) &&
>
> AFAICT there may be some CPU usage impact here for high-speed
> intra-datacenter flows due to an extra div64_ul() being executed on
> most ACKs? It seems like if 'ca->delay_min > delay' then the code
> executes the div64_ul() to calculate the cushion and add it to
> 'delay'. Then the  ca->delay_min is recorded with the cushion added.
>
> But the first 'ca->delay_min > delay' comparison that determines
> whether we execute the div64_ul() is comparing a ca->delay_min that
> has a cushion included to the 'delay' that has no cushion included. So
> if the raw 'delay' value is within the cushion of the raw min_rtt
> value, then this 'ca->delay_min > delay' expression will evaluate to
> true, and we'll run the div64_ul() even though there is probably not a
> need to.
>
> How big is the cushion? In back of the envelope terms, for a 10
> Gbit/sec link where CUBIC has exited the initial slow start and is
> using a pacing rate matching the link rate, then AFAICT the cushion
> should be roughly:
>   4 * 65536 * 1000000 / (1.2*10*10^9/8) ~= 175 usecs
> So if that math is in the right ballpark, then any RTT sample that is
> within roughly 175 usec of the true min_rtt is going to have
> 'ca->delay_min > delay' evaluate to true, and run the div64_ul() when
> there is probably no need.

In my tests the cushion was about ~100 usec for 10Gbit links.

>
> AFAICT there is also some risk with this patch that the steady-state
> behavior of CUBIC becomes slightly more aggressive for moderate-rate,
> low-RTT cases. In such cases this up to ~1ms of cushion to delay_min
> will cause the bictcp_update() line:
>
>  t += usecs_to_jiffies(ca->delay_min);
>
> to calculate the target cwnd at a point in time that is ~1ms further
> out than before, and thus possibly grow the cwnd faster than it would
> have before.
>
> To avoid the risk of extra divisions on many ACKs, and the risk of
> more aggressive cwnd behavior in the steady state, WDYT about an
> approach where the cushioned delay_min used by your revised Hystart
> code (ACK train heuristic and delay heuristic) is maintained
> separately from the "pure" delay_min used elsewhere in the CUBIC code,
> perhaps as a new ca->delay_min_hystart field? I am wondering about an
> approach something like:
>
> /* Account for TSO/GRO delays. ...
> */
> static u32 update_delay_min_hystart(struct sock *sk)
> {
>         unsigned long rate = READ_ONCE(sk->sk_pacing_rate);
>         u32 ack_delay = 0;
>
>         if (ca->found)
>                 return;  /* Hystart is done */
>
>         if (rate)
>                 ack_delay = min_t(u64, USEC_PER_MSEC,
>                                   div64_ul((u64)GSO_MAX_SIZE *
>                                            4 * USEC_PER_SEC, rate));
>         ca->delay_min_hystart = ca->delay_min + ack_delay;

Note that while testing this, I found that ca->delay_min_hystart was
latched while initial sk_pacing_rate is rather small.

This means we exit slow start with much higher cwnd/ssthresh (Hystart
triggers too late IMO)

$ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l
-4000000; done;nstat|egrep "Hystart"
   4209
  13236
   3602
  16635
  16139
   4159
   3963
   3956
   4319
   4121
TcpExtTCPHystartTrainDetect     7                  0.0
TcpExtTCPHystartTrainCwnd       8256               0.0

$ dmesg|tail -7
[  144.637752]  ack_train delay_min 23 (delay_min_hystart 348) cwnd 1704
[  146.924494]  ack_train delay_min 28 (delay_min_hystart 202) cwnd 480
[  150.338917]  ack_train delay_min 20 (delay_min_hystart 264) cwnd 1614
[  151.481355]  ack_train delay_min 23 (delay_min_hystart 172) cwnd 1026
[  152.607829]  ack_train delay_min 20 (delay_min_hystart 241) cwnd 1071
[  153.742571]  ack_train delay_min 20 (delay_min_hystart 237) cwnd 1434
[  154.890514]  ack_train delay_min 21 (delay_min_hystart 116) cwnd 927


We might have to compute the ack_delay even if ca->delay_min has not
been lowered (since sk_pacing_rate is doubling every RTT regardless of
delay_min changes)

> }
>
> ...
>         if (ca->delay_min == 0 || ca->delay_min > delay) {
>                 ca->delay_min = delay;
>                 update_delay_min_hystart(sk);
>         }
>
> WDYT?


I think these are all very good suggestions Neal, I cooked a patch
that I will send after more tests.

Thanks for spending time on this stuff, especially at this time of the year :)

Powered by blists - more mailing lists