Date: Sat, 28 Dec 2019 08:31:16 -0800
From: Eric Dumazet <edumazet@...gle.com>
To: Neal Cardwell <ncardwell@...gle.com>
Cc: "David S . Miller" <davem@...emloft.net>, netdev <netdev@...r.kernel.org>,
        Eric Dumazet <eric.dumazet@...il.com>, Soheil Hassas Yeganeh <soheil@...gle.com>,
        Yuchung Cheng <ycheng@...gle.com>, Martin KaFai Lau <kafai@...com>
Subject: Re: [PATCH net-next v2 4/5] tcp_cubic: tweak Hystart detection for
        short RTT flows

On Fri, Dec 27, 2019 at 8:01 AM Neal Cardwell <ncardwell@...gle.com> wrote:
>
> On Mon, Dec 23, 2019 at 3:28 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > After switching ca->delay_min to usec resolution, we exit
> > slow start prematurely for very low RTT flows, setting
> > snd_ssthresh to 20.
> >
> > The reason is that delay_min is fed with RTT of small packet
> > trains. Then as cwnd is increased, TCP sends bigger TSO packets.
> >
> > LRO/GRO aggregation and/or interrupt mitigation strategies
> > on receiver tend to inflate RTT samples.
> >
> > Fix this by adding to delay_min the expected delay of
> > two TSO packets, given current pacing rate.
> ...
> > diff --git a/net/ipv4/tcp_cubic.c b/net/ipv4/tcp_cubic.c
> > index 068775b91fb5790e6e60a6490b49e7a266e4ed51..0e5428ed04fe4e50627e21a53c3d17f9f2dade4d 100644
> > --- a/net/ipv4/tcp_cubic.c
> > +++ b/net/ipv4/tcp_cubic.c
> > @@ -436,8 +436,27 @@ static void bictcp_acked(struct sock *sk, const struct ack_sample *sample)
> >                 delay = 1;
> >
> >         /* first time call or link delay decreases */
> > -       if (ca->delay_min == 0 || ca->delay_min > delay)
> > -               ca->delay_min = delay;
> > +       if (ca->delay_min == 0 || ca->delay_min > delay) {
> > +               unsigned long rate = READ_ONCE(sk->sk_pacing_rate);
> > +
> > +               /* Account for TSO/GRO delays.
> > +                * Otherwise short RTT flows could get too small ssthresh,
> > +                * since during slow start we begin with small TSO packets
> > +                * and could lower ca->delay_min too much.
> > +                * Ideally even with a very small RTT we would like to have
> > +                * at least one TSO packet being sent and received by GRO,
> > +                * and another one in qdisc layer.
> > +                * We apply another 100% factor because @rate is doubled at
> > +                * this point.
>
> This comment mentions "@rate is doubled at this point". But AFAICT
> this part of the code is executed on every ACK, not just in slow
> start, so the rate is not really always doubled at this point. It
> might be more clear/precise to say "@rate can be doubled at this
> point"?
>
> > +                * We cap the cushion to 1ms.
> > +                */
> > +               if (rate)
> > +                       delay += min_t(u64, USEC_PER_MSEC,
> > +                                      div64_ul((u64)GSO_MAX_SIZE *
> > +                                               4 * USEC_PER_SEC, rate));
> > +               if (ca->delay_min == 0 || ca->delay_min > delay)
> > +                       ca->delay_min = delay;
> > +       }
> >
> >         /* hystart triggers when cwnd is larger than some threshold */
> >         if (!ca->found && hystart && tcp_in_slow_start(tp) &&
>
> AFAICT there may be some CPU usage impact here for high-speed
> intra-datacenter flows due to an extra div64_ul() being executed on
> most ACKs? It seems like if 'ca->delay_min > delay' then the code
> executes the div64_ul() to calculate the cushion and add it to
> 'delay'. Then the ca->delay_min is recorded with the cushion added.
>
> But the first 'ca->delay_min > delay' comparison that determines
> whether we execute the div64_ul() is comparing a ca->delay_min that
> has a cushion included to the 'delay' that has no cushion included.
> So if the raw 'delay' value is within the cushion of the raw min_rtt
> value, then this 'ca->delay_min > delay' expression will evaluate to
> true, and we'll run the div64_ul() even though there is probably not
> a need to.
>
> How big is the cushion? In back of the envelope terms, for a 10
> Gbit/sec link where CUBIC has exited the initial slow start and is
> using a pacing rate matching the link rate, then AFAICT the cushion
> should be roughly:
> 4 * 65536 * 1000000 / (1.2*10*10^9/8) ~= 175 usecs
> So if that math is in the right ballpark, then any RTT sample that is
> within roughly 175 usec of the true min_rtt is going to have
> 'ca->delay_min > delay' evaluate to true, and run the div64_ul() when
> there is probably no need.

In my tests the cushion was about ~100 usec for 10Gbit links.

> AFAICT there is also some risk with this patch that the steady-state
> behavior of CUBIC becomes slightly more aggressive for moderate-rate,
> low-RTT cases. In such cases this up to ~1ms of cushion to delay_min
> will cause the bictcp_update() line:
>
>   t += usecs_to_jiffies(ca->delay_min);
>
> to calculate the target cwnd at a point in time that is ~1ms further
> out than before, and thus possibly grow the cwnd faster than it would
> have before.
>
> To avoid the risk of extra divisions on many ACKs, and the risk of
> more aggressive cwnd behavior in the steady state, WDYT about an
> approach where the cushioned delay_min used by your revised Hystart
> code (ACK train heuristic and delay heuristic) is maintained
> separately from the "pure" delay_min used elsewhere in the CUBIC code,
> perhaps as a new ca->delay_min_hystart field? I am wondering about an
> approach something like:
>
> /* Account for TSO/GRO delays. ...
>  */
> static u32 update_delay_min_hystart(struct sock *sk)
> {
>         unsigned long rate = READ_ONCE(sk->sk_pacing_rate);
>         u32 ack_delay = 0;
>
>         if (ca->found)
>                 return; /* Hystart is done */
>
>         if (rate)
>                 ack_delay = min_t(u64, USEC_PER_MSEC,
>                                   div64_ul((u64)GSO_MAX_SIZE *
>                                            4 * USEC_PER_SEC, rate));
>         ca->delay_min_hystart = ca->delay_min + ack_delay;

Note that while testing this, I found that ca->delay_min_hystart was
latched while the initial sk_pacing_rate is rather small.

This means we exit slow start with a much higher cwnd/ssthresh
(Hystart triggers too late IMO)

$ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
4209
13236
3602
16635
16139
4159
3963
3956
4319
4121
TcpExtTCPHystartTrainDetect     7                  0.0
TcpExtTCPHystartTrainCwnd       8256               0.0

$ dmesg|tail -7
[ 144.637752] ack_train delay_min 23 (delay_min_hystart 348) cwnd 1704
[ 146.924494] ack_train delay_min 28 (delay_min_hystart 202) cwnd 480
[ 150.338917] ack_train delay_min 20 (delay_min_hystart 264) cwnd 1614
[ 151.481355] ack_train delay_min 23 (delay_min_hystart 172) cwnd 1026
[ 152.607829] ack_train delay_min 20 (delay_min_hystart 241) cwnd 1071
[ 153.742571] ack_train delay_min 20 (delay_min_hystart 237) cwnd 1434
[ 154.890514] ack_train delay_min 21 (delay_min_hystart 116) cwnd 927

We might have to compute the ack_delay even if ca->delay_min has not been
lowered (since sk_pacing_rate is doubling every RTT regardless of
delay_min changes).

> }
>
> ...
>         if (ca->delay_min == 0 || ca->delay_min > delay) {
>                 ca->delay_min = delay;
>                 update_delay_min_hystart(sk);
>         }
>
> WDYT?

I think these are all very good suggestions Neal, I cooked a patch that I
will send after more tests.

Thanks for spending time on this stuff, especially at this time of the year :)
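To make Neal's back-of-the-envelope cushion estimate concrete, here is a
minimal userspace sketch (not kernel code) of the same arithmetic. It
assumes GSO_MAX_SIZE == 65536 (matching the 65536 used in the estimate)
and takes the pacing rate in bytes per second, as sk_pacing_rate does:

/* Userspace sketch of the cushion computed in the patch above:
 *   min_t(u64, USEC_PER_MSEC,
 *         div64_ul((u64)GSO_MAX_SIZE * 4 * USEC_PER_SEC, rate))
 * Assumption: GSO_MAX_SIZE == 65536, rate in bytes per second.
 */
#include <stdint.h>
#include <stdio.h>

#define GSO_MAX_SIZE  65536ULL
#define USEC_PER_SEC  1000000ULL
#define USEC_PER_MSEC 1000ULL

static uint64_t cushion_usec(uint64_t rate_bytes_per_sec)
{
        uint64_t c;

        if (!rate_bytes_per_sec)
                return 0;
        /* expected delay of four GSO_MAX_SIZE chunks at the pacing rate */
        c = GSO_MAX_SIZE * 4 * USEC_PER_SEC / rate_bytes_per_sec;
        /* cap the cushion to 1 ms, as the patch does */
        return c < USEC_PER_MSEC ? c : USEC_PER_MSEC;
}

int main(void)
{
        /* 10 Gbit/sec link, pacing rate = 1.2 * link rate, in bytes/sec */
        uint64_t rate = 12ULL * 1000000000ULL / 8;      /* 1.5e9 B/s */

        printf("cushion at ~10G pacing: %llu usec\n",
               (unsigned long long)cushion_usec(rate));
        return 0;
}

With that pacing rate the program prints 174 usec, in line with the ~175
usec estimate above and the same order of magnitude as the ~100 usec Eric
observed on 10Gbit links.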