[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAK6E8=cw2JNsqm+1x4dJxmOe2xpt_ywAq3TNtaUv4hOaJL9qiA@mail.gmail.com>
Date: Mon, 13 Jun 2016 15:38:24 -0700
From: Yuchung Cheng <ycheng@...gle.com>
To: Daniel Metz <dmetz@...um.de>
Cc: netdev <netdev@...r.kernel.org>,
Hagen Paul Pfeifer <hagen@...u.net>,
Eric Dumazet <edumazet@...gle.com>,
Neal Cardwell <ncardwell@...gle.com>,
Pasi Sarolahti <pasi.sarolahti@....fi>,
Van Jacobson <vanj@...gle.com>
Subject: Re: [PATCH net-next] tcp: use RFC6298 compliant TCP RTO calculation
On Mon, Jun 13, 2016 at 1:45 PM, Daniel Metz <dmetz@...um.de> wrote:
> This patch adjusts Linux RTO calculation to be RFC6298 Standard
> compliant. MinRTO is no longer added to the computed RTO, RTO damping
> and overestimation are decreased.
>
> In RFC 6298 Standard TCP Retransmission Timeout (RTO) calculation the
> calculated RTO is rounded up to the Minimum RTO (MinRTO), if it is
> less. The Linux implementation as a discrepancy to the Standard
> basically adds the defined MinRTO to the calculated RTO. When
> comparing both approaches, the Linux calculation seems to perform
> worse for sender limited TCP flows like Telnet, SSH or constant bit
> rate encoded transmissions, especially for Round Trip Times (RTT) of
> 50ms to 800ms.
>
> Compared to the Linux implementation the RFC 6298 proposed RTO
> calculation performs better and more precise in adapting to current
> network characteristics. Extensive measurements for bulk data did not
> show a negative impact of the adjusted calculation.
>
> Exemplary performance comparison for sender-limited-flows:
>
> - Rate: 10Mbit/s
> - Delay: 200ms, Delay Variation: 10ms
> - Time between each scheduled segment: 1s
> - Amount Data Segments: 300
> - Mean of 11 runs
>
> Mean Response Waiting Time [milliseconds]
>
> PER [%] | 0.5 1 1.5 2 3 5 7 10
> --------+-------------------------------------------------------
> old | 206.4 208.6 218.0 218.6 227.2 249.3 274.7 308.2
> new | 203.9 206.0 207.0 209.9 217.3 225.6 238.7 259.1
>
>
> Detailed Analysis:
>
> https://docs.google.com/document/d/1pKmPfnQb6fDK4qpiNVkN8cQyGE4wYDZukcuZfR-BnnM/
Thanks for the patch. I also have long wanted to evaluate Linux's RTO vs RFC's.
Since this is not a small change, and your patch is only tested on
emulation-based testbed AFAICT, I'd like to try your patch on Google
servers to get more data. But this would take a few days to setup &
collect.
Note that this paper
https://www.cs.helsinki.fi/research/iwtcp/papers/linuxtcp.pdf has
detailed rationale of current design (section 4). IMO having a "tight"
RTO is less necessary now after TLP. I am also testing a new set of
patches to install a quick reordering timer. But it's worth mentioning
the paper in the commit message.
>
>
> Signed-off-by: Hagen Paul Pfeifer <hagen@...u.net>
> Signed-off-by: Daniel Metz <dmetz@...um.de>
> Cc: Eric Dumazet <edumazet@...gle.com>
> Cc: Yuchung Cheng <ycheng@...gle.com>
> Cc: Neal Cardwell <ncardwell@...gle.com>
> ---
>
> We removed outdated comments in the code, though more cleanup may required.
>
>
> net/ipv4/tcp_input.c | 74 ++++++++++++----------------------------------------
> 1 file changed, 17 insertions(+), 57 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index d6c8f4cd0..a0f66f8 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -680,8 +680,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
> /* Called to compute a smoothed rtt estimate. The data fed to this
> * routine either comes from timestamps, or from segments that were
> * known _not_ to have been retransmitted [see Karn/Partridge
> - * Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88
> - * piece by Van Jacobson.
> + * Proceedings SIGCOMM 87].
> * NOTE: the next three routines used to be one big routine.
> * To save cycles in the RFC 1323 implementation it was better to break
> * it up into three procedures. -- erics
> @@ -692,59 +691,21 @@ static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
> long m = mrtt_us; /* RTT */
> u32 srtt = tp->srtt_us;
>
> - /* The following amusing code comes from Jacobson's
> - * article in SIGCOMM '88. Note that rtt and mdev
> - * are scaled versions of rtt and mean deviation.
> - * This is designed to be as fast as possible
> - * m stands for "measurement".
> - *
> - * On a 1990 paper the rto value is changed to:
> - * RTO = rtt + 4 * mdev
> - *
> - * Funny. This algorithm seems to be very broken.
> - * These formulae increase RTO, when it should be decreased, increase
> - * too slowly, when it should be increased quickly, decrease too quickly
> - * etc. I guess in BSD RTO takes ONE value, so that it is absolutely
> - * does not matter how to _calculate_ it. Seems, it was trap
> - * that VJ failed to avoid. 8)
> - */
> if (srtt != 0) {
> - m -= (srtt >> 3); /* m is now error in rtt est */
> - srtt += m; /* rtt = 7/8 rtt + 1/8 new */
> - if (m < 0) {
> - m = -m; /* m is now abs(error) */
> - m -= (tp->mdev_us >> 2); /* similar update on mdev */
> - /* This is similar to one of Eifel findings.
> - * Eifel blocks mdev updates when rtt decreases.
> - * This solution is a bit different: we use finer gain
> - * for mdev in this case (alpha*beta).
> - * Like Eifel it also prevents growth of rto,
> - * but also it limits too fast rto decreases,
> - * happening in pure Eifel.
> - */
> - if (m > 0)
> - m >>= 3;
> - } else {
> - m -= (tp->mdev_us >> 2); /* similar update on mdev */
> - }
> - tp->mdev_us += m; /* mdev = 3/4 mdev + 1/4 new */
> - if (tp->mdev_us > tp->mdev_max_us) {
> - tp->mdev_max_us = tp->mdev_us;
> - if (tp->mdev_max_us > tp->rttvar_us)
> - tp->rttvar_us = tp->mdev_max_us;
> - }
> - if (after(tp->snd_una, tp->rtt_seq)) {
> - if (tp->mdev_max_us < tp->rttvar_us)
> - tp->rttvar_us -= (tp->rttvar_us - tp->mdev_max_us) >> 2;
> + m -= (srtt >> 3); /* m' = m - srtt / 8 = (R' - SRTT) */
> + srtt += m; /* srtt = srtt + m’ = srtt + m - srtt / 8 */
> + if (m < 0)
> + m = -m;
> + m -= (tp->mdev_us >> 2); /* m'' = |m'| - mdev / 4 */
> + tp->mdev_us += m;
> + tp->rttvar_us = tp->mdev_us;
> + if (after(tp->snd_una, tp->rtt_seq))
> tp->rtt_seq = tp->snd_nxt;
> - tp->mdev_max_us = tcp_rto_min_us(sk);
> - }
> } else {
> /* no previous measure. */
> - srtt = m << 3; /* take the measured time to be rtt */
> - tp->mdev_us = m << 1; /* make sure rto = 3*rtt */
> - tp->rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk));
> - tp->mdev_max_us = tp->rttvar_us;
> + srtt = m << 3;
> + tp->mdev_us = m << 1;
> + tp->rttvar_us = tp->mdev_us;
> tp->rtt_seq = tp->snd_nxt;
> }
> tp->srtt_us = max(1U, srtt);
> @@ -798,6 +759,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
> */
> static void tcp_set_rto(struct sock *sk)
> {
> + const u32 min_rto = tcp_rto_min_us(sk);
> const struct tcp_sock *tp = tcp_sk(sk);
> /* Old crap is replaced with new one. 8)
> *
> @@ -809,17 +771,15 @@ static void tcp_set_rto(struct sock *sk)
> * is invisible. Actually, Linux-2.4 also generates erratic
> * ACKs in some circumstances.
> */
> - inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
> -
> + if (((tp->srtt_us >> 3) + tp->rttvar_us) < min_rto)
> + inet_csk(sk)->icsk_rto = usecs_to_jiffies(min_rto);
> + else
> + inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
> /* 2. Fixups made earlier cannot be right.
> * If we do not estimate RTO correctly without them,
> * all the algo is pure shit and should be replaced
> * with correct one. It is exactly, which we pretend to do.
> */
> -
> - /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
> - * guarantees that rto is higher.
> - */
> tcp_bound_rto(sk);
> }
>
> --
> 2.8.3
>
Powered by blists - more mailing lists