Message-ID: <CAPshTCg8W0QxubvO_dWBsSOqvCEdBSWLKv3cNOvn=VbyqHA1zg@mail.gmail.com>
Date:	Tue, 5 Jun 2012 10:42:21 -0700
From:	Jerry Chu <hkchu@...gle.com>
To:	Damian Lukowski <damian@....rwth-aachen.de>
Cc:	Netdev <netdev@...r.kernel.org>,
	David Miller <davem@...emloft.net>,
	Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
Subject: Re: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close
 threshold as a time value.

On Mon, Jun 4, 2012 at 4:50 PM, Jerry Chu <hkchu@...gle.com> wrote:
> Hi Damian,
>
> On Mon, Jun 4, 2012 at 10:50 AM, Damian Lukowski
> <damian@....rwth-aachen.de> wrote:
>> Hi Jerry,
>>
>> please verify that I understood you correctly.
>>
>> You have set TCP_RTO_MIN to a lower value, e.g. 0.002 seconds to improve
>> your internal low-latency traffic. Because of the improvement, R1
>> timeouts are triggered too fast for external high-RTT traffic. Is that
>> correct?
>
> Correct.
>
>> If so, may I suggest setting tcp_retries1 to a higher value? For
>> TCP_RTO_MIN == 0.002 and tcp_retries1 == 10, R1 will be calculated as
>> approximately 4 seconds.
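Damian's number can be sanity-checked against the patch's closed-form limit with the sketch below (a standalone userspace illustration; the 120000 ms TCP_RTO_MAX and 2 ms TCP_RTO_MIN values are assumptions taken from this discussion, not from kernel headers):

```c
#include <assert.h>

/* Userspace equivalent of the kernel's ilog2() for this sketch. */
static int ilog2_u(unsigned long v)
{
	int r = -1;

	while (v) {
		v >>= 1;
		r++;
	}
	return r;
}

/* The patch's time threshold after `boundary` exponentially backed-off
 * retransmissions, starting at rto_min (ms) and capping at rto_max (ms). */
static long retrans_limit_ms(long rto_min, long rto_max, unsigned int boundary)
{
	int K = ilog2_u(rto_max / rto_min);

	if (boundary <= (unsigned int)K)
		return ((2L << boundary) - 1) * rto_min;
	return ((2L << K) - 1) * rto_min + (boundary - K) * rto_max;
}
```

With rto_min = 2 ms and boundary = 10, retrans_limit_ms() returns ((2 << 10) - 1) * 2 = 4094 ms, i.e. the roughly 4 seconds quoted above.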
>
> I think hacking tcp_retries1 is the wrong solution. E.g., 10 retries may be too
> generous for those short RTT flows.
>
> I think the fundamental problem is that the ideal fix for your original
> RTO-revert problem should have used the per-flow RTO to compute R1 & R2.
> But that computation may have seemed too expensive, so you used
> TCP_RTO_MIN as an approximation - not a good idea IMHO!

I just realized that the correct fix, using the original non-backed-off
per-flow RTO, is no more expensive than the current ilog2()-based code.
All that's needed is a new field, "base_rto", to record the original RTO
before backoff. I'm now leaning toward this more accurate fix, without
any fudging, because fudging almost always causes bugs.
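A minimal userspace sketch of how such a "base_rto" fix might look (the field and function names here are hypothetical, not from any posted patch): the closed-form limit stays exactly as in the current code, but is keyed off the flow's own pre-backoff RTO instead of TCP_RTO_MIN.

```c
#include <assert.h>

#define RTO_MAX_MS 120000L	/* stand-in for TCP_RTO_MAX, in ms */

/* Hypothetical per-flow variant of retransmits_timed_out()'s limit:
 * the same ilog2()-style closed form, with the flow's base_rto (the
 * RTO recorded before any backoff) replacing TCP_RTO_MIN. */
static long flow_limit_ms(long base_rto, unsigned int boundary)
{
	long v = RTO_MAX_MS / base_rto;
	unsigned int K = 0;

	while (v > 1) {		/* ilog2() equivalent for this sketch */
		v >>= 1;
		K++;
	}
	if (boundary <= K)
		return ((2L << boundary) - 1) * base_rto;
	return ((2L << K) - 1) * base_rto + (boundary - K) * RTO_MAX_MS;
}
```

With boundary = 3, a long-haul flow with a 200 ms base RTO gets a 3000 ms limit while a 2 ms low-latency flow gets 30 ms, so each flow's threshold tracks its own timescale.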

Any comments are welcome. I'm also not sure whether the existing
exponential-backoff-based computation makes sense for thin streams, but
that's a separate question, so I won't touch it.

Jerry

>
> The easiest solution I can see so far is to replace the check
>
> if (!inet_csk(sk)->icsk_retransmits)
>                return false;
>
> at the beginning of retransmits_timed_out() with
>
> if (inet_csk(sk)->icsk_retransmits < boundary)
>                return false;
>
> Best,
>
> Jerry
>
>>
>> Is that ok?
>>
>> Best regards
>>  Damian
>>
>> On Friday, 2012-06-01 at 15:58 -0700, Jerry Chu wrote:
>>> > From: Damian Lukowski <damian@....rwth-aachen.de>
>>> > Date: Wed, Aug 26, 2009 at 3:16 AM
>>> > Subject: [PATCH 3/3] Revert Backoff [v3]: Calculate TCP's connection close
>>> > threshold as a time value.
>>> > To: Netdev <netdev@...r.kernel.org>
>>> >
>>> >
>>> > RFC 1122 specifies two threshold values R1 and R2 for connection timeouts,
>>> > which may represent a number of allowed retransmissions or a timeout value.
>>> > Currently Linux uses sysctl_tcp_retries{1,2} to specify the thresholds
>>> > in number of allowed retransmissions.
>>> >
>>> > For any desired threshold R2 (by means of time) one can specify tcp_retries2
>>> > (by means of number of retransmissions) such that TCP will not time out
>>> > earlier than R2. This is the case, because the RTO schedule follows a fixed
>>> > pattern, namely exponential backoff.
>>> >
>>> > However, the RTO behaviour is no longer predictable if RTO backoffs
>>> > can be reverted, as is the case in the draft
>>> > "Make TCP more Robust to Long Connectivity Disruptions"
>>> > (http://tools.ietf.org/html/draft-zimmermann-tcp-lcd).
>>> >
>>> > In the worst case TCP would time out a connection after 3.2 seconds, if the
>>> > initial RTO equaled MIN_RTO and each backoff has been reverted.
>>> >
>>> > This patch introduces a function retransmits_timed_out(N),
>>> > which calculates the timeout of a TCP connection, assuming an initial
>>> > RTO of MIN_RTO and N unsuccessful, exponentially backed-off retransmissions.
>>> >
>>> > Whenever timeout decisions are made by comparing the retransmission counter
>>> > to some value N, this function can be used, instead.
>>> >
>>> > The meaning of tcp_retries2 will be changed, as many more RTO
>>> > retransmissions can occur than the value indicates. However, it yields
>>> > a timeout similar to that of an unpatched, exponentially backing-off
>>> > TCP in the same scenario. As no application could rely on an RTO
>>> > greater than MIN_RTO, there should be no risk of a regression.
>>>
>>> This looks like a typical "fix one problem, introduce a few more" patch :(.
>>> What do you mean by "no application could rely on an RTO greater than
>>> MIN_RTO..." above? How can you assume that the RTO is not too far off
>>> from TCP_RTO_MIN?
>>>
>>> While you tried to address a problem where the retransmission count
>>> was high but the actual timeout duration was too short, have you
>>> considered the other way around, i.e., where the timeout duration is
>>> long but the retransmission count is too low? This is exactly what's
>>> happening to us with your patch. We have much reduced TCP_RTO_MIN for
>>> our internal traffic, but did not notice until now that your change has
>>> severely shortened the R1 & R2 recommended by RFC 1122 for our
>>> long-haul traffic. In many cases the R1 threshold was met upon the
>>> first retransmission timeout.
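The failure mode described above can be illustrated with assumed numbers (the 2 ms TCP_RTO_MIN and tcp_retries1 = 3 here are illustrative, not anyone's actual settings):

```c
#include <assert.h>

/* The patch's R1 limit for the boundary <= K branch, in ms:
 * the sum of `boundary` doublings starting from rto_min. */
static long r1_limit_ms(long rto_min, unsigned int boundary)
{
	return ((2L << boundary) - 1) * rto_min;
}
```

With rto_min = 2 ms and boundary = 3, the R1 time limit is only 30 ms; a long-haul flow whose real RTO is, say, 200 ms exceeds it on its very first retransmission timeout.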
>>>
>>> I think retransmits_timed_out() should check against both the time
>>> duration and the retransmission count (icsk_retransmits).
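A sketch of that combined predicate (illustrative only, not a posted patch): the connection is considered timed out only once both the elapsed time and the retransmission count have crossed their thresholds, so a lowered TCP_RTO_MIN alone can no longer trip R1/R2 on the first retransmission.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative dual check: require both the time-based limit and the
 * retransmission-count boundary to be reached before timing out. */
static bool timed_out(long elapsed_ms, long limit_ms,
		      unsigned int retransmits, unsigned int boundary)
{
	return elapsed_ms >= limit_ms && retransmits >= boundary;
}
```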
>>>
>>> Thoughts?
>>>
>>> Jerry
>>>
>>> >
>>> > Signed-off-by: Damian Lukowski <damian@....rwth-aachen.de>
>>> > ---
>>> >  include/net/tcp.h    |   18 ++++++++++++++++++
>>> >  net/ipv4/tcp_timer.c |   11 +++++++----
>>> >  2 files changed, 25 insertions(+), 4 deletions(-)
>>> >
>>> > diff --git a/include/net/tcp.h b/include/net/tcp.h
>>> > index c35b329..17d1a88 100644
>>> > --- a/include/net/tcp.h
>>> > +++ b/include/net/tcp.h
>>> > @@ -1247,6 +1247,24 @@ static inline struct sk_buff *tcp_write_queue_prev(struct sock *sk, struct sk_bu
>>> >  #define tcp_for_write_queue_from_safe(skb, tmp, sk)                    \
>>> >        skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
>>> >
>>> > +static inline bool retransmits_timed_out(const struct sock *sk,
>>> > +                                        unsigned int boundary)
>>> > +{
>>> > +       int limit, K;
>>> > +       if (!inet_csk(sk)->icsk_retransmits)
>>> > +               return false;
>>> > +
>>> > +       K = ilog2(TCP_RTO_MAX/TCP_RTO_MIN);
>>> > +
>>> > +       if (boundary <= K)
>>> > +               limit = ((2 << boundary) - 1) * TCP_RTO_MIN;
>>> > +       else
>>> > +               limit = ((2 << K) - 1) * TCP_RTO_MIN +
>>> > +                       (boundary - K) * TCP_RTO_MAX;
>>> > +
>>> > +       return (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= limit;
>>> > +}
>>> > +
>>> >  static inline struct sk_buff *tcp_send_head(struct sock *sk)
>>> >  {
>>> >        return sk->sk_send_head;
>>> > diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>>> > index a3ba494..2972d7b 100644
>>> > --- a/net/ipv4/tcp_timer.c
>>> > +++ b/net/ipv4/tcp_timer.c
>>> > @@ -137,13 +137,14 @@ static int tcp_write_timeout(struct sock *sk)
>>> >  {
>>> >        struct inet_connection_sock *icsk = inet_csk(sk);
>>> >        int retry_until;
>>> > +       bool do_reset;
>>> >
>>> >        if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>>> >                if (icsk->icsk_retransmits)
>>> >                        dst_negative_advice(&sk->sk_dst_cache);
>>> >                retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
>>> >        } else {
>>> > -               if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
>>> > +               if (retransmits_timed_out(sk, sysctl_tcp_retries1)) {
>>> >                        /* Black hole detection */
>>> >                        tcp_mtu_probing(icsk, sk);
>>> >
>>> > @@ -155,13 +156,15 @@ static int tcp_write_timeout(struct sock *sk)
>>> >                        const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
>>> >
>>> >                        retry_until = tcp_orphan_retries(sk, alive);
>>> > +                       do_reset = alive ||
>>> > +                                  !retransmits_timed_out(sk, retry_until);
>>> >
>>> > -                       if (tcp_out_of_resources(sk, alive || icsk->icsk_retransmits < retry_until))
>>> > +                       if (tcp_out_of_resources(sk, do_reset))
>>> >                                return 1;
>>> >                }
>>> >        }
>>> >
>>> > -       if (icsk->icsk_retransmits >= retry_until) {
>>> > +       if (retransmits_timed_out(sk, retry_until)) {
>>> >                /* Has it gone just too far? */
>>> >                tcp_write_err(sk);
>>> >                return 1;
>>> > @@ -385,7 +388,7 @@ void tcp_retransmit_timer(struct sock *sk)
>>> >  out_reset_timer:
>>> >        icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>>> >        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
>>> > -       if (icsk->icsk_retransmits > sysctl_tcp_retries1)
>>> > +       if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
>>> >                __sk_dst_reset(sk);
>>> >
>>> >  out:;
>>> > --
>>> > 1.6.3.3
>>> >
>>> > --
>>> > To unsubscribe from this list: send the line "unsubscribe netdev" in
>>> > the body of a message to majordomo@...r.kernel.org
>>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> >
>>
>>
