netdev - Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <D42294A7.1FB0A%brakmo@fb.com>
Date:   Tue, 11 Oct 2016 21:08:03 +0000
From:   Lawrence Brakmo <brakmo@...com>
To:     Yuchung Cheng <ycheng@...gle.com>
CC:     netdev <netdev@...r.kernel.org>, Kernel Team <Kernel-team@...com>,
        "Eric Dumazet" <eric.dumazet@...il.com>,
        Neal Cardwell <ncardwell@...gle.com>
Subject: Re: [PATCH net-next] tcp: Change txhash on some non-RTO retransmits

Yuchung, thank you for your comments. Responses inline.

On 10/11/16, 12:49 PM, "Yuchung Cheng" <ycheng@...gle.com> wrote:

>On Mon, Oct 10, 2016 at 5:18 PM, Lawrence Brakmo <brakmo@...com> wrote:
>>
>> The purpose of this patch is to help balance flows across paths. A new
>> sysctl "tcp_retrans_txhash_prob" specifies the probability (0-100) that
>> the txhash (IPv6 flowlabel) will be changed after a non-RTO retransmit.
>> A probability is used in order to control how many flows are moved
>> during a congestion event and prevent the congested path from becoming
>> under utilized (which could occur if too many flows leave the current
>> path). Txhash changes may be delayed in order to decrease the likelihood
>> that it will trigger retransmists due to too much reordering.
>>
>> Another sysctl "tcp_retrans_txhash_mode" determines the behavior after
>> RTOs. If the sysctl is 0, then after an RTO, only RTOs can trigger
>> txhash changes. The idea is to decrease the likelihood of going back
>> to a broken path. That is, we don't want flow balancing to trigger
>> changes to broken paths. The drawback is that flow balancing does
>> not work as well. If the sysctl is greater than 1, then we always
>> do flow balancing, even after RTOs.
>>
>> Tested with packedrill tests (for correctness) and performance
>> experiments with 2 and 3 paths. Performance experiments looked at
>> aggregate goodput and fairness. For each run, we looked at the ratio of
>> the goodputs for the fastest and slowest flows. These were averaged for
>> all the runs. A fairness of 1 means all flows had the same goodput, a
>> fairness of 2 means the fastest flow was twice as fast as the slowest
>> flow.
>>
>> The setup for the performance experiments was 4 or 5 serves in a rack,
>> 10G links. I tested various probabilities, but 20 seemed to have the
>> best tradeoff for my setup (small RTTs).
>>
>>                       --- node1 -----
>>     sender --- switch --- node2 ----- switch ---- receiver
>>                       --- node3 -----
>>
>> Scenario 1: One sender sends to one receiver through 2 routes (node1 or
>> node 2). The output from node1 and node2 is 1G (1gbit/sec). With only 2
>> flows, without flow balancing (prob=0) the average goodput is 1.6G vs.
>> 1.9G with flow balancing due to 2 flows ending up in one link and either
>> not moving and taking some time to move. Fairness was 1 in all cases.
>> For 7 flows, goodput was 1.9G for all, but fairness was 1.5, 1.4 or 1.2
>> for prob=0, prob=20,mode=0 and prob=20,mode=1 respectively. That is,
>> flow balancing increased fairness.
>>
>> Scenario 2: One sender to one receiver, through 3 routes (node1,...
>> node2). With 6 or 16 flows the goodput was the same for all, but
>> fairness was 1.8, 1.5 and 1.2 respectively. Interestingly, the worst
>> case fairness out of 10 runs were 2.2, 1.8 and 1.4 repectively. That is,
>> prob=20,mode=1 improved average and worst case fairness.
>I am wondering if we can build better API with routing layer to
>implement this type of feature, instead of creeping the tx_rehashing
>logic scatter in TCP. For example, we call dst_negative_advice on TCP
>write timeouts.

Not sure. The route is not necessarily bad, may be temporarily congested
or they may all be congested. If all we want to do is change the txhash
(unlike dst_negative_advice), then calling a tx_rehashing function may
be the appropriate call.
 
>
>On the patch itself, it seems aggressive to (attempt to) rehash every
>post-RTO retranmission. Also you can just use ca_state (==CA_Loss) to
>identify post-RTO retransmission directly.

Thanks, I will add the test.

>
>is this an implementation of the Flow Bender ?
>https://urldefense.proofpoint.com/v2/url?u=http-3A__dl.acm.org_citation.cf
>m-3Fid-3D2674985&d=DQIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=pq_Mqvzfy-C8ltkgyx1u_
>g&m=Q4nONH7kQ5AvQguw9UxpcHd79jfdDdrXj1YSJs7Ezhk&s=MA4fWBLMTGgRS0eGvBjxf7BJ
>Ol3-oxAzZDEYUG4cE-s&e=

Part of flow bender, although there are also some similarities to flowlet
switching.

>
>>
>> Scenario 3: One sender to one receiver, 2 routes, one route drops 50% of
>> the packets. With 7 flows, goodput was the same 1.1G, but fairness was
>> 1.8, 2.0 and 2.1 respectively. That is, if there is a bad route, then
>> balancing, which does more re-routes, is less fair.
>>
>> Signed-off-by: Lawrence Brakmo <brakmo@...com>
>> ---
>>  Documentation/networking/ip-sysctl.txt | 15 +++++++++++++++
>>  include/linux/tcp.h                    |  4 +++-
>>  include/net/tcp.h                      |  2 ++
>>  net/ipv4/sysctl_net_ipv4.c             | 18 ++++++++++++++++++
>>  net/ipv4/tcp_input.c                   | 10 ++++++++++
>>  net/ipv4/tcp_output.c                  | 23 ++++++++++++++++++++++-
>>  net/ipv4/tcp_timer.c                   |  4 ++++
>>  7 files changed, 74 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/networking/ip-sysctl.txt
>>b/Documentation/networking/ip-sysctl.txt
>> index 3db8c67..87a984c 100644
>> --- a/Documentation/networking/ip-sysctl.txt
>> +++ b/Documentation/networking/ip-sysctl.txt
>> @@ -472,6 +472,21 @@ tcp_max_reordering - INTEGER
>>         if paths are using per packet load balancing (like bonding rr
>>mode)
>>         Default: 300
>>
>> +tcp_retrans_txhash_mode - INTEGER
>> +       If zero, disable txhash recalculation due to non-RTO
>>retransmissions
>> +       after an RTO. The idea is that broken paths will trigger an RTO
>>and
>> +       we don't want going back to that path due to standard
>>retransmissons
>> +       (flow balancing). The drawback is that balancing is less robust.
>> +       If greater than zero, can always (probabilistically) recalculate
>> +       txhash after non-RTO retransmissions.
>> +
>> +tcp_retrans_txhash_prob - INTEGER
>> +       Probability [0 to 100] that we will recalculate txhash when a
>> +       packet is resent not due to RTO (for RTO txhash is always
>>recalculated).
>> +       The recalculation of the txhash may be delayed to decrease the
>> +       likelihood that reordering will trigger retransmissons.
>> +       The purpose is to help balance the flows among the possible
>>paths.
>> +
>>  tcp_retrans_collapse - BOOLEAN
>>         Bug-to-bug compatibility with some broken printers.
>>         On retransmit try to send bigger packets to work around bugs in
>> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
>> index a17ae7b..e0e3b7d 100644
>> --- a/include/linux/tcp.h
>> +++ b/include/linux/tcp.h
>> @@ -214,7 +214,9 @@ struct tcp_sock {
>>         } rack;
>>         u16     advmss;         /* Advertised MSS
>>*/
>>         u8      rate_app_limited:1,  /* rate_{delivered,interval_us}
>>limited? */
>> -               unused:7;
>> +               txhash_rto:1,   /* If set, don't do flow balancing
>>*/
>> +               txhash_want:1,  /* We want to change txhash when safe
>>*/
>> +               unused:5;
>>         u8      nonagle     : 4,/* Disable Nagle algorithm?
>>*/
>>                 thin_lto    : 1,/* Use linear timeouts for thin streams
>>*/
>>                 thin_dupack : 1,/* Fast retransmit on first dupack
>>*/
>> diff --git a/include/net/tcp.h b/include/net/tcp.h
>> index f83b7f2..3abd304 100644
>> --- a/include/net/tcp.h
>> +++ b/include/net/tcp.h
>> @@ -271,6 +271,8 @@ extern int sysctl_tcp_autocorking;
>>  extern int sysctl_tcp_invalid_ratelimit;
>>  extern int sysctl_tcp_pacing_ss_ratio;
>>  extern int sysctl_tcp_pacing_ca_ratio;
>> +extern int sysctl_tcp_retrans_txhash_prob;
>> +extern int sysctl_tcp_retrans_txhash_mode;
>>
>>  extern atomic_long_t tcp_memory_allocated;
>>  extern struct percpu_counter tcp_sockets_allocated;
>> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
>> index 1cb67de..00d6f26 100644
>> --- a/net/ipv4/sysctl_net_ipv4.c
>> +++ b/net/ipv4/sysctl_net_ipv4.c
>> @@ -28,6 +28,7 @@
>>  static int zero;
>>  static int one = 1;
>>  static int four = 4;
>> +static int hundred = 100;
>>  static int thousand = 1000;
>>  static int gso_max_segs = GSO_MAX_SEGS;
>>  static int tcp_retr1_max = 255;
>> @@ -624,6 +625,23 @@ static struct ctl_table ipv4_table[] = {
>>                 .proc_handler   = proc_dointvec_ms_jiffies,
>>         },
>>         {
>> +               .procname       = "tcp_retrans_txhash_prob",
>> +               .data           = &sysctl_tcp_retrans_txhash_prob,
>> +               .maxlen         = sizeof(int),
>> +               .mode           = 0644,
>> +               .proc_handler   = proc_dointvec_minmax,
>> +               .extra1         = &zero,
>> +               .extra2         = &hundred,
>> +       },
>> +       {
>> +               .procname       = "tcp_retrans_txhash_mode",
>> +               .data           = &sysctl_tcp_retrans_txhash_mode,
>> +               .maxlen         = sizeof(int),
>> +               .mode           = 0644,
>> +               .proc_handler   = proc_dointvec_minmax,
>> +               .extra1         = &zero,
>> +       },
>> +       {
>>                 .procname       = "icmp_msgs_per_sec",
>>                 .data           = &sysctl_icmp_msgs_per_sec,
>>                 .maxlen         = sizeof(int),
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index a27b9c0..fed5366 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -101,6 +101,9 @@ int sysctl_tcp_moderate_rcvbuf __read_mostly = 1;
>>  int sysctl_tcp_early_retrans __read_mostly = 3;
>>  int sysctl_tcp_invalid_ratelimit __read_mostly = HZ/2;
>>
>> +int sysctl_tcp_retrans_txhash_prob __read_mostly;
>> +int sysctl_tcp_retrans_txhash_mode __read_mostly;
>> +
>>  #define FLAG_DATA              0x01 /* Incoming frame contained data.
>>        */
>>  #define FLAG_WIN_UPDATE                0x02 /* Incoming ACK was a
>>window update.       */
>>  #define FLAG_DATA_ACKED                0x04 /* This ACK acknowledged
>>new data.         */
>> @@ -3674,6 +3677,13 @@ static int tcp_ack(struct sock *sk, const struct
>>sk_buff *skb, int flag)
>>         flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una,
>>&acked,
>>                                     &sack_state, &now);
>>
>> +       /* Check if we should set txhash (would not cause reordering) */
>> +       if (tp->txhash_want &&
>> +           (tp->packets_out - tp->sacked_out) < tp->reordering) {
>> +               sk_set_txhash(sk);
>> +               tp->txhash_want = 0;
>> +       }
>> +
>>         if (tcp_ack_is_dubious(sk, flag)) {
>>                 is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED |
>>FLAG_NOT_DUP));
>>                 tcp_fastretrans_alert(sk, acked, is_dupack, &flag,
>>&rexmit);
>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>> index 896e9df..58490ac 100644
>> --- a/net/ipv4/tcp_output.c
>> +++ b/net/ipv4/tcp_output.c
>> @@ -2738,9 +2738,30 @@ int tcp_retransmit_skb(struct sock *sk, struct
>>sk_buff *skb, int segs)
>>                 tp->retrans_out += tcp_skb_pcount(skb);
>>
>>                 /* Save stamp of the first retransmit. */
>> -               if (!tp->retrans_stamp)
>> +               if (!tp->retrans_stamp) {
>>                         tp->retrans_stamp = tcp_skb_timestamp(skb);
>>
>> +                       /* Determine if we should reset hash, only done
>>once
>> +                        * per recovery
>> +                        */
>> +                       if ((!tp->txhash_rto ||
>> +                            sysctl_tcp_retrans_txhash_mode > 0) &&
>> +                           sk->sk_txhash &&
>> +                           (prandom_u32_max(100) <
>> +                            sysctl_tcp_retrans_txhash_prob)) {
>> +                               /* If not too much reordering, or RTT is
>> +                                * small enough that we don't care about
>> +                                * reordering, then change it now.
>> +                                * Else, wait until it is safe.
>> +                                */
>> +                               if ((tp->packets_out - tp->sacked_out) <
>> +                                   tp->reordering)
>I don't parse this logic ... suppose reordering is 100 (not uncommon
>today due to the last packet being delivered slightly earlier than the
>rest), and cwnd==packets_out =~200,we only want to rehash until half
>of the packets are sacked, so we are still rehashing even when
>reordering is heavy?

In your scenario, there would be no re-hashing until sacked_out is 101
((packets_out - sacked_out) < 100). This code would mark txhash_want.
Then when an ACK is received and the conditional is true, txhash
would be changed.

Now, the test does not prevent retransmissions due to reordering in all
cases, but hopefully in most. I will also add the test you recommended,
checking for CA_Loss, to prevent too much re-hashing.

>
>also where do we check RTT is small?

The RTT comment is left over from a previous version, I will remove it.

>
>> +                                       sk_set_txhash(sk);
>> +                               else
>> +                                       tp->txhash_want = 1;
>> +                       }
>> +               }
>> +
>>         } else if (err != -EBUSY) {
>>                 NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPRETRANSFAIL);
>>         }
>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> index 3ea1cf8..e66baad 100644
>> --- a/net/ipv4/tcp_timer.c
>> +++ b/net/ipv4/tcp_timer.c
>> @@ -186,6 +186,8 @@ static int tcp_write_timeout(struct sock *sk)
>>
>>         if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
>>                 if (icsk->icsk_retransmits) {
>> +                       tp->txhash_rto = 1;
>> +                       tp->txhash_want = 0;
>>                         dst_negative_advice(sk);
>>                         if (tp->syn_fastopen || tp->syn_data)
>>                                 tcp_fastopen_cache_set(sk, 0, NULL,
>>true, 0);
>> @@ -218,6 +220,8 @@ static int tcp_write_timeout(struct sock *sk)
>>                 } else {
>>                         sk_rethink_txhash(sk);
>>                 }
>> +               tp->txhash_rto = 1;
>> +               tp->txhash_want = 0;
>>
>>                 retry_until = net->ipv4.sysctl_tcp_retries2;
>>                 if (sock_flag(sk, SOCK_DEAD)) {
>> --
>> 2.9.3
>>