Date:	Mon, 14 Mar 2016 14:54:47 -0700
From:	Yuchung Cheng <ycheng@...gle.com>
To:	Bendik Rønning Opstad <bro.devel@...il.com>
Cc:	"David S. Miller" <davem@...emloft.net>,
	netdev <netdev@...r.kernel.org>,
	Eric Dumazet <eric.dumazet@...il.com>,
	Neal Cardwell <ncardwell@...gle.com>,
	Andreas Petlund <apetlund@...ula.no>,
	Carsten Griwodz <griff@...ula.no>,
	Pål Halvorsen <paalh@...ula.no>,
	Jonas Markussen <jonassm@....uio.no>,
	Kristian Evensen <kristian.evensen@...il.com>,
	Kenneth Klette Jonassen <kennetkl@....uio.no>
Subject: Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)

On Thu, Mar 3, 2016 at 10:06 AM, Bendik Rønning Opstad
<bro.devel@...il.com> wrote:
>
> Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
> the latency for applications sending time-dependent data.
>
> Latency-sensitive applications or services, such as online games,
> remote control systems, and VoIP, produce traffic with thin-stream
> characteristics, characterized by small packets and relatively high
> inter-transmission times (ITT). When experiencing packet loss, such
> latency-sensitive applications are heavily penalized by the need to
> retransmit lost packets, which increases the latency by a minimum of
> one RTT for the lost packet. Packets coming after a lost packet are
> held back due to head-of-line blocking, causing increased delays for
> all data segments until the lost packet has been retransmitted.
>
> RDB enables a TCP sender to bundle redundant (already sent) data with
> TCP packets containing small segments of new data. By resending
> un-ACKed data from the output queue in packets with new data, RDB
> reduces the need to retransmit data segments on connections
> experiencing sporadic packet loss. By avoiding a retransmit, RDB
> evades the latency increase of at least one RTT for the lost packet,
> as well as alleviating head-of-line blocking for the packets following
> the lost packet. This makes the TCP connection more resistant to
> latency fluctuations, and reduces the application layer latency
> significantly in lossy environments.
>
> Main functionality added:
>
>   o When a packet is scheduled for transmission, RDB builds and
>     transmits a new SKB containing both the unsent data as well as
>     data of previously sent packets from the TCP output queue.
>
>   o RDB will only be used for streams classified as thin by the
>     function tcp_stream_is_thin_dpifl(). This enforces a lower bound
>     on the ITT for streams that may benefit from RDB, controlled by
>     the sysctl variable net.ipv4.tcp_thin_dpifl_itt_lower_bound.
>
>   o Loss detection of hidden loss events: When bundling redundant data
>     with each packet, packet loss can be hidden from the TCP engine due
>     to lack of dupACKs. This is because the loss is "repaired" by the
>     redundant data in the packet coming after the lost packet. Based on
>     incoming ACKs, such hidden loss events are detected, and CWR state
>     is entered.
>
> RDB can be enabled on a connection with the socket option TCP_RDB, or
> on all new connections by setting the sysctl variable
> net.ipv4.tcp_rdb=1
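(For reference, enabling it per socket from userspace would look
roughly like this; illustrative only, assuming a connected TCP socket
fd and the updated uapi header that defines TCP_RDB:)

	#include <netinet/tcp.h>

	int one = 1;
	setsockopt(fd, IPPROTO_TCP, TCP_RDB, &one, sizeof(one));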
>
> Cc: Andreas Petlund <apetlund@...ula.no>
> Cc: Carsten Griwodz <griff@...ula.no>
> Cc: Pål Halvorsen <paalh@...ula.no>
> Cc: Jonas Markussen <jonassm@....uio.no>
> Cc: Kristian Evensen <kristian.evensen@...il.com>
> Cc: Kenneth Klette Jonassen <kennetkl@....uio.no>
> Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@...il.com>
> ---
>  Documentation/networking/ip-sysctl.txt |  15 +++
>  include/linux/skbuff.h                 |   1 +
>  include/linux/tcp.h                    |   3 +-
>  include/net/tcp.h                      |  15 +++
>  include/uapi/linux/tcp.h               |   1 +
>  net/core/skbuff.c                      |   2 +-
>  net/ipv4/Makefile                      |   3 +-
>  net/ipv4/sysctl_net_ipv4.c             |  25 ++++
>  net/ipv4/tcp.c                         |  14 +-
>  net/ipv4/tcp_input.c                   |   3 +
>  net/ipv4/tcp_output.c                  |  48 ++++---
>  net/ipv4/tcp_rdb.c                     | 228 +++++++++++++++++++++++++++++++++
>  12 files changed, 335 insertions(+), 23 deletions(-)
>  create mode 100644 net/ipv4/tcp_rdb.c
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 6a92b15..8f3f3bf 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
>         calculated, which is used to classify whether a stream is thin.
>         Default: 10000
>
> +tcp_rdb - BOOLEAN
> +       Enable RDB for all new TCP connections.
 Please describe RDB briefly, perhaps with a pointer to your paper.
 I suggest having three levels of control:
   0: disable RDB completely
   1: allow individual thin-stream connections to enable RDB via the
      TCP_RDB socket option
   2: enable RDB on all thin-stream connections by default

 Currently it only provides modes 1 and 2, but there may be cases where
 the administrator wants to disallow it entirely (e.g., broken
 middleboxes). A rough sketch of the idea is below.
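 Something like this, as a rough untested sketch against the fields
 this patch already adds (sysctl_tcp_rdb and tp->rdb), with the sysctl
 entry's extra2 bumped so it accepts 0-2:

	/* tcp_init_sock(): only mode 2 turns RDB on by default */
	tp->rdb = (sysctl_tcp_rdb == 2);

	/* do_tcp_setsockopt(): mode 0 disallows per-socket opt-in */
	case TCP_RDB:
		if (val < 0 || val > 1)
			err = -EINVAL;
		else if (!sysctl_tcp_rdb)
			err = -EPERM;
		else
			tp->rdb = val;
		break;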

> +       Default: 0
> +
> +tcp_rdb_max_bytes - INTEGER
> +       Enable restriction on how many bytes an RDB packet can contain.
> +       This is the total amount of payload including the new unsent data.
> +       Default: 0
> +
> +tcp_rdb_max_packets - INTEGER
> +       Enable restriction on how many previous packets in the output queue
> +       RDB may include data from. A value of 1 will restrict bundling to
> +       only the data from the last packet that was sent.
> +       Default: 1
 Why use two metrics for redundancy? It also seems better to let an
 individual socket select its redundancy level (e.g., setsockopt
 TCP_RDB=3 meaning at most 3 packets per bundle) rather than a global
 setting. This requires more bits in tcp_sock, but 2-3 more should
 suffice; see the sketch below.
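 For example (illustrative only; the rdb_max_pkts field name is
 hypothetical):

	/* struct tcp_sock: take a few bits from the spare "unused" byte */
	u8	rdb_max_pkts : 3;	/* per-socket bundle limit, 0 = use sysctl */

	/* do_tcp_setsockopt() */
	case TCP_RDB:
		if (val < 0 || val > 7)
			err = -EINVAL;
		else {
			tp->rdb = !!val;
			tp->rdb_max_pkts = val;
		}
		break;

 rdb_can_bundle_test() could then prefer tp->rdb_max_pkts over
 sysctl_tcp_rdb_max_packets whenever it is non-zero.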

> +
>  tcp_limit_output_bytes - INTEGER
>         Controls TCP Small Queue limit per tcp socket.
>         TCP bulk sender tends to increase packets in flight until it
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 797cefb..0f2c9d1 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -2927,6 +2927,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm);
>  void skb_free_datagram(struct sock *sk, struct sk_buff *skb);
>  void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb);
>  int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags);
> +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
>  int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len);
>  int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len);
>  __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index bcbf51d..c84de15 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -207,9 +207,10 @@ struct tcp_sock {
>         } rack;
>         u16     advmss;         /* Advertised MSS                       */
>         u8      unused;
> -       u8      nonagle     : 4,/* Disable Nagle algorithm?             */
> +       u8      nonagle     : 3,/* Disable Nagle algorithm?             */
>                 thin_lto    : 1,/* Use linear timeouts for thin streams */
>                 thin_dupack : 1,/* Fast retransmit on first dupack      */
> +               rdb         : 1,/* Redundant Data Bundling enabled      */
>                 repair      : 1,
>                 frto        : 1;/* F-RTO (RFC5682) activated in CA_Loss */
>         u8      repair_queue;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index d38eae9..2d42f4a 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -267,6 +267,9 @@ extern int sysctl_tcp_slow_start_after_idle;
>  extern int sysctl_tcp_thin_linear_timeouts;
>  extern int sysctl_tcp_thin_dupack;
>  extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
> +extern int sysctl_tcp_rdb;
> +extern int sysctl_tcp_rdb_max_bytes;
> +extern int sysctl_tcp_rdb_max_packets;
>  extern int sysctl_tcp_early_retrans;
>  extern int sysctl_tcp_limit_output_bytes;
>  extern int sysctl_tcp_challenge_ack_limit;
> @@ -539,6 +542,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
>  bool tcp_may_send_now(struct sock *sk);
>  int __tcp_retransmit_skb(struct sock *, struct sk_buff *);
>  int tcp_retransmit_skb(struct sock *, struct sk_buff *);
> +int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
> +                    gfp_t gfp_mask);
>  void tcp_retransmit_timer(struct sock *sk);
>  void tcp_xmit_retransmit_queue(struct sock *);
>  void tcp_simple_retransmit(struct sock *);
> @@ -556,6 +561,7 @@ void tcp_send_ack(struct sock *sk);
>  void tcp_send_delayed_ack(struct sock *sk);
>  void tcp_send_loss_probe(struct sock *sk);
>  bool tcp_schedule_loss_probe(struct sock *sk);
> +void tcp_skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb);
>
>  /* tcp_input.c */
>  void tcp_resume_early_retransmit(struct sock *sk);
> @@ -565,6 +571,11 @@ void tcp_reset(struct sock *sk);
>  void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
>  void tcp_fin(struct sock *sk);
>
> +/* tcp_rdb.c */
> +void tcp_rdb_ack_event(struct sock *sk, u32 flags);
> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
> +                        unsigned int mss_now, gfp_t gfp_mask);
> +
>  /* tcp_timer.c */
>  void tcp_init_xmit_timers(struct sock *);
>  static inline void tcp_clear_xmit_timers(struct sock *sk)
> @@ -763,6 +774,7 @@ struct tcp_skb_cb {
>         union {
>                 struct {
>                         /* There is space for up to 20 bytes */
> +                       __u32 rdb_start_seq; /* Start seq of rdb data */
>                 } tx;   /* only used for outgoing skbs */
>                 union {
>                         struct inet_skb_parm    h4;
> @@ -1497,6 +1509,9 @@ static inline struct sk_buff *tcp_write_queue_prev(const struct sock *sk,
>  #define tcp_for_write_queue_from_safe(skb, tmp, sk)                    \
>         skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
>
> +#define tcp_for_write_queue_reverse_from_safe(skb, tmp, sk)            \
> +       skb_queue_reverse_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
> +
>  static inline struct sk_buff *tcp_send_head(const struct sock *sk)
>  {
>         return sk->sk_send_head;
> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index fe95446..6799875 100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -115,6 +115,7 @@ enum {
>  #define TCP_CC_INFO            26      /* Get Congestion Control (optional) info */
>  #define TCP_SAVE_SYN           27      /* Record SYN headers for new connections */
>  #define TCP_SAVED_SYN          28      /* Get SYN headers recorded for connection */
> +#define TCP_RDB                        29      /* Enable Redundant Data Bundling mechanism */
>
>  struct tcp_repair_opt {
>         __u32   opt_code;
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 7af7ec6..50bc5b0 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1055,7 +1055,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
>         skb->inner_mac_header += off;
>  }
>
> -static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
> +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
>  {
>         __copy_skb_header(new, old);
>
> diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
> index bfa1336..459048c 100644
> --- a/net/ipv4/Makefile
> +++ b/net/ipv4/Makefile
> @@ -12,7 +12,8 @@ obj-y     := route.o inetpeer.o protocol.o \
>              tcp_offload.o datagram.o raw.o udp.o udplite.o \
>              udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
>              fib_frontend.o fib_semantics.o fib_trie.o \
> -            inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o
> +            inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
> +            tcp_rdb.o
>
>  obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
>  obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index f04320a..43b4390 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -581,6 +581,31 @@ static struct ctl_table ipv4_table[] = {
>                 .extra1         = &tcp_thin_dpifl_itt_lower_bound_min,
>         },
>         {
> +               .procname       = "tcp_rdb",
> +               .data           = &sysctl_tcp_rdb,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,
> +               .extra1         = &zero,
> +               .extra2         = &one,
> +       },
> +       {
> +               .procname       = "tcp_rdb_max_bytes",
> +               .data           = &sysctl_tcp_rdb_max_bytes,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,
> +               .extra1         = &zero,
> +       },
> +       {
> +               .procname       = "tcp_rdb_max_packets",
> +               .data           = &sysctl_tcp_rdb_max_packets,
> +               .maxlen         = sizeof(int),
> +               .mode           = 0644,
> +               .proc_handler   = proc_dointvec_minmax,
> +               .extra1         = &zero,
> +       },
> +       {
>                 .procname       = "tcp_early_retrans",
>                 .data           = &sysctl_tcp_early_retrans,
>                 .maxlen         = sizeof(int),
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 8421f3d..b53d4cb 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -288,6 +288,8 @@ int sysctl_tcp_autocorking __read_mostly = 1;
>
>  int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
>
> +int sysctl_tcp_rdb __read_mostly;
> +
>  struct percpu_counter tcp_orphan_count;
>  EXPORT_SYMBOL_GPL(tcp_orphan_count);
>
> @@ -407,6 +409,7 @@ void tcp_init_sock(struct sock *sk)
>         u64_stats_init(&tp->syncp);
>
>         tp->reordering = sock_net(sk)->ipv4.sysctl_tcp_reordering;
> +       tp->rdb = sysctl_tcp_rdb;
>         tcp_enable_early_retrans(tp);
>         tcp_assign_congestion_control(sk);
>
> @@ -2412,6 +2415,13 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>                 }
>                 break;
>
> +       case TCP_RDB:
> +               if (val < 0 || val > 1)
> +                       err = -EINVAL;
> +               else
> +                       tp->rdb = val;
> +               break;
> +
>         case TCP_REPAIR:
>                 if (!tcp_can_repair_sock(sk))
>                         err = -EPERM;
> @@ -2842,7 +2852,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
>         case TCP_THIN_DUPACK:
>                 val = tp->thin_dupack;
>                 break;
> -
> +       case TCP_RDB:
> +               val = tp->rdb;
> +               break;
>         case TCP_REPAIR:
>                 val = tp->repair;
>                 break;
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index e6e65f7..7b52ce4 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3537,6 +3537,9 @@ static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
>
>         if (icsk->icsk_ca_ops->in_ack_event)
>                 icsk->icsk_ca_ops->in_ack_event(sk, flags);
> +
> +       if (unlikely(tcp_sk(sk)->rdb))
> +               tcp_rdb_ack_event(sk, flags);
>  }
>
>  /* Congestion control has updated the cwnd already. So if we're in
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 7d2c7a4..6f92fae 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -897,8 +897,8 @@ out:
>   * We are working here with either a clone of the original
>   * SKB, or a fresh unique copy made by the retransmit engine.
>   */
> -static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
> -                           gfp_t gfp_mask)
> +int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
> +                    gfp_t gfp_mask)
>  {
>         const struct inet_connection_sock *icsk = inet_csk(sk);
>         struct inet_sock *inet;
> @@ -2110,9 +2110,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
>                                 break;
>                 }
>
> -               if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
> +               if (unlikely(tcp_sk(sk)->rdb)) {
> +                       if (tcp_transmit_rdb_skb(sk, skb, mss_now, gfp))
> +                               break;
> +               } else if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) {
>                         break;
> -
> +               }
>  repair:
>                 /* Advance the send_head.  This one is sent out.
>                  * This call will increment packets_out.
> @@ -2439,15 +2442,32 @@ u32 __tcp_select_window(struct sock *sk)
>         return window;
>  }
>
> +/**
> + * tcp_skb_append_data() - copy the linear data from an SKB to the end
> + *                         of another and update end sequence number
> + *                         and checksum
> + * @from_skb: the SKB to copy data from
> + * @to_skb: the SKB to copy data to
> + */
> +void tcp_skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb)
> +{
> +       skb_copy_from_linear_data(from_skb, skb_put(to_skb, from_skb->len),
> +                                 from_skb->len);
> +       TCP_SKB_CB(to_skb)->end_seq = TCP_SKB_CB(from_skb)->end_seq;
> +
> +       if (from_skb->ip_summed == CHECKSUM_PARTIAL)
> +               to_skb->ip_summed = CHECKSUM_PARTIAL;
> +
> +       if (to_skb->ip_summed != CHECKSUM_PARTIAL)
> +               to_skb->csum = csum_block_add(to_skb->csum, from_skb->csum,
> +                                             to_skb->len);
> +}
> +
>  /* Collapses two adjacent SKB's during retransmission. */
>  static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
>  {
>         struct tcp_sock *tp = tcp_sk(sk);
>         struct sk_buff *next_skb = tcp_write_queue_next(sk, skb);
> -       int skb_size, next_skb_size;
> -
> -       skb_size = skb->len;
> -       next_skb_size = next_skb->len;
>
>         BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1);
>
> @@ -2455,17 +2475,7 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
>
>         tcp_unlink_write_queue(next_skb, sk);
>
> -       skb_copy_from_linear_data(next_skb, skb_put(skb, next_skb_size),
> -                                 next_skb_size);
> -
> -       if (next_skb->ip_summed == CHECKSUM_PARTIAL)
> -               skb->ip_summed = CHECKSUM_PARTIAL;
> -
> -       if (skb->ip_summed != CHECKSUM_PARTIAL)
> -               skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size);
> -
> -       /* Update sequence range on original skb. */
> -       TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq;
> +       tcp_skb_append_data(next_skb, skb);
>
>         /* Merge over control information. This moves PSH/FIN etc. over */
>         TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(next_skb)->tcp_flags;
> diff --git a/net/ipv4/tcp_rdb.c b/net/ipv4/tcp_rdb.c
> new file mode 100644
> index 0000000..2b37957
> --- /dev/null
> +++ b/net/ipv4/tcp_rdb.c
> @@ -0,0 +1,228 @@
> +#include <linux/skbuff.h>
> +#include <net/tcp.h>
> +
> +int sysctl_tcp_rdb_max_bytes __read_mostly;
> +int sysctl_tcp_rdb_max_packets __read_mostly = 1;
> +
> +/**
> + * rdb_detect_loss() - perform RDB loss detection by analysing ACKs
> + * @sk: socket
> + *
> + * Traverse the output queue and check if the ACKed packet is an RDB
> + * packet and if the redundant data covers one or more un-ACKed SKBs.
> + * If the incoming ACK acknowledges multiple SKBs, we can presume
> + * packet loss has occurred.
> + *
> + * We can infer packet loss this way because we can expect one ACK per
> + * transmitted data packet, as delayed ACKs are disabled when a host
> + * receives packets where the sequence number is not the expected
> + * sequence number.
> + *
> + * Return: The number of packets that are presumed to be lost
> + */
> +static unsigned int rdb_detect_loss(struct sock *sk)
> +{
> +       struct sk_buff *skb, *tmp;
> +       struct tcp_skb_cb *scb;
> +       u32 seq_acked = tcp_sk(sk)->snd_una;
> +       unsigned int packets_lost = 0;
> +
> +       tcp_for_write_queue(skb, sk) {
> +               if (skb == tcp_send_head(sk))
> +                       break;
> +
> +               scb = TCP_SKB_CB(skb);
> +               /* The ACK acknowledges parts of the data in this SKB.
> +                * Can be caused by:
> +                * - TSO: We abort as RDB is not used on SKBs split across
> +                *        multiple packets on lower layers as these are greater
> +                *        than one MSS.
> +                * - Retrans collapse: We've had a retrans, so loss has already
> +                *                     been detected.
> +                */
> +               if (after(scb->end_seq, seq_acked))
> +                       break;
> +               else if (scb->end_seq != seq_acked)
> +                       continue;
> +
> +               /* We have found the ACKed packet */
> +
> +               /* This packet was sent with no redundant data, or no prior
> +                * un-ACKed SKBs is in the output queue, so break here.
> +                */
> +               if (scb->tx.rdb_start_seq == scb->seq ||
> +                   skb_queue_is_first(&sk->sk_write_queue, skb))
> +                       break;
> +               /* Find number of prior SKBs whose data was bundled in this
> +                * (ACKed) SKB. We presume any redundant data covering previous
> +                * SKB's are due to loss. (An exception would be reordering).
> +                */
> +               skb = skb->prev;
> +               tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
> +                       if (before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq))
> +                               break;
> +                       packets_lost++;
Since we only care whether or not there was packet loss, could we
return early here?
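E.g. (rough sketch, if rdb_detect_loss() were changed to return bool
and tcp_rdb_ack_event() checked that instead):

	skb = skb->prev;
	tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
		if (before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq))
			break;
		/* redundant data covered an un-ACKed SKB: presume loss */
		return true;
	}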

> +               }
> +               break;
> +       }
> +       return packets_lost;
> +}
> +
> +/**
> + * tcp_rdb_ack_event() - initiate RDB loss detection
> + * @sk: socket
> + * @flags: flags
> + */
> +void tcp_rdb_ack_event(struct sock *sk, u32 flags)
flags are not used
> +{
> +       if (rdb_detect_loss(sk))
> +               tcp_enter_cwr(sk);
> +}
> +
> +/**
> + * rdb_build_skb() - build a new RDB SKB and copy redundant + unsent
> + *                   data to the linear page buffer
> + * @sk: socket
> + * @xmit_skb: the SKB processed for transmission in the output engine
> + * @first_skb: the first SKB in the output queue to be bundled
> + * @bytes_in_rdb_skb: the total number of data bytes for the new
> + *                    rdb_skb (NEW + Redundant)
> + * @gfp_mask: gfp_t allocation
> + *
> + * Return: A new SKB containing redundant data, or NULL if memory
> + *         allocation failed
> + */
> +static struct sk_buff *rdb_build_skb(const struct sock *sk,
> +                                    struct sk_buff *xmit_skb,
> +                                    struct sk_buff *first_skb,
> +                                    u32 bytes_in_rdb_skb,
> +                                    gfp_t gfp_mask)
> +{
> +       struct sk_buff *rdb_skb, *tmp_skb = first_skb;
> +
> +       rdb_skb = sk_stream_alloc_skb((struct sock *)sk,
> +                                     (int)bytes_in_rdb_skb,
> +                                     gfp_mask, false);
> +       if (!rdb_skb)
> +               return NULL;
> +       copy_skb_header(rdb_skb, xmit_skb);
> +       rdb_skb->ip_summed = xmit_skb->ip_summed;
> +       TCP_SKB_CB(rdb_skb)->seq = TCP_SKB_CB(first_skb)->seq;
> +       TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(rdb_skb)->seq;
> +
> +       /* Start on first_skb and append payload from each SKB in the output
> +        * queue onto rdb_skb until we reach xmit_skb.
> +        */
> +       tcp_for_write_queue_from(tmp_skb, sk) {
> +               tcp_skb_append_data(tmp_skb, rdb_skb);
> +
> +               /* We reached xmit_skb, containing the unsent data */
> +               if (tmp_skb == xmit_skb)
> +                       break;
> +       }
> +       return rdb_skb;
> +}
> +
> +/**
> + * rdb_can_bundle_test() - test if redundant data can be bundled
> + * @sk: socket
> + * @xmit_skb: the SKB processed for transmission by the output engine
> + * @max_payload: the maximum allowed payload bytes for the RDB SKB
> + * @bytes_in_rdb_skb: store the total number of payload bytes in the
> + *                    RDB SKB if bundling can be performed
> + *
> + * Traverse the output queue and check if any un-acked data may be
> + * bundled.
> + *
> + * Return: The first SKB to be in the bundle, or NULL if no bundling
> + */
> +static struct sk_buff *rdb_can_bundle_test(const struct sock *sk,
> +                                          struct sk_buff *xmit_skb,
> +                                          unsigned int max_payload,
> +                                          u32 *bytes_in_rdb_skb)
> +{
> +       struct sk_buff *first_to_bundle = NULL;
> +       struct sk_buff *tmp, *skb = xmit_skb->prev;
> +       u32 skbs_in_bundle_count = 1; /* Start on 1 to account for xmit_skb */
> +       u32 total_payload = xmit_skb->len;
> +
> +       if (sysctl_tcp_rdb_max_bytes)
> +               max_payload = min_t(unsigned int, max_payload,
> +                                   sysctl_tcp_rdb_max_bytes);
> +
> +       /* We start at xmit_skb->prev, and go backwards */
> +       tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
> +               /* Including data from this SKB would exceed payload limit */
> +               if ((total_payload + skb->len) > max_payload)
> +                       break;
> +
> +               if (sysctl_tcp_rdb_max_packets &&
> +                   (skbs_in_bundle_count > sysctl_tcp_rdb_max_packets))
> +                       break;
> +
> +               total_payload += skb->len;
> +               skbs_in_bundle_count++;
> +               first_to_bundle = skb;
> +       }
> +       *bytes_in_rdb_skb = total_payload;
> +       return first_to_bundle;
> +}
> +
> +/**
> + * tcp_transmit_rdb_skb() - try to create and send an RDB packet
> + * @sk: socket
> + * @xmit_skb: the SKB processed for transmission by the output engine
> + * @mss_now: current mss value
> + * @gfp_mask: gfp_t allocation
> + *
> + * If an RDB packet could not be created and sent, transmit the
> + * original unmodified SKB (xmit_skb).
> + *
> + * Return: 0 if successfully sent packet, else error from
> + *         tcp_transmit_skb
> + */
> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
> +                        unsigned int mss_now, gfp_t gfp_mask)
> +{
> +       struct sk_buff *rdb_skb = NULL;
> +       struct sk_buff *first_to_bundle;
> +       u32 bytes_in_rdb_skb = 0;
> +
> +       /* How we detect that RDB was used. When equal, no RDB data was sent */
> +       TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;

> +
> +       if (!tcp_stream_is_thin_dpifl(tcp_sk(sk)))
During loss recovery, TCP inflight fluctuates and is likely to make
this check pass even for connections that are not thin streams. Since
the loss has already occurred, RDB can only take advantage of limited
transmit, which the connection likely does not have (because it is a
thin stream). It might be worth also checking that the CA state is
Open.
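Something like (sketch only):

	if (!tcp_stream_is_thin_dpifl(tcp_sk(sk)) ||
	    inet_csk(sk)->icsk_ca_state != TCP_CA_Open)
		goto xmit_default;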

> +               goto xmit_default;
> +
> +       /* No bundling if first in queue, or on FIN packet */
> +       if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb) ||
> +           (TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN))
It seems there would still be a benefit in bundling packets up to and
including the FIN?

> +               goto xmit_default;
> +
> +       /* Find number of (previous) SKBs to get data from */
> +       first_to_bundle = rdb_can_bundle_test(sk, xmit_skb, mss_now,
> +                                             &bytes_in_rdb_skb);
> +       if (!first_to_bundle)
> +               goto xmit_default;
> +
> +       /* Create an SKB that contains redundant data starting from
> +        * first_to_bundle.
> +        */
> +       rdb_skb = rdb_build_skb(sk, xmit_skb, first_to_bundle,
> +                               bytes_in_rdb_skb, gfp_mask);
> +       if (!rdb_skb)
> +               goto xmit_default;
> +
> +       /* Set skb_mstamp for the SKB in the output queue (xmit_skb) containing
> +        * the yet unsent data. Normally this would be done by
> +        * tcp_transmit_skb(), but as we pass in rdb_skb instead, xmit_skb's
> +        * timestamp will not be touched.
> +        */
> +       skb_mstamp_get(&xmit_skb->skb_mstamp);
> +       rdb_skb->skb_mstamp = xmit_skb->skb_mstamp;
> +       return tcp_transmit_skb(sk, rdb_skb, 0, gfp_mask);
> +
> +xmit_default:
> +       /* Transmit the unmodified SKB from output queue */
> +       return tcp_transmit_skb(sk, xmit_skb, 1, gfp_mask);
> +}
> --
> 1.9.1
>

Since RDB will cause DSACKs, and we blindly count DSACKs to perform
CWND undo, how does RDB handle those false positives?
