Message-ID: <CAK6E8=eXH1HEXEWiAnUTT65i9=M=j1W1v3MQFSxvk2xF_TNLZg@mail.gmail.com>
Date: Mon, 14 Mar 2016 14:54:47 -0700
From: Yuchung Cheng <ycheng@...gle.com>
To: Bendik Rønning Opstad <bro.devel@...il.com>
Cc: "David S. Miller" <davem@...emloft.net>,
netdev <netdev@...r.kernel.org>,
Eric Dumazet <eric.dumazet@...il.com>,
Neal Cardwell <ncardwell@...gle.com>,
Andreas Petlund <apetlund@...ula.no>,
Carsten Griwodz <griff@...ula.no>,
Pål Halvorsen <paalh@...ula.no>,
Jonas Markussen <jonassm@....uio.no>,
Kristian Evensen <kristian.evensen@...il.com>,
Kenneth Klette Jonassen <kennetkl@....uio.no>
Subject: Re: [PATCH v6 net-next 2/2] tcp: Add Redundant Data Bundling (RDB)
On Thu, Mar 3, 2016 at 10:06 AM, Bendik Rønning Opstad
<bro.devel@...il.com> wrote:
>
> Redundant Data Bundling (RDB) is a mechanism for TCP aimed at reducing
> the latency for applications sending time-dependent data.
>
> Latency-sensitive applications or services, such as online games,
> remote control systems, and VoIP, produce traffic with thin-stream
> characteristics, characterized by small packets and relatively high
> inter-transmission times (ITT). When experiencing packet loss, such
> latency-sensitive applications are heavily penalized by the need to
> retransmit lost packets, which increases the latency by a minimum of
> one RTT for the lost packet. Packets coming after a lost packet are
> held back due to head-of-line blocking, causing increased delays for
> all data segments until the lost packet has been retransmitted.
>
> RDB enables a TCP sender to bundle redundant (already sent) data with
> TCP packets containing small segments of new data. By resending
> un-ACKed data from the output queue in packets with new data, RDB
> reduces the need to retransmit data segments on connections
> experiencing sporadic packet loss. By avoiding a retransmit, RDB
> evades the latency increase of at least one RTT for the lost packet,
> as well as alleviating head-of-line blocking for the packets following
> the lost packet. This makes the TCP connection more resistant to
> latency fluctuations, and reduces the application layer latency
> significantly in lossy environments.
>
> Main functionality added:
>
> o When a packet is scheduled for transmission, RDB builds and
> transmits a new SKB containing both the unsent data as well as
> data of previously sent packets from the TCP output queue.
>
> o RDB will only be used for streams classified as thin by the
> function tcp_stream_is_thin_dpifl(). This enforces a lower bound
> on the ITT for streams that may benefit from RDB, controlled by
> the sysctl variable net.ipv4.tcp_thin_dpifl_itt_lower_bound.
>
> o Loss detection of hidden loss events: When bundling redundant data
> with each packet, packet loss can be hidden from the TCP engine due
> to lack of dupACKs. This is because the loss is "repaired" by the
> redundant data in the packet coming after the lost packet. Based on
> incoming ACKs, such hidden loss events are detected, and CWR state
> is entered.
>
> RDB can be enabled on a connection with the socket option TCP_RDB, or
> on all new connections by setting the sysctl variable
> net.ipv4.tcp_rdb=1
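
(As a usage note, not part of the patch: a minimal user-space sketch of
enabling this per connection; the helper name below is only
illustrative, and the TCP_RDB value 29 is taken from this patch.)

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_RDB
#define TCP_RDB 29	/* value proposed in include/uapi/linux/tcp.h */
#endif

/* Enable RDB on an individual connection. Returns 0 on success,
 * -1 with errno set otherwise.
 */
static int enable_rdb(int sockfd)
{
	int one = 1;

	return setsockopt(sockfd, IPPROTO_TCP, TCP_RDB, &one, sizeof(one));
}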
>
> Cc: Andreas Petlund <apetlund@...ula.no>
> Cc: Carsten Griwodz <griff@...ula.no>
> Cc: Pål Halvorsen <paalh@...ula.no>
> Cc: Jonas Markussen <jonassm@....uio.no>
> Cc: Kristian Evensen <kristian.evensen@...il.com>
> Cc: Kenneth Klette Jonassen <kennetkl@....uio.no>
> Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@...il.com>
> ---
> Documentation/networking/ip-sysctl.txt | 15 +++
> include/linux/skbuff.h | 1 +
> include/linux/tcp.h | 3 +-
> include/net/tcp.h | 15 +++
> include/uapi/linux/tcp.h | 1 +
> net/core/skbuff.c | 2 +-
> net/ipv4/Makefile | 3 +-
> net/ipv4/sysctl_net_ipv4.c | 25 ++++
> net/ipv4/tcp.c | 14 +-
> net/ipv4/tcp_input.c | 3 +
> net/ipv4/tcp_output.c | 48 ++++---
> net/ipv4/tcp_rdb.c | 228 +++++++++++++++++++++++++++++++++
> 12 files changed, 335 insertions(+), 23 deletions(-)
> create mode 100644 net/ipv4/tcp_rdb.c
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 6a92b15..8f3f3bf 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -716,6 +716,21 @@ tcp_thin_dpifl_itt_lower_bound - INTEGER
> calculated, which is used to classify whether a stream is thin.
> Default: 10000
>
> +tcp_rdb - BOOLEAN
> + Enable RDB for all new TCP connections.
Please describe RDB briefly here, perhaps with a pointer to your paper.
I suggest having three levels of control:
0: disable RDB completely
1: allow individual thin-stream connections to enable RDB via the
   TCP_RDB socket option
2: enable RDB on all thin-stream connections by default
Currently the patch only provides modes 1 and 2, but there may be cases
where the administrator wants to disallow RDB entirely (e.g., broken
middle-boxes).
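Roughly what I have in mind (an untested sketch; exact placement and
semantics are up to you):

	/* net.ipv4.tcp_rdb:
	 *   0: RDB disabled; the TCP_RDB socket option is rejected
	 *   1: RDB off by default, but may be enabled per socket via TCP_RDB
	 *   2: RDB enabled by default on all thin-stream connections
	 */

	/* in tcp_init_sock() */
	tp->rdb = (sysctl_tcp_rdb == 2);

	/* in do_tcp_setsockopt() */
	case TCP_RDB:
		if (val < 0 || val > 1)
			err = -EINVAL;
		else if (!sysctl_tcp_rdb)
			err = -EPERM;	/* administratively disabled */
		else
			tp->rdb = val;
		break;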
> + Default: 0
> +
> +tcp_rdb_max_bytes - INTEGER
> + Enable restriction on how many bytes an RDB packet can contain.
> + This is the total amount of payload including the new unsent data.
> + Default: 0
> +
> +tcp_rdb_max_packets - INTEGER
> + Enable restriction on how many previous packets in the output queue
> + RDB may include data from. A value of 1 will restrict bundling to
> + only the data from the last packet that was sent.
> + Default: 1
Why two knobs for the redundancy level? It also seems better to let
each individual socket select its redundancy level (e.g., setsockopt
TCP_RDB=3 meaning at most 3 packets per bundle) rather than a global
setting. This requires more bits in tcp_sock, but 2-3 more bits should
suffice.
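For example, roughly (untested; the field name and width are just
illustrative):

	/* struct tcp_sock: a few bits for a per-socket bundling limit */
	u8	rdb_max_pkts:3;	/* bundle data from at most N prior pkts */

	/* do_tcp_setsockopt(): TCP_RDB takes a level instead of a bool */
	case TCP_RDB:
		if (val < 0 || val > 7)
			err = -EINVAL;
		else {
			tp->rdb = (val > 0);
			tp->rdb_max_pkts = val;
		}
		break;

rdb_can_bundle_test() would then consult tp->rdb_max_pkts instead of
the global sysctl_tcp_rdb_max_packets.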
> +
> tcp_limit_output_bytes - INTEGER
> Controls TCP Small Queue limit per tcp socket.
> TCP bulk sender tends to increase packets in flight until it
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 797cefb..0f2c9d1 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -2927,6 +2927,7 @@ int zerocopy_sg_from_iter(struct sk_buff *skb, struct iov_iter *frm);
> void skb_free_datagram(struct sock *sk, struct sk_buff *skb);
> void skb_free_datagram_locked(struct sock *sk, struct sk_buff *skb);
> int skb_kill_datagram(struct sock *sk, struct sk_buff *skb, unsigned int flags);
> +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old);
> int skb_copy_bits(const struct sk_buff *skb, int offset, void *to, int len);
> int skb_store_bits(struct sk_buff *skb, int offset, const void *from, int len);
> __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset, u8 *to,
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index bcbf51d..c84de15 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -207,9 +207,10 @@ struct tcp_sock {
> } rack;
> u16 advmss; /* Advertised MSS */
> u8 unused;
> - u8 nonagle : 4,/* Disable Nagle algorithm? */
> + u8 nonagle : 3,/* Disable Nagle algorithm? */
> thin_lto : 1,/* Use linear timeouts for thin streams */
> thin_dupack : 1,/* Fast retransmit on first dupack */
> + rdb : 1,/* Redundant Data Bundling enabled */
> repair : 1,
> frto : 1;/* F-RTO (RFC5682) activated in CA_Loss */
> u8 repair_queue;
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index d38eae9..2d42f4a 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -267,6 +267,9 @@ extern int sysctl_tcp_slow_start_after_idle;
> extern int sysctl_tcp_thin_linear_timeouts;
> extern int sysctl_tcp_thin_dupack;
> extern int sysctl_tcp_thin_dpifl_itt_lower_bound;
> +extern int sysctl_tcp_rdb;
> +extern int sysctl_tcp_rdb_max_bytes;
> +extern int sysctl_tcp_rdb_max_packets;
> extern int sysctl_tcp_early_retrans;
> extern int sysctl_tcp_limit_output_bytes;
> extern int sysctl_tcp_challenge_ack_limit;
> @@ -539,6 +542,8 @@ void __tcp_push_pending_frames(struct sock *sk, unsigned int cur_mss,
> bool tcp_may_send_now(struct sock *sk);
> int __tcp_retransmit_skb(struct sock *, struct sk_buff *);
> int tcp_retransmit_skb(struct sock *, struct sk_buff *);
> +int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
> + gfp_t gfp_mask);
> void tcp_retransmit_timer(struct sock *sk);
> void tcp_xmit_retransmit_queue(struct sock *);
> void tcp_simple_retransmit(struct sock *);
> @@ -556,6 +561,7 @@ void tcp_send_ack(struct sock *sk);
> void tcp_send_delayed_ack(struct sock *sk);
> void tcp_send_loss_probe(struct sock *sk);
> bool tcp_schedule_loss_probe(struct sock *sk);
> +void tcp_skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb);
>
> /* tcp_input.c */
> void tcp_resume_early_retransmit(struct sock *sk);
> @@ -565,6 +571,11 @@ void tcp_reset(struct sock *sk);
> void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp, struct sk_buff *skb);
> void tcp_fin(struct sock *sk);
>
> +/* tcp_rdb.c */
> +void tcp_rdb_ack_event(struct sock *sk, u32 flags);
> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
> + unsigned int mss_now, gfp_t gfp_mask);
> +
> /* tcp_timer.c */
> void tcp_init_xmit_timers(struct sock *);
> static inline void tcp_clear_xmit_timers(struct sock *sk)
> @@ -763,6 +774,7 @@ struct tcp_skb_cb {
> union {
> struct {
> /* There is space for up to 20 bytes */
> + __u32 rdb_start_seq; /* Start seq of rdb data */
> } tx; /* only used for outgoing skbs */
> union {
> struct inet_skb_parm h4;
> @@ -1497,6 +1509,9 @@ static inline struct sk_buff *tcp_write_queue_prev(const struct sock *sk,
> #define tcp_for_write_queue_from_safe(skb, tmp, sk) \
> skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
>
> +#define tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) \
> + skb_queue_reverse_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
> +
> static inline struct sk_buff *tcp_send_head(const struct sock *sk)
> {
> return sk->sk_send_head;
> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index fe95446..6799875 100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -115,6 +115,7 @@ enum {
> #define TCP_CC_INFO 26 /* Get Congestion Control (optional) info */
> #define TCP_SAVE_SYN 27 /* Record SYN headers for new connections */
> #define TCP_SAVED_SYN 28 /* Get SYN headers recorded for connection */
> +#define TCP_RDB 29 /* Enable Redundant Data Bundling mechanism */
>
> struct tcp_repair_opt {
> __u32 opt_code;
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 7af7ec6..50bc5b0 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -1055,7 +1055,7 @@ static void skb_headers_offset_update(struct sk_buff *skb, int off)
> skb->inner_mac_header += off;
> }
>
> -static void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
> +void copy_skb_header(struct sk_buff *new, const struct sk_buff *old)
> {
> __copy_skb_header(new, old);
>
> diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
> index bfa1336..459048c 100644
> --- a/net/ipv4/Makefile
> +++ b/net/ipv4/Makefile
> @@ -12,7 +12,8 @@ obj-y := route.o inetpeer.o protocol.o \
> tcp_offload.o datagram.o raw.o udp.o udplite.o \
> udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
> fib_frontend.o fib_semantics.o fib_trie.o \
> - inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o
> + inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
> + tcp_rdb.o
>
> obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
> obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index f04320a..43b4390 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -581,6 +581,31 @@ static struct ctl_table ipv4_table[] = {
> .extra1 = &tcp_thin_dpifl_itt_lower_bound_min,
> },
> {
> + .procname = "tcp_rdb",
> + .data = &sysctl_tcp_rdb,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = &zero,
> + .extra2 = &one,
> + },
> + {
> + .procname = "tcp_rdb_max_bytes",
> + .data = &sysctl_tcp_rdb_max_bytes,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = &zero,
> + },
> + {
> + .procname = "tcp_rdb_max_packets",
> + .data = &sysctl_tcp_rdb_max_packets,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,
> + .extra1 = &zero,
> + },
> + {
> .procname = "tcp_early_retrans",
> .data = &sysctl_tcp_early_retrans,
> .maxlen = sizeof(int),
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 8421f3d..b53d4cb 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -288,6 +288,8 @@ int sysctl_tcp_autocorking __read_mostly = 1;
>
> int sysctl_tcp_thin_dpifl_itt_lower_bound __read_mostly = TCP_THIN_DPIFL_ITT_LOWER_BOUND_MIN;
>
> +int sysctl_tcp_rdb __read_mostly;
> +
> struct percpu_counter tcp_orphan_count;
> EXPORT_SYMBOL_GPL(tcp_orphan_count);
>
> @@ -407,6 +409,7 @@ void tcp_init_sock(struct sock *sk)
> u64_stats_init(&tp->syncp);
>
> tp->reordering = sock_net(sk)->ipv4.sysctl_tcp_reordering;
> + tp->rdb = sysctl_tcp_rdb;
> tcp_enable_early_retrans(tp);
> tcp_assign_congestion_control(sk);
>
> @@ -2412,6 +2415,13 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
> }
> break;
>
> + case TCP_RDB:
> + if (val < 0 || val > 1)
> + err = -EINVAL;
> + else
> + tp->rdb = val;
> + break;
> +
> case TCP_REPAIR:
> if (!tcp_can_repair_sock(sk))
> err = -EPERM;
> @@ -2842,7 +2852,9 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
> case TCP_THIN_DUPACK:
> val = tp->thin_dupack;
> break;
> -
> + case TCP_RDB:
> + val = tp->rdb;
> + break;
> case TCP_REPAIR:
> val = tp->repair;
> break;
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index e6e65f7..7b52ce4 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3537,6 +3537,9 @@ static inline void tcp_in_ack_event(struct sock *sk, u32 flags)
>
> if (icsk->icsk_ca_ops->in_ack_event)
> icsk->icsk_ca_ops->in_ack_event(sk, flags);
> +
> + if (unlikely(tcp_sk(sk)->rdb))
> + tcp_rdb_ack_event(sk, flags);
> }
>
> /* Congestion control has updated the cwnd already. So if we're in
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 7d2c7a4..6f92fae 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -897,8 +897,8 @@ out:
> * We are working here with either a clone of the original
> * SKB, or a fresh unique copy made by the retransmit engine.
> */
> -static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
> - gfp_t gfp_mask)
> +int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
> + gfp_t gfp_mask)
> {
> const struct inet_connection_sock *icsk = inet_csk(sk);
> struct inet_sock *inet;
> @@ -2110,9 +2110,12 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> break;
> }
>
> - if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
> + if (unlikely(tcp_sk(sk)->rdb)) {
> + if (tcp_transmit_rdb_skb(sk, skb, mss_now, gfp))
> + break;
> + } else if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp))) {
> break;
> -
> + }
> repair:
> /* Advance the send_head. This one is sent out.
> * This call will increment packets_out.
> @@ -2439,15 +2442,32 @@ u32 __tcp_select_window(struct sock *sk)
> return window;
> }
>
> +/**
> + * tcp_skb_append_data() - copy the linear data from an SKB to the end
> + * of another and update end sequence number
> + * and checksum
> + * @from_skb: the SKB to copy data from
> + * @to_skb: the SKB to copy data to
> + */
> +void tcp_skb_append_data(struct sk_buff *from_skb, struct sk_buff *to_skb)
> +{
> + skb_copy_from_linear_data(from_skb, skb_put(to_skb, from_skb->len),
> + from_skb->len);
> + TCP_SKB_CB(to_skb)->end_seq = TCP_SKB_CB(from_skb)->end_seq;
> +
> + if (from_skb->ip_summed == CHECKSUM_PARTIAL)
> + to_skb->ip_summed = CHECKSUM_PARTIAL;
> +
> + if (to_skb->ip_summed != CHECKSUM_PARTIAL)
> + to_skb->csum = csum_block_add(to_skb->csum, from_skb->csum,
> + to_skb->len);
> +}
> +
> /* Collapses two adjacent SKB's during retransmission. */
> static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
> {
> struct tcp_sock *tp = tcp_sk(sk);
> struct sk_buff *next_skb = tcp_write_queue_next(sk, skb);
> - int skb_size, next_skb_size;
> -
> - skb_size = skb->len;
> - next_skb_size = next_skb->len;
>
> BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1);
>
> @@ -2455,17 +2475,7 @@ static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
>
> tcp_unlink_write_queue(next_skb, sk);
>
> - skb_copy_from_linear_data(next_skb, skb_put(skb, next_skb_size),
> - next_skb_size);
> -
> - if (next_skb->ip_summed == CHECKSUM_PARTIAL)
> - skb->ip_summed = CHECKSUM_PARTIAL;
> -
> - if (skb->ip_summed != CHECKSUM_PARTIAL)
> - skb->csum = csum_block_add(skb->csum, next_skb->csum, skb_size);
> -
> - /* Update sequence range on original skb. */
> - TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(next_skb)->end_seq;
> + tcp_skb_append_data(next_skb, skb);
>
> /* Merge over control information. This moves PSH/FIN etc. over */
> TCP_SKB_CB(skb)->tcp_flags |= TCP_SKB_CB(next_skb)->tcp_flags;
> diff --git a/net/ipv4/tcp_rdb.c b/net/ipv4/tcp_rdb.c
> new file mode 100644
> index 0000000..2b37957
> --- /dev/null
> +++ b/net/ipv4/tcp_rdb.c
> @@ -0,0 +1,228 @@
> +#include <linux/skbuff.h>
> +#include <net/tcp.h>
> +
> +int sysctl_tcp_rdb_max_bytes __read_mostly;
> +int sysctl_tcp_rdb_max_packets __read_mostly = 1;
> +
> +/**
> + * rdb_detect_loss() - perform RDB loss detection by analysing ACKs
> + * @sk: socket
> + *
> + * Traverse the output queue and check if the ACKed packet is an RDB
> + * packet and if the redundant data covers one or more un-ACKed SKBs.
> + * If the incoming ACK acknowledges multiple SKBs, we can presume
> + * packet loss has occurred.
> + *
> + * We can infer packet loss this way because we can expect one ACK per
> + * transmitted data packet, as delayed ACKs are disabled when a host
> + * receives packets where the sequence number is not the expected
> + * sequence number.
> + *
> + * Return: The number of packets that are presumed to be lost
> + */
> +static unsigned int rdb_detect_loss(struct sock *sk)
> +{
> + struct sk_buff *skb, *tmp;
> + struct tcp_skb_cb *scb;
> + u32 seq_acked = tcp_sk(sk)->snd_una;
> + unsigned int packets_lost = 0;
> +
> + tcp_for_write_queue(skb, sk) {
> + if (skb == tcp_send_head(sk))
> + break;
> +
> + scb = TCP_SKB_CB(skb);
> + /* The ACK acknowledges parts of the data in this SKB.
> + * Can be caused by:
> + * - TSO: We abort as RDB is not used on SKBs split across
> + * multiple packets on lower layers as these are greater
> + * than one MSS.
> + * - Retrans collapse: We've had a retrans, so loss has already
> + * been detected.
> + */
> + if (after(scb->end_seq, seq_acked))
> + break;
> + else if (scb->end_seq != seq_acked)
> + continue;
> +
> + /* We have found the ACKed packet */
> +
> + /* This packet was sent with no redundant data, or no prior
> + * un-ACKed SKBs is in the output queue, so break here.
> + */
> + if (scb->tx.rdb_start_seq == scb->seq ||
> + skb_queue_is_first(&sk->sk_write_queue, skb))
> + break;
> + /* Find number of prior SKBs whose data was bundled in this
> + * (ACKed) SKB. We presume any redundant data covering previous
> + * SKB's are due to loss. (An exception would be reordering).
> + */
> + skb = skb->prev;
> + tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
> + if (before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq))
> + break;
> + packets_lost++;
Since we only care whether there was packet loss or not, we could
return early here?
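E.g., have rdb_detect_loss() return a bool (false at the end) and bail
out on the first covered SKB, roughly:

	skb = skb->prev;
	tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
		if (before(TCP_SKB_CB(skb)->seq, scb->tx.rdb_start_seq))
			break;
		/* redundant data covered a prior un-ACKed SKB => loss */
		return true;
	}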
> + }
> + break;
> + }
> + return packets_lost;
> +}
> +
> +/**
> + * tcp_rdb_ack_event() - initiate RDB loss detection
> + * @sk: socket
> + * @flags: flags
> + */
> +void tcp_rdb_ack_event(struct sock *sk, u32 flags)
The flags argument is not used.
> +{
> + if (rdb_detect_loss(sk))
> + tcp_enter_cwr(sk);
> +}
> +
> +/**
> + * rdb_build_skb() - build a new RDB SKB and copy redundant + unsent
> + * data to the linear page buffer
> + * @sk: socket
> + * @xmit_skb: the SKB processed for transmission in the output engine
> + * @first_skb: the first SKB in the output queue to be bundled
> + * @bytes_in_rdb_skb: the total number of data bytes for the new
> + * rdb_skb (NEW + Redundant)
> + * @gfp_mask: gfp_t allocation
> + *
> + * Return: A new SKB containing redundant data, or NULL if memory
> + * allocation failed
> + */
> +static struct sk_buff *rdb_build_skb(const struct sock *sk,
> + struct sk_buff *xmit_skb,
> + struct sk_buff *first_skb,
> + u32 bytes_in_rdb_skb,
> + gfp_t gfp_mask)
> +{
> + struct sk_buff *rdb_skb, *tmp_skb = first_skb;
> +
> + rdb_skb = sk_stream_alloc_skb((struct sock *)sk,
> + (int)bytes_in_rdb_skb,
> + gfp_mask, false);
> + if (!rdb_skb)
> + return NULL;
> + copy_skb_header(rdb_skb, xmit_skb);
> + rdb_skb->ip_summed = xmit_skb->ip_summed;
> + TCP_SKB_CB(rdb_skb)->seq = TCP_SKB_CB(first_skb)->seq;
> + TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(rdb_skb)->seq;
> +
> + /* Start on first_skb and append payload from each SKB in the output
> + * queue onto rdb_skb until we reach xmit_skb.
> + */
> + tcp_for_write_queue_from(tmp_skb, sk) {
> + tcp_skb_append_data(tmp_skb, rdb_skb);
> +
> + /* We reached xmit_skb, containing the unsent data */
> + if (tmp_skb == xmit_skb)
> + break;
> + }
> + return rdb_skb;
> +}
> +
> +/**
> + * rdb_can_bundle_test() - test if redundant data can be bundled
> + * @sk: socket
> + * @xmit_skb: the SKB processed for transmission by the output engine
> + * @max_payload: the maximum allowed payload bytes for the RDB SKB
> + * @bytes_in_rdb_skb: store the total number of payload bytes in the
> + * RDB SKB if bundling can be performed
> + *
> + * Traverse the output queue and check if any un-acked data may be
> + * bundled.
> + *
> + * Return: The first SKB to be in the bundle, or NULL if no bundling
> + */
> +static struct sk_buff *rdb_can_bundle_test(const struct sock *sk,
> + struct sk_buff *xmit_skb,
> + unsigned int max_payload,
> + u32 *bytes_in_rdb_skb)
> +{
> + struct sk_buff *first_to_bundle = NULL;
> + struct sk_buff *tmp, *skb = xmit_skb->prev;
> + u32 skbs_in_bundle_count = 1; /* Start on 1 to account for xmit_skb */
> + u32 total_payload = xmit_skb->len;
> +
> + if (sysctl_tcp_rdb_max_bytes)
> + max_payload = min_t(unsigned int, max_payload,
> + sysctl_tcp_rdb_max_bytes);
> +
> + /* We start at xmit_skb->prev, and go backwards */
> + tcp_for_write_queue_reverse_from_safe(skb, tmp, sk) {
> + /* Including data from this SKB would exceed payload limit */
> + if ((total_payload + skb->len) > max_payload)
> + break;
> +
> + if (sysctl_tcp_rdb_max_packets &&
> + (skbs_in_bundle_count > sysctl_tcp_rdb_max_packets))
> + break;
> +
> + total_payload += skb->len;
> + skbs_in_bundle_count++;
> + first_to_bundle = skb;
> + }
> + *bytes_in_rdb_skb = total_payload;
> + return first_to_bundle;
> +}
> +
> +/**
> + * tcp_transmit_rdb_skb() - try to create and send an RDB packet
> + * @sk: socket
> + * @xmit_skb: the SKB processed for transmission by the output engine
> + * @mss_now: current mss value
> + * @gfp_mask: gfp_t allocation
> + *
> + * If an RDB packet could not be created and sent, transmit the
> + * original unmodified SKB (xmit_skb).
> + *
> + * Return: 0 if successfully sent packet, else error from
> + * tcp_transmit_skb
> + */
> +int tcp_transmit_rdb_skb(struct sock *sk, struct sk_buff *xmit_skb,
> + unsigned int mss_now, gfp_t gfp_mask)
> +{
> + struct sk_buff *rdb_skb = NULL;
> + struct sk_buff *first_to_bundle;
> + u32 bytes_in_rdb_skb = 0;
> +
> + /* How we detect that RDB was used. When equal, no RDB data was sent */
> + TCP_SKB_CB(xmit_skb)->tx.rdb_start_seq = TCP_SKB_CB(xmit_skb)->seq;
> +
> + if (!tcp_stream_is_thin_dpifl(tcp_sk(sk)))
During loss recovery, TCP inflight fluctuates and is likely to trigger
this check even for non-thin-stream connections. Since loss has already
occurred at that point, RDB could only take advantage of
limited-transmit, which it likely does not have (because it's a thin
stream). It might be worth also checking that the CA state is Open.
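E.g. (untested):

	if (!tcp_stream_is_thin_dpifl(tcp_sk(sk)) ||
	    inet_csk(sk)->icsk_ca_state != TCP_CA_Open)
		goto xmit_default;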
> + goto xmit_default;
> +
> + /* No bundling if first in queue, or on FIN packet */
> + if (skb_queue_is_first(&sk->sk_write_queue, xmit_skb) ||
> + (TCP_SKB_CB(xmit_skb)->tcp_flags & TCPHDR_FIN))
It seems there would still be a benefit in bundling redundant data on
the FIN packet as well?
> + goto xmit_default;
> +
> + /* Find number of (previous) SKBs to get data from */
> + first_to_bundle = rdb_can_bundle_test(sk, xmit_skb, mss_now,
> + &bytes_in_rdb_skb);
> + if (!first_to_bundle)
> + goto xmit_default;
> +
> + /* Create an SKB that contains redundant data starting from
> + * first_to_bundle.
> + */
> + rdb_skb = rdb_build_skb(sk, xmit_skb, first_to_bundle,
> + bytes_in_rdb_skb, gfp_mask);
> + if (!rdb_skb)
> + goto xmit_default;
> +
> + /* Set skb_mstamp for the SKB in the output queue (xmit_skb) containing
> + * the yet unsent data. Normally this would be done by
> + * tcp_transmit_skb(), but as we pass in rdb_skb instead, xmit_skb's
> + * timestamp will not be touched.
> + */
> + skb_mstamp_get(&xmit_skb->skb_mstamp);
> + rdb_skb->skb_mstamp = xmit_skb->skb_mstamp;
> + return tcp_transmit_skb(sk, rdb_skb, 0, gfp_mask);
> +
> +xmit_default:
> + /* Transmit the unmodified SKB from output queue */
> + return tcp_transmit_skb(sk, xmit_skb, 1, gfp_mask);
> +}
> --
> 1.9.1
>
Since RDB will cause DSACKs, and we blindly count DSACKs to perform
CWND undo, how does RDB handle these false positives?