Message-ID: <alpine.DEB.2.10.1609081236470.18111@whp-28.cs.helsinki.fi>
Date: Thu, 8 Sep 2016 14:02:51 +0300 (EEST)
From: "Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi>
To: Eric Dumazet <eric.dumazet@...il.com>
cc: David Miller <davem@...emloft.net>,
netdev <netdev@...r.kernel.org>,
Yaogong Wang <wygivan@...gle.com>,
Yuchung Cheng <ycheng@...gle.com>,
Neal Cardwell <ncardwell@...gle.com>
Subject: Re: [PATCH net-next] tcp: use an RB tree for ooo receive queue
On Wed, 7 Sep 2016, Eric Dumazet wrote:
> From: Yaogong Wang <wygivan@...gle.com>
>
> Over the years, TCP BDP has increased by several orders of magnitude,
> and some people are considering reaching the 2 Gbytes limit.
>
> Even with the current window scale limit of 14, ~1 Gbyte maps to ~740,000
> MSS.
>
> In the presence of packet losses (or reorders), TCP stores incoming packets
> in an out-of-order queue, and the number of skbs sitting there waiting for
> the missing packets to be received can be in the 10^5 range.
>
> Most packets are appended to the tail of this queue, and when
> packets can finally be transferred to receive queue, we scan the queue
> from its head.
>
> However, in the presence of heavy losses, we might have to find an arbitrary
> point in this queue, involving a linear scan for every incoming packet and
> thrashing cpu caches.
>
> This patch converts it to an RB tree, to get bounded latencies.
>
> Yaogong wrote a preliminary patch about 2 years ago.
> Eric did the rebase, added the ooo_last_skb cache, and did the polishing and tests.
>
> Tested with the network dropping between 1 and 10 % of packets, with good
> success (about a 30 % increase in throughput in stress tests)
While testing, was there any check done on the data that was delivered,
to ensure that no corruption occurred (either by you or Yaogong)?
This kind of change has some potential to corrupt the stream content,
and it would be nice to be sure there were no accidents.
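(FWIW, one straightforward way to check this is to stream pseudo-random
bytes through the lossy link and verify them on the receiver. A
hypothetical userspace sketch, nothing to do with the patch itself;
verify_stream() and SEED are made-up names:)

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SEED 0x5eedU	/* agreed upon by sender and receiver */

/* The sender writes bytes taken from rand_r() seeded with SEED, so
 * any reordering or corruption introduced by the stack shows up as
 * a mismatch at a known offset.
 */
static int verify_stream(int fd)
{
	unsigned int state = SEED;
	unsigned char buf[4096];
	long long total = 0;
	ssize_t n;

	while ((n = read(fd, buf, sizeof(buf))) > 0) {
		for (ssize_t i = 0; i < n; i++) {
			unsigned char expect = rand_r(&state) & 0xff;

			if (buf[i] != expect) {
				fprintf(stderr, "corruption at offset %lld\n",
					(long long)(total + i));
				return -1;
			}
		}
		total += n;
	}
	return n < 0 ? -1 : 0;
}

int main(void)
{
	return verify_stream(0) ? 1 : 0;	/* verify stdin */
}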
> Next step would be to also use an RB tree for the write queue at sender
> side ;)
>
> @@ -4403,8 +4414,10 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
> static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
> {
> struct tcp_sock *tp = tcp_sk(sk);
> + struct rb_node **p, *q, *parent;
> struct sk_buff *skb1;
> u32 seq, end_seq;
> + bool fragstolen;
>
> tcp_ecn_check_ce(tp, skb);
>
> @@ -4419,88 +4432,85 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
> inet_csk_schedule_ack(sk);
>
> NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPOFOQUEUE);
> + seq = TCP_SKB_CB(skb)->seq;
> + end_seq = TCP_SKB_CB(skb)->end_seq;
> SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
> - tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
> + tp->rcv_nxt, seq, end_seq);
>
> - skb1 = skb_peek_tail(&tp->out_of_order_queue);
> - if (!skb1) {
> + p = &tp->out_of_order_queue.rb_node;
> + if (RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
> /* Initial out of order segment, build 1 SACK. */
> if (tcp_is_sack(tp)) {
> tp->rx_opt.num_sacks = 1;
> - tp->selective_acks[0].start_seq = TCP_SKB_CB(skb)->seq;
> - tp->selective_acks[0].end_seq =
> - TCP_SKB_CB(skb)->end_seq;
> + tp->selective_acks[0].start_seq = seq;
> + tp->selective_acks[0].end_seq = end_seq;
> }
> - __skb_queue_head(&tp->out_of_order_queue, skb);
> + rb_link_node(&skb->rbnode, NULL, p);
> + rb_insert_color(&skb->rbnode, &tp->out_of_order_queue);
> + tp->ooo_last_skb = skb;
> goto end;
> }
>
> - seq = TCP_SKB_CB(skb)->seq;
> - end_seq = TCP_SKB_CB(skb)->end_seq;
I hate to nitpick, but moving these variables earlier and the
TCP_SKB_CB(skb)->seq/end_seq simplifications seem unrelated and
could be done in a separate patch?
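(As an aside for anyone less familiar with the rbtree API: the insert
idiom the patch relies on -- walking down through the child links and
then calling rb_link_node() + rb_insert_color() -- reduces to roughly
the following. This is a simplified sketch with a made-up ooo_insert()
name, not the exact patch code, which also has to deal with overlaps,
duplicates and the ooo_last_skb cache:)

/* Walk down to a NULL child slot keyed on the starting sequence
 * number, remembering the parent, then splice the new node in and
 * let rb_insert_color() rebalance. With an empty tree, the loop
 * body is skipped and we link at the root, as in the hunk above.
 */
static void ooo_insert(struct rb_root *root, struct sk_buff *skb)
{
	struct rb_node **p = &root->rb_node;
	struct rb_node *parent = NULL;
	u32 seq = TCP_SKB_CB(skb)->seq;

	while (*p) {
		struct sk_buff *skb1;

		parent = *p;
		skb1 = rb_entry(parent, struct sk_buff, rbnode);
		if (before(seq, TCP_SKB_CB(skb1)->seq))
			p = &parent->rb_left;
		else
			p = &parent->rb_right;
	}
	rb_link_node(&skb->rbnode, parent, p);
	rb_insert_color(&skb->rbnode, root);
}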
> @@ -4830,43 +4876,43 @@ restart:
> static void tcp_collapse_ofo_queue(struct sock *sk)
> {
> struct tcp_sock *tp = tcp_sk(sk);
> - struct sk_buff *skb = skb_peek(&tp->out_of_order_queue);
> - struct sk_buff *head;
> + struct sk_buff *skb, *head;
> + struct rb_node *p;
> u32 start, end;
>
> - if (!skb)
> + p = rb_first(&tp->out_of_order_queue);
> + skb = rb_entry_safe(p, struct sk_buff, rbnode);
> +new_range:
> + if (!skb) {
> + p = rb_last(&tp->out_of_order_queue);
> + /* Note: it is possible that p is NULL here. We do not
> + * use rb_entry_safe(), as ooo_last_skb is valid only
> + * if rbtree is not empty.
> + */
> + tp->ooo_last_skb = rb_entry(p, struct sk_buff, rbnode);
> return;
> -
> + }
> start = TCP_SKB_CB(skb)->seq;
> end = TCP_SKB_CB(skb)->end_seq;
> - head = skb;
> -
> - for (;;) {
> - struct sk_buff *next = NULL;
>
> - if (!skb_queue_is_last(&tp->out_of_order_queue, skb))
> - next = skb_queue_next(&tp->out_of_order_queue, skb);
> - skb = next;
> + for (head = skb;;) {
> + skb = tcp_skb_next(skb, NULL);
>
> - /* Segment is terminated when we see gap or when
> - * we are at the end of all the queue. */
> + /* Range is terminated when we see a gap or when
> + * we are at the queue end.
> + */
> if (!skb ||
> after(TCP_SKB_CB(skb)->seq, end) ||
> before(TCP_SKB_CB(skb)->end_seq, start)) {
> - tcp_collapse(sk, &tp->out_of_order_queue,
> + tcp_collapse(sk, NULL, &tp->out_of_order_queue,
> head, skb, start, end);
> - head = skb;
> - if (!skb)
> - break;
> - /* Start new segment */
> + goto new_range;
> + }
> +
> + if (unlikely(before(TCP_SKB_CB(skb)->seq, start)))
> start = TCP_SKB_CB(skb)->seq;
> + if (after(TCP_SKB_CB(skb)->end_seq, end))
> end = TCP_SKB_CB(skb)->end_seq;
> - } else {
> - if (before(TCP_SKB_CB(skb)->seq, start))
> - start = TCP_SKB_CB(skb)->seq;
> - if (after(TCP_SKB_CB(skb)->end_seq, end))
> - end = TCP_SKB_CB(skb)->end_seq;
> - }
> }
> }
I tried for a long time to think of an alternative layout that would
make this function exit from the end, but couldn't come up with anything
sensible. As is, it always exits from within that top if block, which is
somewhat unintuitive :-).
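(To make the shape concrete, here is a toy standalone analogue -- my
own simplification over a sorted array of segments, not the patch
code -- showing how both exits happen inside that top if () block:)

#include <stdio.h>

struct seg { int seq, end_seq; };

static void collapse_runs(const struct seg *s, int n)
{
	int i = 0, start, end;

new_range:
	if (i >= n)
		return;		/* "tree" exhausted: the only way out */
	start = s[i].seq;
	end = s[i].end_seq;
	for (;;) {
		i++;
		/* Range is terminated on a gap or at the array end. */
		if (i >= n || s[i].seq > end) {
			printf("collapse [%d, %d)\n", start, end);
			goto new_range;
		}
		if (s[i].end_seq > end)
			end = s[i].end_seq;
	}
}

int main(void)
{
	const struct seg s[] = { { 1, 3 }, { 3, 5 }, { 7, 9 } };

	collapse_runs(s, 3);	/* prints [1, 5) then [7, 9) */
	return 0;
}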
Acked-By: Ilpo Järvinen <ilpo.jarvinen@...sinki>
--
i.