Date:	Sun,  2 Dec 2007 00:48:16 +0200
From:	"Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi>
To:	David Miller <davem@...emloft.net>,
	Herbert Xu <herbert@...dor.apana.org.au>
Cc:	netdev@...r.kernel.org,
	Stephen Hemminger <shemminger@...ux-foundation.org>,
	"Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi>
Subject: [PATCH 21/21] [TCP]: Split write queue into two parts (SACKed and not)

From: Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>

Due to the linked list implementation, which must occasionally spend
huge effort to locate the correct skb, tagging per-skb scoreboard
information is becoming an ever heavier operation as window sizes
grow. This processing takes a considerable amount of time and
seriously degrades the throughput of a large-windowed TCP because,
in the worst case, RTOs occur due to SACK latencies.

Another thing to consider, which is the main motivation for this
work, is a malicious peer who intentionally forces TCP to do such
heavy processing per ACK. Currently very little is done to prevent
such resource consumption. We would like to bound the processing to
O(outstanding skbs) per outstanding window.

The problem with the scoreboard was/is two-fold:

1) An O(outstanding skbs) walk per ACK when finding the beginning
   of a SACK block
2) An O(outstanding skbs) walk per ACK when marking the new
   information

Sadly enough there's one more thing: DSACKs may point to an already
SACKed area. This too currently forces O(outstanding skbs)
processing (3).
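
For illustration only (a simplified, standalone sketch with made-up
struct and field names, not the kernel code): on a plain linked
list, merely locating the first segment covered by a SACK block
means a walk like the one below on every ACK, which is what
properties 1 and 3 boil down to.

	struct seg {
		struct seg *next;
		unsigned int seq, end_seq;
	};

	/* O(outstanding segments) on a plain list; real code must also
	 * use before()/after() to cope with sequence number wrap-around.
	 */
	static struct seg *find_sack_start(struct seg *head,
					   unsigned int start_seq)
	{
		struct seg *s;

		for (s = head; s != NULL; s = s->next)
			if (s->end_seq > start_seq)
				return s;
		return NULL;
	}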

Without an RB-tree at all (or another log(n) search structure),
properties 1 and 3 are our obstacle. A single skb space for all
skbs fails to meet the desired bound because of property 2. That
was the situation so far, though we optimized the typical case
quite well with the help of recv_sack_cache and the highest_sack
skb fast-forward reference.

We introduced an RB-tree in an earlier patch, making the search for
an skb fast (this fixes 1 and 3). To combat property 2, we separate
the new-SACK-info and DSACK problem spaces. Just removing SACKed
items from the RB-tree fails because DSACKs would no longer be fast
searchable, which malicious peers could take advantage of. For a
fast search and an O(new work only) walk, the RB-tree must return
an skb that belongs to new work, otherwise walking may do extra
work equal to property 2. Therefore separate RB-trees for the
non-SACKed and SACKed spaces are required.

...well, that's a short introduction.

This is a somewhat costly but simple way to do it. It can be
optimized later.
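
For orientation, the per-socket queue state after this patch looks
as follows (an excerpt of the include/linux/tcp.h hunk below, other
struct tcp_sock fields elided; the non-SACKed list itself remains
sk->sk_write_queue). Each space is kept both as an RB-tree for
log(n) lookup by seq and as a parallel list for in-order walking:

	struct rb_root		write_queue_rb;		/* non-SACKed space */
	struct rb_root		sacked_queue_rb;	/* SACKed space (new) */
	struct sk_buff_head	sacked_queue;		/* list parallel to sacked_queue_rb (new) */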

The non-SACKed space will also contain S|R skbs because the
mark_lost_retrans loop must be able to find them efficiently.
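
Concretely, the space-selection rule is the __tcp_skb_queue_select()
helper added to include/net/tcp.h (quoted here ahead of the diff for
clarity): only skbs that are SACKed and not also marked retransmitted
go to the SACKed space.

	static inline int __tcp_skb_queue_select(const struct sk_buff *skb)
	{
		if ((TCP_SKB_CB(skb)->sacked &
				(TCPCB_SACKED_ACKED|TCPCB_SACKED_RETRANS)) ==
				TCPCB_SACKED_ACKED)
			return TCP_WQ_SACKED;
		else
			return 0;	/* non-SACKed space, including S|R */
	}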

RB-tree insertion has a nice side-effect that we take advantage of:
the insertion point for the sacked list is found without additional
cost (we still maintain a linked list parallel to the RB-tree,
though this may change in the future).
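
The trick is generic; a plain (unbalanced) BST sketch of it, not the
rbtree helper added below: while descending to the insertion slot of
a node keyed by seq, the parent of the last right turn taken is the
new node's in-order predecessor, i.e. exactly the list insertion
point we need.

	struct node {
		struct node *left, *right;
		unsigned int seq;
	};

	/* Returns the in-order predecessor of the inserted node, or NULL
	 * if the new key is the smallest in the tree.
	 */
	static struct node *insert_track_prev(struct node **root,
					      struct node *new)
	{
		struct node **link = root, *prev = NULL;

		while (*link) {
			if (new->seq < (*link)->seq) {
				link = &(*link)->left;
			} else {
				prev = *link;	/* last right turn so far */
				link = &(*link)->right;
			}
		}
		new->left = new->right = NULL;
		*link = new;
		return prev;
	}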

Minor changes in tcp_input.c had to be made to fix assumptions
that are no longer valid because tcp_for_write_queue won't
necessarily return every item as it did previously.

recv_sack_cache and tp->highest_sack may be obsolete after
this. At least the highest item of recv_sack_cache is very
likely redundant.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@...sinki.fi>
---
 include/linux/tcp.h   |    2 +
 include/net/tcp.h     |  276 +++++++++++++++++++++++++++++++++++++++++-------
 net/ipv4/tcp_input.c  |  236 +++++++++++++++++++++--------------------
 net/ipv4/tcp_output.c |    4 +-
 4 files changed, 360 insertions(+), 158 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c24fc06..56342c3 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -322,6 +322,8 @@ struct tcp_sock {
 	u32	snd_cwnd_stamp;
 
 	struct rb_root		write_queue_rb;
+	struct rb_root		sacked_queue_rb;
+	struct sk_buff_head	sacked_queue;
 	struct sk_buff_head	out_of_order_queue; /* Out of order segments go here */
 
  	u32	rcv_wnd;	/* Current receiver window		*/
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 688ccca..7ae72c3 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1190,49 +1190,113 @@ static inline void		tcp_put_md5sig_pool(void)
 	put_cpu();
 }
 
+/* write queue abstraction */
+#define TCP_WQ_SACKED	1
+
+static inline struct sk_buff_head *__tcp_list_select(struct sock *sk, const int queue)
+{
+	if (queue == TCP_WQ_SACKED)
+		return &tcp_sk(sk)->sacked_queue;
+	else
+		return &sk->sk_write_queue;
+}
+
+static inline struct rb_root *__tcp_tree_select(struct sock *sk, const int tree)
+{
+	if (tree == TCP_WQ_SACKED)
+		return &tcp_sk(sk)->sacked_queue_rb;
+	else
+		return &tcp_sk(sk)->write_queue_rb;
+}
+
+/* All SACKed except S|R go to a separate skb space */
+static inline int __tcp_skb_queue_select(const struct sk_buff *skb)
+{
+	if ((TCP_SKB_CB(skb)->sacked &
+			(TCPCB_SACKED_ACKED|TCPCB_SACKED_RETRANS)) ==
+			TCPCB_SACKED_ACKED)
+		return TCP_WQ_SACKED;
+	else
+		return 0;
+}
+
 static inline void tcp_write_queue_init(struct sock *sk)
 {
 	tcp_sk(sk)->write_queue_rb = RB_ROOT;
+	tcp_sk(sk)->sacked_queue_rb = RB_ROOT;
+	skb_queue_head_init(&tcp_sk(sk)->sacked_queue);
 }
 
-/* write queue abstraction */
-static inline void tcp_write_queue_purge(struct sock *sk)
+static inline void __tcp_write_queue_purge(struct sock *sk, int queue)
 {
 	struct sk_buff *skb;
 
-	while ((skb = __skb_dequeue(&sk->sk_write_queue)) != NULL)
+	while ((skb = __skb_dequeue(__tcp_list_select(sk, queue))) != NULL)
 		sk_stream_free_skb(sk, skb);
-	tcp_sk(sk)->write_queue_rb = RB_ROOT;
+	*__tcp_tree_select(sk, queue) = RB_ROOT;
+}
+
+static inline void tcp_write_queue_purge(struct sock *sk)
+{
+	__tcp_write_queue_purge(sk, 0);
+	__tcp_write_queue_purge(sk, TCP_WQ_SACKED);
 	sk_stream_mem_reclaim(sk);
 }
 
-static inline struct sk_buff *tcp_write_queue_head(struct sock *sk)
+static inline struct sk_buff *__tcp_write_queue_head(struct sock *sk, int queue)
 {
-	struct sk_buff *skb = sk->sk_write_queue.next;
-	if (skb == (struct sk_buff *) &sk->sk_write_queue)
+	struct sk_buff *skb = __tcp_list_select(sk, queue)->next;
+	if (skb == (struct sk_buff *)__tcp_list_select(sk, queue))
 		return NULL;
 	return skb;
 }
 
+static inline struct sk_buff *tcp_write_queue_head(struct sock *sk)
+{
+	return __tcp_write_queue_head(sk, 0);
+}
+
 /* FIXME, this should eventually vanish because callers likely benefit
  * from scanning the non-SACKed and SACKed spaces separately.
  */
 static inline struct sk_buff *tcp_real_queue_head(struct sock *sk)
 {
-	return tcp_write_queue_head(sk);
+	struct sk_buff *skb, *sacked;
+
+	skb = tcp_write_queue_head(sk);
+	sacked = __tcp_write_queue_head(sk, TCP_WQ_SACKED);
+
+	if (skb == NULL)
+		return sacked;
+	if (sacked == NULL)
+		return skb;
+
+	if (after(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(sacked)->seq))
+		return sacked;
+	return skb;
 }
 
-static inline struct sk_buff *tcp_write_queue_tail(struct sock *sk)
+static inline struct sk_buff *__tcp_write_queue_tail(struct sock *sk, int queue)
 {
-	struct sk_buff *skb = sk->sk_write_queue.prev;
-	if (skb == (struct sk_buff *) &sk->sk_write_queue)
+	struct sk_buff *skb = __tcp_list_select(sk, queue)->prev;
+	if (skb == (struct sk_buff *)__tcp_list_select(sk, queue))
 		return NULL;
 	return skb;
 }
 
+static inline struct sk_buff *tcp_write_queue_tail(struct sock *sk)
+{
+	return __tcp_write_queue_tail(sk, 0);
+}
+
+static inline int __tcp_write_queue_empty(struct sock *sk, int queue)
+{
+	return skb_queue_empty(__tcp_list_select(sk, queue));
+}
+
 static inline int tcp_write_queue_empty(struct sock *sk)
 {
-	return skb_queue_empty(&sk->sk_write_queue);
+	return __tcp_write_queue_empty(sk, 0);
 }
 
 static inline struct sk_buff *tcp_write_queue_next(struct sock *sk, struct sk_buff *skb)
@@ -1252,17 +1316,17 @@ static inline int tcp_skb_adjacent(struct sock *sk, struct sk_buff *skb,
 }
 
 #define tcp_for_write_queue(skb, sk, queue)				\
-		for (skb = (sk)->sk_write_queue.next;			\
-		     (skb != (struct sk_buff *)&(sk)->sk_write_queue);	\
+		for (skb = __tcp_list_select(sk, queue)->next;		\
+		     (skb != (struct sk_buff *)__tcp_list_select(sk, queue));\
 		     skb = skb->next)
 
 #define tcp_for_write_queue_from(skb, sk, queue)			\
-		for (; (skb != (struct sk_buff *)&(sk)->sk_write_queue);\
+		for (; (skb != (struct sk_buff *)__tcp_list_select(sk, queue));\
 		     skb = skb->next)
 
 #define tcp_for_write_queue_from_safe(skb, tmp, sk, queue)		\
 		for (tmp = skb->next;					\
-		     (skb != (struct sk_buff *)&(sk)->sk_write_queue);	\
+		     (skb != (struct sk_buff *)__tcp_list_select(sk, queue));\
 		     skb = tmp, tmp = skb->next)
 
 static inline struct sk_buff *tcp_send_head(struct sock *sk)
@@ -1293,9 +1357,9 @@ static inline void tcp_init_send_head(struct sock *sk)
 	sk->sk_send_head = NULL;
 }
 
-static inline struct sk_buff *tcp_write_queue_find(struct sock *sk, __u32 seq)
+static inline struct sk_buff *__tcp_write_queue_find(struct rb_node *rb_node,
+						     __u32 seq)
 {
-	struct rb_node *rb_node = tcp_sk(sk)->write_queue_rb.rb_node;
 	struct sk_buff *skb = NULL;
 
 	while (rb_node) {
@@ -1312,9 +1376,21 @@ static inline struct sk_buff *tcp_write_queue_find(struct sock *sk, __u32 seq)
 	return skb;
 }
 
-static inline void tcp_rb_insert(struct sk_buff *skb, struct rb_root *root)
+static inline struct sk_buff *tcp_write_queue_find(struct sock *sk, __u32 seq, int tree)
+{
+	return __tcp_write_queue_find(__tcp_tree_select(sk, tree)->rb_node, seq);
+}
+
+/* Inserts skb into the RB-tree at root; the prev node (i.e., the skb before
+ * the inserted one) is returned, available as a side-effect from the parent
+ * of the last rb_right edge walked. If no rb_right edge is walked, NULL is
+ * returned (the tree does not contain a smaller node).
+ */
+static struct sk_buff *__tcp_rb_insert(struct sk_buff *skb,
+				       struct rb_root *root)
 {
 	struct rb_node **rb_link, *rb_parent;
+	struct sk_buff *prev = NULL;
 	__u32 seq = TCP_SKB_CB(skb)->seq;
 
 	rb_link = &root->rb_node;
@@ -1329,10 +1405,18 @@ static inline void tcp_rb_insert(struct sk_buff *skb, struct rb_root *root)
 			rb_link = &rb_parent->rb_left;
 		} else {
 			rb_link = &rb_parent->rb_right;
+			prev = tmp;
 		}
 	}
 	rb_link_node(&skb->rb, rb_parent, rb_link);
 	rb_insert_color(&skb->rb, root);
+
+	return prev;
+}
+
+static inline void tcp_rb_insert(struct sk_buff *skb, struct rb_root *root)
+{
+	__tcp_rb_insert(skb, root);
 }
 
 static inline void tcp_rb_unlink(struct sk_buff *skb, struct rb_root *root)
@@ -1372,59 +1456,166 @@ static inline void __tcp_add_write_queue_head(struct sock *sk, struct sk_buff *s
 	tcp_rb_insert(skb, &tcp_sk(sk)->write_queue_rb);
 }
 
-/* An insert into the middle of the write queue causes the fack
- * counts in subsequent packets to become invalid, fix them up.
- */
-static inline void tcp_reset_fack_counts(struct sock *sk, struct sk_buff *skb)
+static inline struct sk_buff *__tcp_reset_fack_counts(struct sock *sk,
+						      struct sk_buff *skb,
+						      struct sk_buff **prev,
+						      int queue)
 {
-	struct sk_buff *prev = skb->prev;
 	unsigned int fc = 0;
 
-	if (prev != (struct sk_buff *) &sk->sk_write_queue)
-		fc = TCP_SKB_CB(prev)->fack_count + tcp_skb_pcount(prev);
+	if (*prev != NULL)
+		fc = TCP_SKB_CB(*prev)->fack_count + tcp_skb_pcount(*prev);
 
-	tcp_for_write_queue_from(skb, sk, 0) {
-		if (!before(TCP_SKB_CB(skb)->seq, tcp_sk(sk)->snd_nxt) ||
-		    TCP_SKB_CB(skb)->fack_count == fc)
+	BUG_ON((*prev != NULL) && !tcp_skb_adjacent(sk, *prev, skb));
+
+	tcp_for_write_queue_from(skb, sk, queue) {
+		if ((*prev != NULL) && !tcp_skb_adjacent(sk, *prev, skb))
 			break;
 
+		/* Terminate whole search? */
+		if (!before(TCP_SKB_CB(skb)->seq, tcp_sk(sk)->snd_nxt) ||
+		    TCP_SKB_CB(skb)->fack_count == fc) {
+			*prev = NULL;
+			return NULL;
+		}
+
 		TCP_SKB_CB(skb)->fack_count = fc;
 		fc += tcp_skb_pcount(skb);
+
+		*prev = skb;
+	}
+
+	return skb;
+}
+
+/* An insert into the middle of the write queue causes the fack
+ * counts in subsequent packets to become invalid, fix them up.
+ *
+ * FIXME, this definitely could be improved!
+ */
+static inline void tcp_reset_fack_counts(struct sock *sk, struct sk_buff *inskb)
+{
+	struct sk_buff *prev;
+	struct sk_buff *skb[2] = {NULL, NULL};
+	int queue;
+
+	if (!before(TCP_SKB_CB(inskb)->seq, tcp_sk(sk)->snd_nxt))
+		return;
+
+	queue = __tcp_skb_queue_select(inskb);
+	skb[queue] = inskb;
+
+	prev = inskb->prev;
+	if (inskb == __tcp_write_queue_head(sk, queue))
+		prev = NULL;
+
+	if (((prev != NULL) && !tcp_skb_adjacent(sk, prev, inskb)) ||
+	    ((prev == NULL) && (TCP_SKB_CB(inskb)->seq != tcp_sk(sk)->snd_una))) {
+		int otherq = queue ^ TCP_WQ_SACKED;
+
+		BUG_ON (__tcp_write_queue_empty(sk, otherq));
+		prev = tcp_write_queue_find(sk, TCP_SKB_CB(inskb)->seq - 1,
+					    otherq);
+		BUG_ON (prev == NULL || prev == tcp_send_head(sk));
+		skb[otherq] = prev->next;
+	}
+
+	while (skb[queue] != __tcp_write_queue_tail(sk, queue)) {
+		/* Lazy find for the other queue */
+		if (skb[queue] == NULL) {
+			skb[queue] = tcp_write_queue_find(sk, TCP_SKB_CB(prev)->seq,
+						          queue ^ TCP_WQ_SACKED);
+			if (skb[queue] == NULL)
+				break;
+		}
+
+		skb[queue] = __tcp_reset_fack_counts(sk, skb[queue], &prev, queue);
+		if (skb[queue] == NULL)
+			break;
+
+		queue ^= TCP_WQ_SACKED;
 	}
 }
 
+static inline void __tcp_insert_write_queue_after(struct sk_buff *skb,
+						  struct sk_buff *buff,
+						  struct sock *sk,
+						  int queue)
+{
+	__skb_append(skb, buff, __tcp_list_select(sk, queue));
+	tcp_rb_insert(buff, __tcp_tree_select(sk, queue));
+}
+
 /* Insert buff after skb on the write queue of sk.  */
 static inline void tcp_insert_write_queue_after(struct sk_buff *skb,
 						struct sk_buff *buff,
 						struct sock *sk)
 {
-	__skb_append(skb, buff, &sk->sk_write_queue);
+	__tcp_insert_write_queue_after(skb, buff, sk, __tcp_skb_queue_select(buff));
 	tcp_reset_fack_counts(sk, buff);
-	tcp_rb_insert(buff, &tcp_sk(sk)->write_queue_rb);
 }
 
-/* Insert new before skb on the write queue of sk.  */
-static inline void tcp_insert_write_queue_before(struct sk_buff *new,
-						  struct sk_buff *skb,
-						  struct sock *sk)
+/* Insert new before skb on the write queue of sk.
+ *
+ * This is only used for tcp_mtu_probe() new send_head injection. If that
+ * stops being true, this needs to consider fack_counts and TCP_WQ_SACKED.
+ */
+static inline void __tcp_insert_write_queue_before(struct sk_buff *new,
+						   struct sk_buff *skb,
+						   struct sock *sk)
 {
+	BUG_ON(sk->sk_send_head != skb);
+
 	__skb_insert(new, skb->prev, skb, &sk->sk_write_queue);
-	tcp_reset_fack_counts(sk, new);
 	tcp_rb_insert(new, &tcp_sk(sk)->write_queue_rb);
-
-	if (sk->sk_send_head == skb)
-		sk->sk_send_head = new;
+	sk->sk_send_head = new;
 }
 
 static inline void tcp_unlink_write_queue(struct sk_buff *skb, struct sock *sk)
 {
-	__skb_unlink(skb, &sk->sk_write_queue);
-	tcp_rb_unlink(skb, &tcp_sk(sk)->write_queue_rb);
+	int queue = __tcp_skb_queue_select(skb);
+
+	__skb_unlink(skb, __tcp_list_select(sk, queue));
+	tcp_rb_unlink(skb, __tcp_tree_select(sk, queue));
+}
+
+/* Moves skb to the queue part of the skb space. A bit fragile: the call must
+ * be made prior to the (important!) sacked changes (= ->S and &~R).
+ */
+static inline void tcp_write_queue_requeue(struct sk_buff *skb,
+					   struct sock *sk, int queue)
+{
+	struct sk_buff *prev;
+
+	/* FIXME, most of the hints are to be dropped soon... */
+	if (tcp_sk(sk)->scoreboard_skb_hint == skb)
+		tcp_sk(sk)->scoreboard_skb_hint = skb->next;
+	if (tcp_sk(sk)->forward_skb_hint == skb)
+		tcp_sk(sk)->forward_skb_hint = skb->next;
+	/* ...These have related cnt */
+	if (tcp_sk(sk)->lost_skb_hint == skb)
+		tcp_sk(sk)->lost_skb_hint = NULL;
+	if (tcp_sk(sk)->retransmit_skb_hint == skb)
+		tcp_sk(sk)->retransmit_skb_hint = NULL;
+
+	/* S|R must not be in SACKed space because of mark_lost_retrans walk */
+	if ((queue == TCP_WQ_SACKED) &&
+	    (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS))
+		return;
+
+	tcp_unlink_write_queue(skb, sk);
+
+	prev = __tcp_rb_insert(skb, __tcp_tree_select(sk, queue));
+	if (prev == NULL)
+		prev = (struct sk_buff *)__tcp_list_select(sk, queue);
+	__skb_append(prev, skb, __tcp_list_select(sk, queue));
 }
 
 static inline int tcp_skb_is_last(const struct sock *sk,
 				  const struct sk_buff *skb)
 {
+	BUG_ON(__tcp_skb_queue_select(skb) == TCP_WQ_SACKED);
+
 	return skb->next == (struct sk_buff *)&sk->sk_write_queue;
 }
 
@@ -1442,6 +1633,9 @@ static inline u32 tcp_highest_sack_seq(struct tcp_sock *tp)
 	return TCP_SKB_CB(tp->highest_sack)->seq;
 }
 
+/* This is somewhat dangerous now, because the skb must still be in the
+ * non-SACKed space
+ */
 static inline void tcp_advance_highest_sack(struct sock *sk, struct sk_buff *skb)
 {
 	tcp_sk(sk)->highest_sack = tcp_skb_is_last(sk, skb) ? NULL :
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 1187dfa..227977d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1072,7 +1072,7 @@ static void tcp_update_reordering(struct sock *sk, const int metric,
  * the exact amount is rather hard to quantify. However, tp->max_window can
  * be used as an exaggerated estimate.
  */
-static int tcp_is_sackblock_valid(struct tcp_sock *tp, int is_dsack,
+static int tcp_is_sackblock_valid(struct tcp_sock *tp,
 				  u32 start_seq, u32 end_seq)
 {
 	/* Too far in future, or reversed (interpretation is ambiguous) */
@@ -1089,10 +1089,16 @@ static int tcp_is_sackblock_valid(struct tcp_sock *tp, int is_dsack,
 	if (after(start_seq, tp->snd_una))
 		return 1;
 
-	if (!is_dsack || !tp->undo_marker)
+	return 0;
+}
+
+static int tcp_is_past_dsack_useful(struct tcp_sock *tp,
+				    u32 start_seq, u32 end_seq)
+{
+	if (!tp->undo_marker)
 		return 0;
 
-	/* ...Then it's D-SACK, and must reside below snd_una completely */
+	/* ...Past D-SACK must reside below snd_una completely */
 	if (!after(end_seq, tp->snd_una))
 		return 0;
 
@@ -1149,6 +1155,10 @@ static void tcp_mark_lost_retrans(struct sock *sk)
 		    (tcp_is_fack(tp) ||
 		     !before(received_upto,
 			     ack_seq + tp->reordering * tp->mss_cache))) {
+
+			if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
+				tcp_write_queue_requeue(skb, sk, TCP_WQ_SACKED);
+
 			TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;
 			tp->retrans_out -= tcp_skb_pcount(skb);
 
@@ -1262,6 +1272,7 @@ static int tcp_sacktag_one(struct sk_buff *skb, struct sock *sk,
 
 		if (!before(TCP_SKB_CB(skb)->seq, tcp_highest_sack_seq(tp)))
 			tcp_advance_highest_sack(sk, skb);
+		tcp_write_queue_requeue(skb, sk, TCP_WQ_SACKED);
 
 		TCP_SKB_CB(skb)->sacked |= TCPCB_SACKED_ACKED;
 		flag |= FLAG_DATA_SACKED;
@@ -1284,6 +1295,8 @@ static int tcp_sacktag_one(struct sk_buff *skb, struct sock *sk,
 	 * are accounted above as well.
 	 */
 	if (dup_sack && (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_RETRANS)) {
+		tcp_write_queue_requeue(skb, sk, TCP_WQ_SACKED);
+
 		TCP_SKB_CB(skb)->sacked &= ~TCPCB_SACKED_RETRANS;
 		tp->retrans_out -= tcp_skb_pcount(skb);
 		tp->retransmit_skb_hint = NULL;
@@ -1293,17 +1306,17 @@ static int tcp_sacktag_one(struct sk_buff *skb, struct sock *sk,
 }
 
 static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk,
-					struct tcp_sack_block *next_dup,
 					u32 start_seq, u32 end_seq,
-					int dup_sack_in, int *reord, int *flag)
+					int dup_sack, int *reord, int *flag,
+					int queue)
 {
+	struct sk_buff *next;
 	unsigned int fack_count_base;
 
 	fack_count_base = TCP_SKB_CB(tcp_write_queue_head(sk))->fack_count;
 
-	tcp_for_write_queue_from(skb, sk, 0) {
+	tcp_for_write_queue_from_safe(skb, next, sk, queue) {
 		int in_sack = 0;
-		int dup_sack = dup_sack_in;
 		unsigned int fack_count;
 
 		if (skb == tcp_send_head(sk))
@@ -1313,17 +1326,7 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk,
 		if (!before(TCP_SKB_CB(skb)->seq, end_seq))
 			break;
 
-		if ((next_dup != NULL) &&
-		    before(TCP_SKB_CB(skb)->seq, next_dup->end_seq)) {
-			in_sack = tcp_match_skb_to_sack(sk, skb,
-							next_dup->start_seq,
-							next_dup->end_seq);
-			if (in_sack > 0)
-				dup_sack = 1;
-		}
-
-		if (in_sack <= 0)
-			in_sack = tcp_match_skb_to_sack(sk, skb, start_seq, end_seq);
+		in_sack = tcp_match_skb_to_sack(sk, skb, start_seq, end_seq);
 		if (unlikely(in_sack < 0))
 			break;
 
@@ -1338,70 +1341,72 @@ static struct sk_buff *tcp_sacktag_walk(struct sk_buff *skb, struct sock *sk,
 /* Avoid all extra work that is being done by sacktag while walking in
  * a normal way
  */
-static struct sk_buff *tcp_sacktag_skip(struct sk_buff *skb, struct sock *sk,
-					u32 skip_to_seq)
+static struct sk_buff *tcp_sacktag_skip(struct sock *sk, u32 skip_to_seq)
 {
-	tcp_for_write_queue_from(skb, sk, 0) {
-		if (skb == tcp_send_head(sk))
-			break;
-
-		if (!before(TCP_SKB_CB(skb)->end_seq, skip_to_seq))
-			break;
-	}
-	return skb;
-}
-
-static struct sk_buff *tcp_maybe_skipping_dsack(struct sk_buff *skb,
-						struct sock *sk,
-						struct tcp_sack_block *next_dup,
-						u32 skip_to_seq,
-						int *reord,
-						int *flag)
-{
-	if (next_dup == NULL)
-		return skb;
-
-	if (before(next_dup->start_seq, skip_to_seq)) {
-		skb = tcp_sacktag_skip(skb, sk, next_dup->start_seq);
-		tcp_sacktag_walk(skb, sk, NULL,
-				 next_dup->start_seq, next_dup->end_seq,
-				 1, reord, flag);
-	}
+	struct sk_buff *skb;
 
+	skb = tcp_write_queue_find(sk, skip_to_seq, 0);
+	if (skb == tcp_write_queue_head(sk))
+		skb = NULL;
 	return skb;
 }
 
-static int tcp_check_dsack(struct tcp_sock *tp, struct sk_buff *ack_skb,
-			   struct tcp_sack_block_wire *sp, int num_sacks,
-			   u32 prior_snd_una)
+static int tcp_handle_dsack(struct sock *sk, struct sk_buff *ack_skb,
+			    struct tcp_sack_block_wire *sp, u32 *reord,
+			    int num_sacks, u32 prior_snd_una)
 {
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb;
 	u32 start_seq_0 = ntohl(get_unaligned(&sp[0].start_seq));
 	u32 end_seq_0 = ntohl(get_unaligned(&sp[0].end_seq));
-	int dup_sack = 0;
+	int flag = 0;
 
 	if (before(start_seq_0, TCP_SKB_CB(ack_skb)->ack_seq)) {
-		dup_sack = 1;
+		flag |= FLAG_DSACKING_ACK;
 		tcp_dsack_seen(tp);
 		NET_INC_STATS_BH(LINUX_MIB_TCPDSACKRECV);
+
+		if (!tcp_is_past_dsack_useful(tp, start_seq_0, end_seq_0)) {
+			if (!tp->undo_marker)
+				NET_INC_STATS_BH(LINUX_MIB_TCPDSACKIGNOREDNOUNDO);
+			else
+				NET_INC_STATS_BH(LINUX_MIB_TCPDSACKIGNOREDOLD);
+
+			return flag;
+		}
+
+		/* D-SACK for already forgotten data... Do dumb counting. */
+		if (!after(end_seq_0, prior_snd_una))
+			tp->undo_retrans--;
+
 	} else if (num_sacks > 1) {
 		u32 end_seq_1 = ntohl(get_unaligned(&sp[1].end_seq));
 		u32 start_seq_1 = ntohl(get_unaligned(&sp[1].start_seq));
 
 		if (!after(end_seq_0, end_seq_1) &&
 		    !before(start_seq_0, start_seq_1)) {
-			dup_sack = 1;
+			flag |= FLAG_DSACKING_ACK;
 			tcp_dsack_seen(tp);
 			NET_INC_STATS_BH(LINUX_MIB_TCPDSACKOFORECV);
+			if (!tcp_is_sackblock_valid(tp, start_seq_0, end_seq_0)) {
+				/* FIXME, reordering check like in the other place! */
+				NET_INC_STATS_BH(LINUX_MIB_TCPSACKDISCARD);
+				return flag;
+			}
 		}
 	}
 
-	/* D-SACK for already forgotten data... Do dumb counting. */
-	if (dup_sack &&
-	    !after(end_seq_0, prior_snd_una) &&
-	    after(end_seq_0, tp->undo_marker))
-		tp->undo_retrans--;
+	if ((flag & FLAG_DSACKING_ACK) && after(end_seq_0, prior_snd_una)) {
+		skb = tcp_write_queue_find(sk, start_seq_0, TCP_WQ_SACKED);
+		if (skb != NULL)
+			tcp_sacktag_walk(skb, sk, start_seq_0, end_seq_0, 1, reord, &flag, TCP_WQ_SACKED);
+
+		skb = tcp_write_queue_find(sk, start_seq_0, 0);
+		if (skb != NULL)
+			tcp_sacktag_walk(skb, sk, start_seq_0, end_seq_0, 1, reord, &flag, 0);
+	}
 
-	return dup_sack;
+	return flag;
 }
 
 static int tcp_sack_cache_ok(struct tcp_sock *tp, struct tcp_sack_block *cache)
@@ -1424,9 +1429,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 	int used_sacks;
 	int reord = tp->packets_out;
 	int flag = 0;
-	int found_dup_sack = 0;
 	int i, j;
-	int first_sack_index;
 
 	if (!tp->sacked_out) {
 		if (WARN_ON(tp->fackets_out))
@@ -1434,10 +1437,7 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 		tcp_highest_sack_reset(sk);
 	}
 
-	found_dup_sack = tcp_check_dsack(tp, ack_skb, sp_wire,
-					 num_sacks, prior_snd_una);
-	if (found_dup_sack)
-		flag |= FLAG_DSACKING_ACK;
+	flag |= tcp_handle_dsack(sk, ack_skb, sp_wire, &reord, num_sacks, prior_snd_una);
 
 	/* Eliminate too old ACKs, but take into
 	 * account more or less fresh ones, they can
@@ -1450,30 +1450,17 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 		goto out;
 
 	used_sacks = 0;
-	first_sack_index = 0;
-	for (i = 0; i < num_sacks; i++) {
-		int dup_sack = !i && found_dup_sack;
-
+	for (i = (flag & FLAG_DSACKING_ACK) ? 1 : 0; i < num_sacks; i++) {
 		sp[used_sacks].start_seq = ntohl(get_unaligned(&sp_wire[i].start_seq));
 		sp[used_sacks].end_seq = ntohl(get_unaligned(&sp_wire[i].end_seq));
 
-		if (!tcp_is_sackblock_valid(tp, dup_sack,
-					    sp[used_sacks].start_seq,
+		if (!tcp_is_sackblock_valid(tp, sp[used_sacks].start_seq,
 					    sp[used_sacks].end_seq)) {
-			if (dup_sack) {
-				if (!tp->undo_marker)
-					NET_INC_STATS_BH(LINUX_MIB_TCPDSACKIGNOREDNOUNDO);
-				else
-					NET_INC_STATS_BH(LINUX_MIB_TCPDSACKIGNOREDOLD);
-			} else {
-				/* Don't count olds caused by ACK reordering */
-				if ((TCP_SKB_CB(ack_skb)->ack_seq != tp->snd_una) &&
-				    !after(sp[used_sacks].end_seq, tp->snd_una))
-					continue;
-				NET_INC_STATS_BH(LINUX_MIB_TCPSACKDISCARD);
-			}
-			if (i == 0)
-				first_sack_index = -1;
+			/* Don't count olds caused by ACK reordering */
+			if ((TCP_SKB_CB(ack_skb)->ack_seq != tp->snd_una) &&
+			    !after(sp[used_sacks].end_seq, tp->snd_una))
+				continue;
+			NET_INC_STATS_BH(LINUX_MIB_TCPSACKDISCARD);
 			continue;
 		}
 
@@ -1493,10 +1480,6 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 				tmp = sp[j];
 				sp[j] = sp[j+1];
 				sp[j+1] = tmp;
-
-				/* Track where the first SACK block goes to */
-				if (j == first_sack_index)
-					first_sack_index = j+1;
 			}
 		}
 	}
@@ -1518,11 +1501,6 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 	while (i < used_sacks) {
 		u32 start_seq = sp[i].start_seq;
 		u32 end_seq = sp[i].end_seq;
-		int dup_sack = (found_dup_sack && (i == first_sack_index));
-		struct tcp_sack_block *next_dup = NULL;
-
-		if (found_dup_sack && ((i + 1) == first_sack_index))
-			next_dup = &sp[i + 1];
 
 		/* Event "B" in the comment above. */
 		if (after(end_seq, tp->high_seq))
@@ -1534,24 +1512,23 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 			cache++;
 
 		/* Can skip some work by looking recv_sack_cache? */
-		if (tcp_sack_cache_ok(tp, cache) && !dup_sack &&
+		if (tcp_sack_cache_ok(tp, cache) &&
 		    after(end_seq, cache->start_seq)) {
 
 			/* Head todo? */
 			if (before(start_seq, cache->start_seq)) {
-				skb = tcp_sacktag_skip(skb, sk, start_seq);
-				skb = tcp_sacktag_walk(skb, sk, next_dup, start_seq,
-						       cache->start_seq, dup_sack,
-						       &reord, &flag);
+				skb = tcp_sacktag_skip(sk, start_seq);
+				if (skb == NULL)
+					break;
+				skb = tcp_sacktag_walk(skb, sk, start_seq,
+						       cache->start_seq, 0,
+						       &reord, &flag, 0);
 			}
 
 			/* Rest of the block already fully processed? */
 			if (!after(end_seq, cache->end_seq))
 				goto advance_sp;
 
-			skb = tcp_maybe_skipping_dsack(skb, sk, next_dup, cache->end_seq,
-						       &reord, &flag);
-
 			/* ...tail remains todo... */
 			if (tcp_highest_sack_seq(tp) == cache->end_seq) {
 				/* ...but better entrypoint exists! */
@@ -1562,7 +1539,9 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 				goto walk;
 			}
 
-			skb = tcp_sacktag_skip(skb, sk, cache->end_seq);
+			skb = tcp_sacktag_skip(sk, cache->end_seq);
+			if (skb == NULL)
+				break;
 			/* Check overlap against next cached too (past this one already) */
 			cache++;
 			continue;
@@ -1573,11 +1552,13 @@ tcp_sacktag_write_queue(struct sock *sk, struct sk_buff *ack_skb, u32 prior_snd_
 			if (skb == NULL)
 				break;
 		}
-		skb = tcp_sacktag_skip(skb, sk, start_seq);
+		skb = tcp_sacktag_skip(sk, start_seq);
+		if (skb == NULL)
+			break;
 
 walk:
-		skb = tcp_sacktag_walk(skb, sk, next_dup, start_seq, end_seq,
-				       dup_sack, &reord, &flag);
+		skb = tcp_sacktag_walk(skb, sk, start_seq, end_seq,
+				       0, &reord, &flag, 0);
 
 advance_sp:
 		/* SACK enhanced FRTO (RFC4138, Appendix B): Clearing correct
@@ -1674,6 +1655,7 @@ int tcp_use_frto(struct sock *sk)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
+	struct sk_buff *notsacked;	/* Or S|R => deny basic F-RTO */
 
 	if (!sysctl_tcp_frto)
 		return 0;
@@ -1685,15 +1667,19 @@ int tcp_use_frto(struct sock *sk)
 	if (tp->retrans_out > 1)
 		return 0;
 
-	skb = tcp_write_queue_head(sk);
-	skb = tcp_write_queue_next(sk, skb);	/* Skips head */
-	tcp_for_write_queue_from(skb, sk, 0) {
-		if (skb == tcp_send_head(sk))
-			break;
+	notsacked = tcp_write_queue_head(sk);
+	/* Not interested in the head skb here because F-RTO is re-entrant if
+	 * only the head skb has been retransmitted (the multiple-RTOs case)
+	 */
+	notsacked = tcp_write_queue_next(sk, notsacked);
+	if ((notsacked != NULL) && TCP_SKB_CB(notsacked)->sacked & TCPCB_RETRANS)
+		return 0;
+
+	tcp_for_write_queue(skb, sk, TCP_WQ_SACKED) {
 		if (TCP_SKB_CB(skb)->sacked&TCPCB_RETRANS)
 			return 0;
-		/* Short-circuit when first non-SACKed skb has been checked */
-		if (!(TCP_SKB_CB(skb)->sacked&TCPCB_SACKED_ACKED))
+		/* Short-circuit when past first non-SACKed skb */
+		if (after(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(notsacked)->seq))
 			break;
 	}
 	return 1;
@@ -1892,6 +1878,13 @@ void tcp_enter_loss(struct sock *sk, int how)
 		tp->sacked_out = 0;
 		tp->fackets_out = 0;
 		tcp_clear_all_retrans_hints(tp);
+
+		tcp_for_write_queue(skb, sk, TCP_WQ_SACKED) {
+			/* FIXME, this could be optimized by avoiding tree
+			 * deletes
+			 */
+			tcp_write_queue_requeue(skb, sk, 0);
+		}
 	}
 
 	tcp_for_write_queue(skb, sk, 0) {
@@ -2135,6 +2128,10 @@ static void tcp_mark_head_lost(struct sock *sk, int packets, int fast_rexmit)
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
 	int cnt;
+	unsigned int fc;
+	unsigned int fack_count_base;
+
+	fack_count_base = TCP_SKB_CB(tcp_write_queue_head(sk))->fack_count;
 
 	BUG_TRAP(packets <= tp->packets_out);
 	if (tp->lost_skb_hint) {
@@ -2153,9 +2150,18 @@ static void tcp_mark_head_lost(struct sock *sk, int packets, int fast_rexmit)
 		tp->lost_skb_hint = skb;
 		tp->lost_cnt_hint = cnt;
 
-		if (tcp_is_fack(tp) ||
-		    (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED))
-			cnt += tcp_skb_pcount(skb);
+		fc = TCP_SKB_CB(skb)->fack_count;
+		if (tcp_is_fack(tp)) {
+			cnt = fc - fack_count_base + tcp_skb_pcount(skb);
+		} else {
+			if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
+				cnt += tcp_skb_pcount(skb);
+			/* Add SACK blocks between this and skb->prev */
+			if ((skb != tcp_write_queue_head(sk)) &&
+			    !tcp_skb_adjacent(sk, skb->prev, skb))
+				cnt += fc - TCP_SKB_CB(skb->prev)->fack_count -
+						tcp_skb_pcount(skb->prev);
+		}
 
 		if (((!fast_rexmit || (tp->lost_out > 0)) && (cnt > packets)) ||
 		     after(TCP_SKB_CB(skb)->end_seq, tp->high_seq))
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index b81284a..1d83c65 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1208,7 +1208,7 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
 
 	/* Link BUFF into the send queue. */
 	skb_header_release(buff);
-	tcp_insert_write_queue_after(skb, buff, sk);
+	__tcp_insert_write_queue_after(skb, buff, sk, 0);
 
 	return 0;
 }
@@ -1345,7 +1345,7 @@ static int tcp_mtu_probe(struct sock *sk)
 	nskb->csum = 0;
 	nskb->ip_summed = skb->ip_summed;
 
-	tcp_insert_write_queue_before(nskb, skb, sk);
+	__tcp_insert_write_queue_before(nskb, skb, sk);
 
 	len = 0;
 	tcp_for_write_queue_from_safe(skb, next, sk, 0) {
-- 
1.5.0.6
