Message-ID: <396556a20805301217k293e5718h6bbf02bfe0683144@europa>
Date:	Tue, 17 Jun 2008 19:06:51 -0700
From:	"Adam Langley" <agl@...erialviolet.org>
To:	davem@...emloft.net
Cc:	netdev@...r.kernel.org
Subject: Comments requested: Long options and MD5 options

What follows is an incomplete patch to add long options[1] (it's not signed
off, it's not tested, it might compile). I'm looking for comments on a couple
of aspects.

There are currently a couple of bugs in the MD5 code. Firstly, we will try to
put SACK blocks in alongside MD5 signatures and timestamps, overflow the
header size and produce corrupt packets. DaveM has suggested[2] that
timestamps should be dropped for MD5 packets since SACK is more important. So
here's the first suggestion:

Since we have a population of hosts which will produce corrupt packets if we
ask them to use SACK with MD5, in the case that we see a SYN packet with MD5 +
SACK + TS, we assume that it's from one of these hosts and reply with a SYNACK
with only MD5 + TS. That will stop them from sending SACKs and corrupting their
packets.
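
To spell out the arithmetic, using the aligned option lengths from tcp.h:

	40 (option space) - 20 (MD5) - 12 (timestamps) = 8 bytes left,
	while one SACK block needs 4 (aligned SACK header) + 8 = 12 bytes.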

When we send SYNs with MD5 we send MD5 + SACK and we reply to these with MD5 +
SACK. This means that we get SACKs, rather than timestamps on MD5 signed
connections and older hosts are much less likely to corrupt their packets
because they have space for some SACK blocks.

Second suggestion:

The current state of option processing is that the logic is duplicated in
several places, mostly because functions try to calculate the size of the
options before calling a separate function to actually write them. Any
mismatch between the two (as we had with MD5 + SACK) is bad.

So I suggest pulling the logic into a single function (or rather, one for
SYN/SYNACK and one for normal packets). Now the question is what this function
should look like. It's possible to write code like:

if (some_option) {
	size += 4;
	if (ptr) {
		*ptr++ = htonl(....);
	}
}

However, all those 'if (ptr)' guards are ugly and I'm worried that all the
extra branches will be slow. So, in the patch below (see tcp_build_options) I do

if (some_option) {
	*ptr = htonl(...);
	ptr += n;
	size += 4;
}

In the case that we want to calculate the size of the options first, n = 0 and
ptr is pointing at a word on the stack. Otherwise, ptr is pointing into the skb
and n = 1.
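
As a stand-alone sketch of the pattern (a hypothetical helper, timestamps
only, assuming the usual definitions from net/tcp.h):

static unsigned build_opts(__be32 *ptr, int want_ts, u32 tsval, u32 tsecr)
{
	__be32 scratch;			/* scratch word for the sizing pass */
	const unsigned n = ptr ? 1 : 0;	/* only advance when really writing */
	unsigned size = 0;

	if (!ptr)
		ptr = &scratch;

	if (want_ts) {
		*ptr = htonl((TCPOPT_NOP << 24) | (TCPOPT_NOP << 16) |
			     (TCPOPT_TIMESTAMP << 8) | TCPOLEN_TIMESTAMP);
		ptr += n;
		*ptr = htonl(tsval);
		ptr += n;
		*ptr = htonl(tsecr);
		ptr += n;
		size += TCPOLEN_TSTAMP_ALIGNED;
	}

	return size;
}

Call it once with ptr == NULL to get the size, then again with a real buffer
(and the first call's return value, in the patch's case) to actually write.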

How do people like that?

Third:

The patch currently adds long options as well. If I'm also fixing MD5 bugs, I
should split the consolidation of the option logic into its own patch, right?

I happen to want long options for something I'm working on, but they'll also
allow for more SACK blocks (especially with MD5) in the normal course of
things. More SACK blocks have been found to be good for throughput[3].
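
To put rough numbers on that, given the clamp in tcp_data_queue and the
aligned lengths in tcp.h:

	with timestamps:    12 + 4 + 3*8 = 40 bytes -> at most 3 SACK blocks
	without timestamps:      4 + 4*8 = 36 bytes -> at most 4

The patch raises that clamp by five blocks (the '5 * tp->rx_opt.long_options'
term) once long options have been negotiated.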

It's based on [1], with a couple of small changes:
   * LO options can be anywhere in the doff options space
   * LO options aren't included in packets which don't need them. This means
     that the fast path processing still works.
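
For reference, the LO option as built and parsed in the patch below (using an
experimental option number for now):

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|   Kind = 253  |  Length = 4   |  Header length (32-bit words) |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

When the options overflow the normal 40 byte space the LO option comes first,
doff covers just the fixed header plus this option, and a receiver that
understands LO parses options out to the advertised length instead. In a SYN
or SYNACK that fits within 40 bytes it is simply appended to advertise support.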

I've contacted the draft's author about this and he's pretty happy. If I want
to get non-experimental TCP option numbers assigned I'll probably have to take
over the draft and push it forward.

If you made it this far, thanks for reading.

[1] http://tools.ietf.org/html/draft-eddy-tcp-loo-03
[2] http://marc.info/?l=linux-netdev&m=121374770909560&w=2
[3] Srijith, K., Jacob, L., and A. Ananda, "Worst-case Performance
    Limitation of TCP SACK and a Feasible Solution", Proceedings of
    8th IEEE International Conference on Communications Systems
    (ICCS), November 2002

(Note: the patch below changes send_check's type. I've fixed DCCP so far, but
haven't got round to any of the others.)

---

 include/linux/tcp.h                |    6 
 include/net/inet_connection_sock.h |    1 
 include/net/inet_sock.h            |    3 
 include/net/tcp.h                  |   17 +
 net/dccp/ipv4.c                    |    2 
 net/dccp/output.c                  |    2 
 net/ipv4/Kconfig                   |   10 +
 net/ipv4/tcp_input.c               |   34 +++
 net/ipv4/tcp_ipv4.c                |    8 -
 net/ipv4/tcp_output.c              |  466 ++++++++++++++++++++++--------------
 10 files changed, 361 insertions(+), 188 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 18e62e3..2af2b59 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -222,6 +222,12 @@ struct tcp_options_received {
 	u8	num_sacks;	/* Number of SACK blocks		*/
 	u16	user_mss;  	/* mss requested by user in ioctl */
 	u16	mss_clamp;	/* Maximal mss, negotiated at connection setup */
+#ifdef CONFIG_TCP_LONG_OPTIONS
+	u8 long_options : 1;	/* Was a LO option seen?		*/
+	u16 lo_header_length;	/* Header length of the current packet, in
+				   4-byte words. Valid whether or not an
+				   LO option was seen */
+#endif
 };
 
 struct tcp_request_sock {
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 2ff545a..d6a287c 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -38,6 +38,7 @@ struct tcp_congestion_ops;
 struct inet_connection_sock_af_ops {
 	int	    (*queue_xmit)(struct sk_buff *skb, int ipfragok);
 	void	    (*send_check)(struct sock *sk, int len,
+				  int header_len,
 				  struct sk_buff *skb);
 	int	    (*rebuild_header)(struct sock *sk);
 	int	    (*conn_request)(struct sock *sk, struct sk_buff *skb);
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index a42cd63..a9d28ad 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -72,7 +72,8 @@ struct inet_request_sock {
 				sack_ok	   : 1,
 				wscale_ok  : 1,
 				ecn_ok	   : 1,
-				acked	   : 1;
+				acked	   : 1,
+				long_options : 1;
 	struct ip_options	*opt;
 };
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 633147c..738a3d7 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -154,6 +154,9 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 					 * timestamps. It must be less than
 					 * minimal timewait lifetime.
 					 */
+
+#define TCP_MAX_OPTION_SPACE	40	/* Max bytes of options */
+
 /*
  *	TCP option
  */
@@ -166,6 +169,9 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOPT_SACK             5       /* SACK Block */
 #define TCPOPT_TIMESTAMP	8	/* Better RTT estimations/PAWS */
 #define TCPOPT_MD5SIG		19	/* MD5 Signature (RFC2385) */
+/* These two are temporarily taking the experimental option numbers */
+#define TCPOPT_LONG_OPTS	253	/* Large options */
+#define TCPOPT_SYN_LONG_OPTS	254	/* Delayed SYN options */
 
 /*
  *     TCP option lengths
@@ -176,6 +182,8 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOLEN_SACK_PERM      2
 #define TCPOLEN_TIMESTAMP      10
 #define TCPOLEN_MD5SIG         18
+#define TCPOLEN_LONG_OPTS      4
+#define TCPOLEN_SYN_LONG_OPTS  4
 
 /* But this is what stacks really send out. */
 #define TCPOLEN_TSTAMP_ALIGNED		12
@@ -185,6 +193,8 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #define TCPOLEN_SACK_BASE_ALIGNED	4
 #define TCPOLEN_SACK_PERBLOCK		8
 #define TCPOLEN_MD5SIG_ALIGNED		20
+#define TCPOLEN_LONG_OPTS_ALIGNED	4
+#define TCPOLEN_SYN_LONG_OPTS_ALIGNED	4
 
 /* Flags in tp->nonagle */
 #define TCP_NAGLE_OFF		1	/* Nagle's algo is disabled */
@@ -334,6 +344,9 @@ extern void tcp_enter_quickack_mode(struct sock *sk);
 static inline void tcp_clear_options(struct tcp_options_received *rx_opt)
 {
  	rx_opt->tstamp_ok = rx_opt->sack_ok = rx_opt->wscale_ok = rx_opt->snd_wscale = 0;
+#ifdef CONFIG_TCP_LONG_OPTIONS
+        rx_opt->long_options = 0;
+#endif
 }
 
 #define	TCP_ECN_OK		1
@@ -404,6 +417,7 @@ extern void			tcp_parse_options(struct sk_buff *skb,
  */
 
 extern void		       	tcp_v4_send_check(struct sock *sk, int len,
+						  int header_len,
 						  struct sk_buff *skb);
 
 extern int			tcp_v4_conn_request(struct sock *sk,
@@ -975,6 +989,9 @@ static inline void tcp_openreq_init(struct request_sock *req,
 	ireq->acked = 0;
 	ireq->ecn_ok = 0;
 	ireq->rmt_port = tcp_hdr(skb)->source;
+#ifdef CONFIG_TCP_LONG_OPTIONS
+        ireq->long_options = rx_opt->long_options;
+#endif
 }
 
 extern void tcp_enter_memory_pressure(void);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index c22a378..3d3e0ba 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -344,7 +344,7 @@ static inline __sum16 dccp_v4_csum_finish(struct sk_buff *skb,
 	return csum_tcpudp_magic(src, dst, skb->len, IPPROTO_DCCP, skb->csum);
 }
 
-void dccp_v4_send_check(struct sock *sk, int unused, struct sk_buff *skb)
+void dccp_v4_send_check(struct sock *sk, int unused, int unused2, struct sk_buff *skb)
 {
 	const struct inet_sock *inet = inet_sk(sk);
 	struct dccp_hdr *dh = dccp_hdr(skb);
diff --git a/net/dccp/output.c b/net/dccp/output.c
index 1f8a9b6..718db34 100644
--- a/net/dccp/output.c
+++ b/net/dccp/output.c
@@ -119,7 +119,7 @@ static int dccp_transmit_skb(struct sock *sk, struct sk_buff *skb)
 			break;
 		}
 
-		icsk->icsk_af_ops->send_check(sk, 0, skb);
+		icsk->icsk_af_ops->send_check(sk, 0, 0, skb);
 
 		if (set_ack)
 			dccp_event_ack_sent(sk);
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 4670683..c8a4a90 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -632,5 +632,15 @@ config TCP_MD5SIG
 
 	  If unsure, say N.
 
+config TCP_LONG_OPTIONS
+	bool "TCP: Long options support (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	---help---
+	  This enables support for oversized TCP options, as detailed in
+	  draft-eddy-tcp-loo-03. Long options might be required for future TCP
+	  extensions and currently allow for additional SACK blocks (which is
+	  known to help throughput). Also, for MD5-signed packets, no SACK
+	  blocks can be included at all without this.
+
 source "net/ipv4/ipvs/Kconfig"
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b54d9d3..0086b3f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3329,6 +3329,10 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
 	struct tcphdr *th = tcp_hdr(skb);
 	int length = (th->doff * 4) - sizeof(struct tcphdr);
 
+#ifdef CONFIG_TCP_LONG_OPTIONS
+        opt_rx->lo_header_length = th->doff;
+#endif
+
 	ptr = (unsigned char *)(th + 1);
 	opt_rx->saw_tstamp = 0;
 
@@ -3407,6 +3411,19 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
 				 */
 				break;
 #endif
+#ifdef CONFIG_TCP_LONG_OPTIONS
+			case TCPOPT_LONG_OPTS:
+				if (opsize == TCPOLEN_LONG_OPTS) {
+					u16 a = get_unaligned_be16(ptr);
+					if (a >= th->doff &&
+					    a << 2 <= skb->len) {
+						  unsigned delta = (a - th->doff) * 4;
+						  length += delta;
+						  TCP_SKB_CB(skb)->end_seq -= delta;
+						  opt_rx->lo_header_length = a;
+					}
+				}
+#endif
 			}
 
 			ptr += opsize-2;
@@ -3421,6 +3438,9 @@ void tcp_parse_options(struct sk_buff *skb, struct tcp_options_received *opt_rx,
 static int tcp_fast_parse_options(struct sk_buff *skb, struct tcphdr *th,
 				  struct tcp_sock *tp)
 {
+#ifdef CONFIG_TCP_LONG_OPTIONS
+	tp->rx_opt.lo_header_length = th->doff;
+#endif
 	if (th->doff == sizeof(struct tcphdr) >> 2) {
 		tp->rx_opt.saw_tstamp = 0;
 		return 0;
@@ -3896,13 +3916,20 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
 		goto drop;
 
+#ifdef CONFIG_TCP_LONG_OPTIONS
+	__skb_pull(skb, tp->rx_opt.lo_header_length * 4);
+#else
 	__skb_pull(skb, th->doff * 4);
+#endif
 
 	TCP_ECN_accept_cwr(tp, skb);
 
 	if (tp->rx_opt.dsack) {
 		tp->rx_opt.dsack = 0;
 		tp->rx_opt.eff_sacks = min_t(unsigned int, tp->rx_opt.num_sacks,
+#ifdef CONFIG_TCP_LONG_OPTIONS
+					     5 * tp->rx_opt.long_options +
+#endif
 					     4 - tp->rx_opt.tstamp_ok);
 	}
 
@@ -4517,7 +4544,12 @@ static void tcp_urg(struct sock *sk, struct sk_buff *skb, struct tcphdr *th)
 
 	/* Do we wait for any urgent data? - normally not... */
 	if (tp->urg_data == TCP_URG_NOTYET) {
-		u32 ptr = tp->urg_seq - ntohl(th->seq) + (th->doff * 4) -
+#ifdef CONFIG_TCP_LONG_OPTIONS
+		const u32 data_off = tp->rx_opt.lo_header_length * 4;
+#else
+		const u32 data_off = th->doff * 4;
+#endif
+		u32 ptr = tp->urg_seq - ntohl(th->seq) + data_off -
 			  th->syn;
 
 		/* Is the urgent pointer pointing into this packet? */
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index cd601a8..a6eaf6b 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -88,7 +88,8 @@ int sysctl_tcp_low_latency __read_mostly;
 /* Check TCP sequence numbers in ICMP packets. */
 #define ICMP_MIN_LENGTH 8
 
-void tcp_v4_send_check(struct sock *sk, int len, struct sk_buff *skb);
+void tcp_v4_send_check(struct sock *sk, int len, int header_len,
+                       struct sk_buff *skb);
 
 #ifdef CONFIG_TCP_MD5SIG
 static struct tcp_md5sig_key *tcp_v4_md5_do_lookup(struct sock *sk,
@@ -481,7 +482,8 @@ out:
 }
 
 /* This routine computes an IPv4 TCP checksum. */
-void tcp_v4_send_check(struct sock *sk, int len, struct sk_buff *skb)
+void tcp_v4_send_check(struct sock *sk, int len, int header_len,
+                       struct sk_buff *skb)
 {
 	struct inet_sock *inet = inet_sk(sk);
 	struct tcphdr *th = tcp_hdr(skb);
@@ -494,7 +496,7 @@ void tcp_v4_send_check(struct sock *sk, int len, struct sk_buff *skb)
 	} else {
 		th->check = tcp_v4_check(len, inet->saddr, inet->daddr,
 					 csum_partial((char *)th,
-						      th->doff << 2,
+						      header_len,
 						      skb->csum));
 	}
 }
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e399bde..1c7d39a 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -61,6 +61,10 @@ int sysctl_tcp_base_mss __read_mostly = 512;
 /* By default, RFC2861 behavior.  */
 int sysctl_tcp_slow_start_after_idle __read_mostly = 1;
 
+#ifdef CONFIG_TCP_LONG_OPTIONS
+int sysctl_tcp_long_options __read_mostly = 1;
+#endif
+
 static void tcp_event_new_data_sent(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -347,112 +351,218 @@ static void tcp_init_nondata_skb(struct sk_buff *skb, u32 seq, u8 flags)
 	TCP_SKB_CB(skb)->end_seq = seq;
 }
 
-static void tcp_build_and_update_options(__be32 *ptr, struct tcp_sock *tp,
-					 __u32 tstamp, __u8 **md5_hash)
-{
+/* Build, or calculate the size of the TCP options for a SYN or SYNACK.
+ *   ptr: (maybe NULL) Location to write the options
+ *   synack: if non-zero, this is a SYNACK, otherwise a SYN.
+ *   md5_hash: If non-NULL, space is left for an MD5 signature and its
+ *     location is written back here
+ *   prev_written: the return value of calling this function, immediately
+ *     prior. This is used to know if long options are needed. First time
+ *     round, pass 0 here. */
+static unsigned tcp_build_syn_options(__be32 *ptr,
+				      char synack, int mss, int ts, int sack,
+				      int longopts, int offer_wscale,
+				      int wscale, __u32 tstamp, __u32 ts_recent,
+				      __u8 **md5_hash, unsigned prev_written) {
+	/* Without long options, no SACK blocks can fit in a packet with
+	 * timestamps and MD5 signatures. Older Linux kernels have a bug where
+	 * they try anyway and corrupt the packets. We want to avoid triggering
+	 * this, so if we see a SYN packet with MD5 + SACK + TS, we assume it's
+	 * an old kernel and reply with MD5 + TS. However, SACK is more
+	 * important than TS so we send SYNs with MD5 + SACK. If the remote end
+	 * supports long options we can enable TS later */
+	const char doing_sack = sack && (!synack || !(md5_hash && sack && ts));
+	const char doing_ts = ts && !(md5_hash && doing_sack);
+
+	unsigned written = 0;
+	__be32 temp;
+	const unsigned n = !ptr ? 0 : 1;
+	BUG_ON(prev_written == 0 && ptr);
+	if (!ptr)
+		ptr = &temp;
+
+	if (prev_written >= TCP_MAX_OPTION_SPACE) {
+		BUG_ON(!longopts);
+		BUG_ON(!synack);
+		*ptr = htonl((TCPOPT_LONG_OPTS << 24) |
+			     (TCPOLEN_LONG_OPTS << 16) |
+			     (sizeof(struct tcphdr) + prev_written) >> 2);
+		ptr += n;
+		written += 4;
+	}
+
+	if (md5_hash) {
+		*ptr = htonl((TCPOPT_NOP << 24) |
+			     (TCPOPT_NOP << 16) |
+			     (TCPOPT_MD5SIG << 8) |
+			     TCPOLEN_MD5SIG);
+		ptr += n;
+		*md5_hash = (__u8 *) ptr;
+		ptr += n * 4;
+		written += 4 + 16;
+	}
+
+	*ptr = htonl((TCPOPT_MSS << 24) |
+		     (TCPOLEN_MSS << 16) |
+		     mss);
+	ptr += n;
+	written += 4;
+
+	if (doing_ts) {
+		if (doing_sack) {
+			*ptr = htonl((TCPOPT_SACK_PERM << 24) |
+				(TCPOLEN_SACK_PERM << 16) |
+				(TCPOPT_TIMESTAMP << 8) |
+				TCPOLEN_TIMESTAMP);
+			ptr += n;
+		} else {
+			*ptr = htonl((TCPOPT_NOP << 24) |
+				(TCPOPT_NOP << 16) |
+				(TCPOPT_TIMESTAMP << 8) |
+				TCPOLEN_TIMESTAMP);
+			ptr += n;
+		}
+
+		*ptr = htonl(tstamp);
+		ptr += n;
+		*ptr = htonl(ts_recent);
+		ptr += n;
+		written += 4 * 3;
+	} else if (doing_sack) {
+		*ptr = htonl((TCPOPT_NOP << 24) |
+			     (TCPOPT_NOP << 16) |
+			     (TCPOPT_SACK_PERM << 8) |
+			     (TCPOLEN_SACK_PERM));
+		ptr += n;
+		written += 4;
+	}
+	if (offer_wscale) {
+		*ptr = htonl((TCPOPT_NOP << 24) |
+			     (TCPOPT_WINDOW << 16) |
+			     (TCPOLEN_WINDOW << 8) |
+			     wscale);
+		ptr += n;
+		written += 4;
+	}
+
+	if (prev_written < TCP_MAX_OPTION_SPACE &&
+	    longopts && written <= TCP_MAX_OPTION_SPACE - 4) {
+	      written += 4;
+	      *ptr = htonl((TCPOPT_LONG_OPTS << 24) |
+			   (TCPOLEN_LONG_OPTS << 16) |
+			   ((written + sizeof(struct tcphdr)) >> 2));
+	      ptr += n;
+	}
+
+	return written;
+}
+
+/* Build, or calculate the size of the TCP options.
+ *   ptr: (maybe NULL) Location to write the options
+ *   md5_hash: If non-NULL, space is left for an MD5 signature and its
+ *     location is written back here
+ *   prev_written: the return value of calling this function, immediately
+ *     prior. This is used to know if long options are needed. First time
+ *     round, pass 0 here. */
+static unsigned tcp_build_options(__be32 *ptr, const struct sock *sk,
+				  __u32 tstamp,
+				  __u8 **md5_hash, unsigned prev_written) {
+	struct tcp_sock *tp = tcp_sk(sk);
+	const unsigned sack_bytes =
+		TCPOLEN_SACK_BASE + (tp->rx_opt.eff_sacks * TCPOLEN_SACK_PERBLOCK);
+	unsigned written = 0;
+#ifdef CONFIG_TCP_LONG_OPTIONS
+	const char long_options = tp->rx_opt.long_options;
+#else
+	const char long_options = 0;
+#endif
+	__be32 temp;
+	const unsigned n = !ptr ? 0 : 1;
+	BUG_ON(prev_written == 0 && ptr);
+	if (!ptr)
+		ptr = &temp;
+
+	if (prev_written > TCP_MAX_OPTION_SPACE) {
+		BUG_ON(!long_options);
+		/* This only occurs when long options are possible. In which
+		 * case we need to write out the option now */
+		*ptr = htonl((TCPOPT_LONG_OPTS << 24) |
+			     (TCPOLEN_LONG_OPTS << 16) |
+			     (sizeof(struct tcphdr) + prev_written) >> 2);
+		ptr += n;
+		written += 4;
+	}
+
+	if (md5_hash) {
+		*ptr = htonl((TCPOPT_NOP << 24) |
+			     (TCPOPT_NOP << 16) |
+			     (TCPOPT_MD5SIG << 8) |
+			     TCPOLEN_MD5SIG);
+		ptr += n;
+		*md5_hash = (__u8 *) ptr;
+		ptr += n * 4;
+		written += 4 + 16;
+	}
+
 	if (tp->rx_opt.tstamp_ok) {
-		*ptr++ = htonl((TCPOPT_NOP << 24) |
-			       (TCPOPT_NOP << 16) |
-			       (TCPOPT_TIMESTAMP << 8) |
-			       TCPOLEN_TIMESTAMP);
-		*ptr++ = htonl(tstamp);
-		*ptr++ = htonl(tp->rx_opt.ts_recent);
-	}
-	if (tp->rx_opt.eff_sacks) {
-		struct tcp_sack_block *sp = tp->rx_opt.dsack ? tp->duplicate_sack : tp->selective_acks;
+		*ptr = htonl((TCPOPT_NOP << 24) |
+			     (TCPOPT_NOP << 16) |
+			     (TCPOPT_TIMESTAMP << 8) |
+			     TCPOLEN_TIMESTAMP);
+		ptr += n;
+		*ptr = htonl(tstamp);
+		ptr += n;
+		*ptr = htonl(tp->rx_opt.ts_recent);
+		ptr += n;
+
+		written += 4 * 3;
+	}
+	
+	if (tp->rx_opt.eff_sacks &&
+	    (long_options || TCP_MAX_OPTION_SPACE - written >= sack_bytes)) {
+		const struct tcp_sack_block *sp =
+		  tp->rx_opt.dsack ? tp->duplicate_sack : tp->selective_acks;
 		int this_sack;
 
-		*ptr++ = htonl((TCPOPT_NOP  << 24) |
-			       (TCPOPT_NOP  << 16) |
-			       (TCPOPT_SACK <<  8) |
-			       (TCPOLEN_SACK_BASE + (tp->rx_opt.eff_sacks *
-						     TCPOLEN_SACK_PERBLOCK)));
+		*ptr = htonl((TCPOPT_NOP  << 24) |
+			     (TCPOPT_NOP  << 16) |
+			     (TCPOPT_SACK <<  8) |
+			     sack_bytes);
+		ptr += n;
 
 		for (this_sack = 0; this_sack < tp->rx_opt.eff_sacks; this_sack++) {
-			*ptr++ = htonl(sp[this_sack].start_seq);
-			*ptr++ = htonl(sp[this_sack].end_seq);
+			*ptr = htonl(sp[this_sack].start_seq);
+			ptr += n;
+			*ptr = htonl(sp[this_sack].end_seq);
+			ptr += n;
 		}
 
-		if (tp->rx_opt.dsack) {
+		if (prev_written && tp->rx_opt.dsack) {
 			tp->rx_opt.dsack = 0;
 			tp->rx_opt.eff_sacks--;
 		}
+
+		written += sack_bytes;
 	}
-#ifdef CONFIG_TCP_MD5SIG
-	if (md5_hash) {
-		*ptr++ = htonl((TCPOPT_NOP << 24) |
-			       (TCPOPT_NOP << 16) |
-			       (TCPOPT_MD5SIG << 8) |
-			       TCPOLEN_MD5SIG);
-		*md5_hash = (__u8 *)ptr;
+
+	if (prev_written == 0 && written > TCP_MAX_OPTION_SPACE) {
+		/* We are calculating the size of the options space, long
+		 * options are possible and the size must cover everything we
+		 * will send, so account for the LO option as well */
+		written += 4;
 	}
-#endif
+
+	return written;
 }
 
-/* Construct a tcp options header for a SYN or SYN_ACK packet.
- * If this is every changed make sure to change the definition of
- * MAX_SYN_SIZE to match the new maximum number of options that you
- * can generate.
- *
- * Note - that with the RFC2385 TCP option, we make room for the
- * 16 byte MD5 hash. This will be filled in later, so the pointer for the
- * location to be filled is passed back up.
- */
-static void tcp_syn_build_options(__be32 *ptr, int mss, int ts, int sack,
-				  int offer_wscale, int wscale, __u32 tstamp,
-				  __u32 ts_recent, __u8 **md5_hash)
-{
-	/* We always get an MSS option.
-	 * The option bytes which will be seen in normal data
-	 * packets should timestamps be used, must be in the MSS
-	 * advertised.  But we subtract them from tp->mss_cache so
-	 * that calculations in tcp_sendmsg are simpler etc.
-	 * So account for this fact here if necessary.  If we
-	 * don't do this correctly, as a receiver we won't
-	 * recognize data packets as being full sized when we
-	 * should, and thus we won't abide by the delayed ACK
-	 * rules correctly.
-	 * SACKs don't matter, we never delay an ACK when we
-	 * have any of those going out.
-	 */
-	*ptr++ = htonl((TCPOPT_MSS << 24) | (TCPOLEN_MSS << 16) | mss);
-	if (ts) {
-		if (sack)
-			*ptr++ = htonl((TCPOPT_SACK_PERM << 24) |
-				       (TCPOLEN_SACK_PERM << 16) |
-				       (TCPOPT_TIMESTAMP << 8) |
-				       TCPOLEN_TIMESTAMP);
-		else
-			*ptr++ = htonl((TCPOPT_NOP << 24) |
-				       (TCPOPT_NOP << 16) |
-				       (TCPOPT_TIMESTAMP << 8) |
-				       TCPOLEN_TIMESTAMP);
-		*ptr++ = htonl(tstamp);		/* TSVAL */
-		*ptr++ = htonl(ts_recent);	/* TSECR */
-	} else if (sack)
-		*ptr++ = htonl((TCPOPT_NOP << 24) |
-			       (TCPOPT_NOP << 16) |
-			       (TCPOPT_SACK_PERM << 8) |
-			       TCPOLEN_SACK_PERM);
-	if (offer_wscale)
-		*ptr++ = htonl((TCPOPT_NOP << 24) |
-			       (TCPOPT_WINDOW << 16) |
-			       (TCPOLEN_WINDOW << 8) |
-			       (wscale));
-#ifdef CONFIG_TCP_MD5SIG
-	/*
-	 * If MD5 is enabled, then we set the option, and include the size
-	 * (always 18). The actual MD5 hash is added just before the
-	 * packet is sent.
-	 */
-	if (md5_hash) {
-		*ptr++ = htonl((TCPOPT_NOP << 24) |
-			       (TCPOPT_NOP << 16) |
-			       (TCPOPT_MD5SIG << 8) |
-			       TCPOLEN_MD5SIG);
-		*md5_hash = (__u8 *)ptr;
-	}
-#endif
+/* Return the doff value for a given header size. Normally this will just be
+ * 1/4 of the number of bytes, but if long options are in play we set the doff
+ * value to cover just the LO option (which is always first) */
+static unsigned tcp_doff_size(unsigned header_size) {
+  return header_size <= TCP_MAX_OPTION_SPACE + sizeof(struct tcphdr) ?
+	 header_size >> 2 :
+	 6;
 }
 
 /* This routine actually transmits TCP packets queued in by
@@ -473,11 +583,9 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	struct inet_sock *inet;
 	struct tcp_sock *tp;
 	struct tcp_skb_cb *tcb;
-	int tcp_header_size;
-#ifdef CONFIG_TCP_MD5SIG
-	struct tcp_md5sig_key *md5;
+	unsigned tcp_options_size, tcp_header_size;
+	struct tcp_md5sig_key *md5 = NULL;
 	__u8 *md5_hash_location;
-#endif
 	struct tcphdr *th;
 	int sysctl_flags;
 	int err;
@@ -502,50 +610,38 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	inet = inet_sk(sk);
 	tp = tcp_sk(sk);
 	tcb = TCP_SKB_CB(skb);
-	tcp_header_size = tp->tcp_header_len;
 
-#define SYSCTL_FLAG_TSTAMPS	0x1
-#define SYSCTL_FLAG_WSCALE	0x2
-#define SYSCTL_FLAG_SACK	0x4
+#ifdef CONFIG_TCP_MD5SIG
+	md5 = tp->af_specific->md5_lookup(sk, sk);
+#endif
 
 	sysctl_flags = 0;
 	if (unlikely(tcb->flags & TCPCB_FLAG_SYN)) {
-		tcp_header_size = sizeof(struct tcphdr) + TCPOLEN_MSS;
-		if (sysctl_tcp_timestamps) {
-			tcp_header_size += TCPOLEN_TSTAMP_ALIGNED;
-			sysctl_flags |= SYSCTL_FLAG_TSTAMPS;
-		}
-		if (sysctl_tcp_window_scaling) {
-			tcp_header_size += TCPOLEN_WSCALE_ALIGNED;
-			sysctl_flags |= SYSCTL_FLAG_WSCALE;
-		}
-		if (sysctl_tcp_sack) {
-			sysctl_flags |= SYSCTL_FLAG_SACK;
-			if (!(sysctl_flags & SYSCTL_FLAG_TSTAMPS))
-				tcp_header_size += TCPOLEN_SACKPERM_ALIGNED;
-		}
-	} else if (unlikely(tp->rx_opt.eff_sacks)) {
-		/* A SACK is 2 pad bytes, a 2 byte header, plus
-		 * 2 32-bit sequence numbers for each SACK block.
-		 */
-		tcp_header_size += (TCPOLEN_SACK_BASE_ALIGNED +
-				    (tp->rx_opt.eff_sacks *
-				     TCPOLEN_SACK_PERBLOCK));
+		tcp_options_size =
+		  tcp_build_syn_options(NULL, 0,
+					tcp_advertise_mss(sk),
+					sysctl_tcp_timestamps,
+					sysctl_tcp_sack,
+#ifdef CONFIG_TCP_LONG_OPTIONS
+					sysctl_tcp_long_options,
+#else
+					0,
+#endif
+					sysctl_tcp_window_scaling,
+					tp->rx_opt.rcv_wscale,
+					tcb->when,
+					tp->rx_opt.ts_recent,
+					md5 ? &md5_hash_location : NULL, 0);
+	} else {
+		tcp_options_size =
+		  tcp_build_options(NULL, sk, tcb->when,
+				    md5 ? &md5_hash_location : NULL, 0);
 	}
+	tcp_header_size = tcp_options_size + sizeof(struct tcphdr);
 
 	if (tcp_packets_in_flight(tp) == 0)
 		tcp_ca_event(sk, CA_EVENT_TX_START);
 
-#ifdef CONFIG_TCP_MD5SIG
-	/*
-	 * Are we doing MD5 on this segment? If so - make
-	 * room for it.
-	 */
-	md5 = tp->af_specific->md5_lookup(sk, sk);
-	if (md5)
-		tcp_header_size += TCPOLEN_MD5SIG_ALIGNED;
-#endif
-
 	skb_push(skb, tcp_header_size);
 	skb_reset_transport_header(skb);
 	skb_set_owner_w(skb, sk);
@@ -556,7 +652,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	th->dest		= inet->dport;
 	th->seq			= htonl(tcb->seq);
 	th->ack_seq		= htonl(tp->rcv_nxt);
-	*(((__be16 *)th) + 6)	= htons(((tcp_header_size >> 2) << 12) |
+	*(((__be16 *)th) + 6)	= htons((tcp_doff_size(tcp_header_size) << 12) |
 					tcb->flags);
 
 	if (unlikely(tcb->flags & TCPCB_FLAG_SYN)) {
@@ -577,26 +673,25 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	}
 
 	if (unlikely(tcb->flags & TCPCB_FLAG_SYN)) {
-		tcp_syn_build_options((__be32 *)(th + 1),
+		tcp_build_syn_options((__be32 *)(th + 1), 0,
 				      tcp_advertise_mss(sk),
-				      (sysctl_flags & SYSCTL_FLAG_TSTAMPS),
-				      (sysctl_flags & SYSCTL_FLAG_SACK),
-				      (sysctl_flags & SYSCTL_FLAG_WSCALE),
+				      sysctl_tcp_timestamps,
+				      sysctl_tcp_sack,
+#ifdef CONFIG_TCP_LONG_OPTIONS
+				      sysctl_tcp_long_options,
+#else
+				      0,
+#endif
+				      sysctl_tcp_window_scaling,
 				      tp->rx_opt.rcv_wscale,
 				      tcb->when,
 				      tp->rx_opt.ts_recent,
-
-#ifdef CONFIG_TCP_MD5SIG
-				      md5 ? &md5_hash_location :
-#endif
-				      NULL);
+				      md5 ? &md5_hash_location : NULL,
+				      tcp_options_size);
 	} else {
-		tcp_build_and_update_options((__be32 *)(th + 1),
-					     tp, tcb->when,
-#ifdef CONFIG_TCP_MD5SIG
-					     md5 ? &md5_hash_location :
-#endif
-					     NULL);
+		tcp_build_options((__be32 *)(th + 1), sk, tcb->when,
+				  md5 ? &md5_hash_location : NULL,
+				  tcp_options_size);
 		TCP_ECN_send(sk, skb, tcp_header_size);
 	}
 
@@ -612,7 +707,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	}
 #endif
 
-	icsk->icsk_af_ops->send_check(sk, skb->len, skb);
+	icsk->icsk_af_ops->send_check(sk, skb->len, tcp_header_size, skb);
 
 	if (likely(tcb->flags & TCPCB_FLAG_ACK))
 		tcp_event_ack_sent(sk, tcp_skb_pcount(skb));
@@ -630,10 +725,6 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
 	tcp_enter_cwr(sk, 1);
 
 	return net_xmit_eval(err);
-
-#undef SYSCTL_FLAG_TSTAMPS
-#undef SYSCTL_FLAG_WSCALE
-#undef SYSCTL_FLAG_SACK
 }
 
 /* This routine just queue's the buffer
@@ -974,6 +1065,8 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed)
 	u32 mss_now;
 	u16 xmit_size_goal;
 	int doing_tso = 0;
+	unsigned header_len;
+	__u8 **md5_ptr = NULL, *md5_hash;
 
 	mss_now = tp->mss_cache;
 
@@ -986,22 +1079,29 @@ unsigned int tcp_current_mss(struct sock *sk, int large_allowed)
 			mss_now = tcp_sync_mss(sk, mtu);
 	}
 
-	if (tp->rx_opt.eff_sacks)
-		mss_now -= (TCPOLEN_SACK_BASE_ALIGNED +
-			    (tp->rx_opt.eff_sacks * TCPOLEN_SACK_PERBLOCK));
-
 #ifdef CONFIG_TCP_MD5SIG
 	if (tp->af_specific->md5_lookup(sk, sk))
-		mss_now -= TCPOLEN_MD5SIG_ALIGNED;
+		md5_ptr = &md5_hash;
 #endif
 
+	header_len = tcp_build_options(NULL, sk, 0, md5_ptr, 0) +
+		     sizeof(struct tcphdr);
+	/* The mss_cache is sized based on tp->tcp_header_len, which assumes
+	 * some common options. If this is an odd packet (because we have SACK
+	 * blocks etc) then our calculated header_len will be different, and
+	 * we have to adjust mss_now correspondingly */
+	if (header_len != tp->tcp_header_len) {
+		int delta = (int) header_len - tp->tcp_header_len;
+		mss_now -= delta;
+	}
+
 	xmit_size_goal = mss_now;
 
 	if (doing_tso) {
 		xmit_size_goal = ((sk->sk_gso_max_size - 1) -
 				  inet_csk(sk)->icsk_af_ops->net_header_len -
 				  inet_csk(sk)->icsk_ext_hdr_len -
-				  tp->tcp_header_len);
+				  header_len);
 
 		xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
 		xmit_size_goal -= (xmit_size_goal % mss_now);
@@ -2177,12 +2277,10 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 	struct inet_request_sock *ireq = inet_rsk(req);
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct tcphdr *th;
-	int tcp_header_size;
+	int tcp_options_size, tcp_header_size;
 	struct sk_buff *skb;
-#ifdef CONFIG_TCP_MD5SIG
-	struct tcp_md5sig_key *md5;
+	struct tcp_md5sig_key *md5 = NULL;
 	__u8 *md5_hash_location;
-#endif
 
 	skb = sock_wmalloc(sk, MAX_TCP_HEADER + 15, 1, GFP_ATOMIC);
 	if (skb == NULL)
@@ -2193,18 +2291,23 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 
 	skb->dst = dst_clone(dst);
 
-	tcp_header_size = (sizeof(struct tcphdr) + TCPOLEN_MSS +
-			   (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0) +
-			   (ireq->wscale_ok ? TCPOLEN_WSCALE_ALIGNED : 0) +
-			   /* SACK_PERM is in the place of NOP NOP of TS */
-			   ((ireq->sack_ok && !ireq->tstamp_ok) ? TCPOLEN_SACKPERM_ALIGNED : 0));
-
 #ifdef CONFIG_TCP_MD5SIG
-	/* Are we doing MD5 on this segment? If so - make room for it */
 	md5 = tcp_rsk(req)->af_specific->md5_lookup(sk, req);
-	if (md5)
-		tcp_header_size += TCPOLEN_MD5SIG_ALIGNED;
 #endif
+
+	tcp_options_size =
+	  tcp_build_syn_options(NULL, 1, dst_metric(dst, RTAX_ADVMSS),
+				ireq->tstamp_ok, ireq->sack_ok,
+#ifdef CONFIG_TCP_LONG_OPTIONS
+				ireq->long_options,
+#else
+				0,
+#endif
+				ireq->wscale_ok, ireq->rcv_wscale,
+				TCP_SKB_CB(skb)->when, req->ts_recent,
+				md5 ? &md5_hash_location : NULL, 0);
+	tcp_header_size = tcp_options_size + sizeof(struct tcphdr);
+
 	skb_push(skb, tcp_header_size);
 	skb_reset_transport_header(skb);
 
@@ -2244,18 +2347,19 @@ struct sk_buff *tcp_make_synack(struct sock *sk, struct dst_entry *dst,
 	else
 #endif
 	TCP_SKB_CB(skb)->when = tcp_time_stamp;
-	tcp_syn_build_options((__be32 *)(th + 1), dst_metric(dst, RTAX_ADVMSS), ireq->tstamp_ok,
-			      ireq->sack_ok, ireq->wscale_ok, ireq->rcv_wscale,
-			      TCP_SKB_CB(skb)->when,
-			      req->ts_recent,
-			      (
-#ifdef CONFIG_TCP_MD5SIG
-			       md5 ? &md5_hash_location :
+	tcp_build_syn_options((__be32 *)(th + 1), 1, dst_metric(dst, RTAX_ADVMSS),
+			      ireq->tstamp_ok, ireq->sack_ok,
+#ifdef CONFIG_TCP_LONG_OPTIONS
+			      ireq->long_options,
+#else
+			      0,
 #endif
-			       NULL)
-			      );
+			      ireq->wscale_ok, ireq->rcv_wscale,
+			      TCP_SKB_CB(skb)->when, req->ts_recent,
+			      md5 ? &md5_hash_location : NULL,
+			      tcp_options_size);
 
-	th->doff = (tcp_header_size >> 2);
+	th->doff = tcp_doff_size(tcp_header_size);
 	TCP_INC_STATS(TCP_MIB_OUTSEGS);
 
 #ifdef CONFIG_TCP_MD5SIG
--
