Date:	Mon, 25 Jun 2007 09:07:48 -0700
From:	Stephen Hemminger <shemminger@...ux-foundation.org>
To:	OBATA Noboru <noboru.obata.ar@...achi.com>
Cc:	David Miller <davem@...emloft.net>, netdev@...r.kernel.org
Subject: Re: [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable

On Mon, 25 Jun 2007 22:09:39 +0900 (JST)
OBATA Noboru <noboru.obata.ar@...achi.com> wrote:

> From: OBATA Noboru <noboru.obata.ar@...achi.com>
> 
> Make TCP_RTO_MAX a variable, and allow a user to change it via a
> new sysctl entry /proc/sys/net/ipv4/tcp_rto_max.  A user can
> then guarantee TCP retransmission to be more controllable, say,
> at least once per 10 seconds, by setting it to 10.  This is
> quite helpful on failover-capable network devices, such as an
> active-backup bonding device.  On such devices, it is desirable
> that TCP retransmits a packet shortly after the failover, which
> is what I would like to do with this patch.  Please see
> Background and Problem below for rationale in detail.
> 
> Reading from /proc/sys/net/ipv4/tcp_rto_max shows the current
> TCP_RTO_MAX in seconds.  The actual value of TCP_RTO_MAX is
> stored in sysctl_tcp_rto_max in jiffies.
> 
> Writing to /proc/sys/net/ipv4/tcp_rto_max updates the
> TCP_RTO_MAX, only if the new value is not smaller than
> TCP_RTO_MIN, which is currently 0.2[sec].  Since tcp_rto_max is
> an integer, the minimum value of /proc/sys/net/ipv4/tcp_rto_max
> is 1, in substance.  Also the RtoMax entry in /proc/net/snmp is
> updated.
> 
> Please note that this is effective in IPv6 as well.
> 
> 
> Background and Problem
> ======================
> 
> When designing a TCP/IP based network system on failover-capable
> network devices, people want to set timeouts hierarchically in
> three layers, network device layer, TCP layer, and application
> layer (bottom-up order), such that:
> 
> 1. Network device layer detects a failure first and switches to a
>    backup device (say, in 20sec).
> 
> 2. TCP layer timeout & retransmission comes next, _hopefully_
>    before the application layer timeout.
> 
> 3. Application layer detects a network failure last (by, say,
>    30sec timeout) and may trigger a system-level failover.
> 
>    * Note 1.  The timeouts for #1 and #2 are handled
>      independently and there is no relationship between them.
> 
>    * Note 2.  The actual timeout settings (20sec or 30sec in
>      this example) are often determined by system requirements,
>      and so setting them to certain "safe values" (if any) is
>      usually not possible.
> 
> If TCP retransmission misses the time frame between event #1
> and #3 in Background above (between 20 and 30sec since network
> failure), a failure causes the system-level failover where the
> network-device-level failover should be enough.
> 
> The problem in this hierarchical timeout scheme is that the TCP
> layer does not guarantee that the next retransmission occurs
> within a certain period of time.  In the above example, people
> expect TCP to retransmit a packet between 20 and 30sec after the
> network failure, but it may not happen.
> 
> Starting from RTO=0.5sec for example, retransmission will occur
> at time 0.5, 1.5, 3.5, 7.5, 15.5, and 31.5 as indicated by 'o'
> in the following diagram, but miss the time frame between time
> 20 and 30.
> 
>        time: 0         10        20        30sec
>              |         |         |         |
>   App. layer |---------+---------+---------X  ==> system failover
>    TCP layer oo-o---o--+----o----+---------+o <== expects retrans. b/w 20~30
> Netdev layer |---------+---------X            ==> network failover
> 
> 
> Signed-off-by: OBATA Noboru <noboru.obata.ar@...achi.com>
> ---
> 
>  Documentation/networking/ip-sysctl.txt |    6 +
>  include/linux/sysctl.h                 |    1
>  include/net/tcp.h                      |    5 +
>  net/ipv4/sysctl_net_ipv4.c             |   77 +++++++++++++++++++++++++
>  net/ipv4/tcp_timer.c                   |    3
>  5 files changed, 91 insertions(+), 1 deletion(-)
> 
> diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> --- a/Documentation/networking/ip-sysctl.txt	2007-06-22 21:34:18.000000000 +0900
> +++ b/Documentation/networking/ip-sysctl.txt	2007-06-25 16:07:21.000000000 +0900
> @@ -340,6 +340,12 @@ tcp_rmem - vector of 3 INTEGERs: min, de
>  	net.core.rmem_max, "static" selection via SO_RCVBUF does not use this.
>  	Default: 87380*2 bytes.
>  
> +tcp_rto_max - INTEGER
> +	Maximum time in seconds to which RTO can grow.  Exponential
> +	backoff of RTO is bounded by this value.  The value must not be
> +	smaller than 1.  Note this parameter is also effective for IPv6.
> +	Default: 120
> +
>  tcp_sack - BOOLEAN
>  	Enable select acknowledgments (SACKS).
>  
> diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/include/linux/sysctl.h b/include/linux/sysctl.h
> --- a/include/linux/sysctl.h	2007-06-22 21:34:33.000000000 +0900
> +++ b/include/linux/sysctl.h	2007-06-25 16:27:29.000000000 +0900
> @@ -441,6 +441,7 @@ enum
>  	NET_TCP_ALLOWED_CONG_CONTROL=123,
>  	NET_TCP_MAX_SSTHRESH=124,
>  	NET_TCP_FRTO_RESPONSE=125,
> +	NET_TCP_RTO_MAX=126,
>  };
>  

Rather than assigning another numeric sysctl value, you can use
CTL_UNNUMBERED.  Numeric sysctls are being phased out; at one
point they were even going to be deprecated.


>  enum {
> diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/include/net/tcp.h b/include/net/tcp.h
> --- a/include/net/tcp.h	2007-06-22 21:34:33.000000000 +0900
> +++ b/include/net/tcp.h	2007-06-22 21:40:05.000000000 +0900
> @@ -121,7 +121,9 @@ extern void tcp_time_wait(struct sock *s
>  #define TCP_DELACK_MIN	4U
>  #define TCP_ATO_MIN	4U
>  #endif
> -#define TCP_RTO_MAX	((unsigned)(120*HZ))
> +extern int sysctl_tcp_rto_max;
> +#define TCP_RTO_MAX	((unsigned)(sysctl_tcp_rto_max))
> +#define TCP_RTO_MAX_DEFAULT	((unsigned)(120*HZ))
>  #define TCP_RTO_MIN	((unsigned)(HZ/5))
>  #define TCP_TIMEOUT_INIT ((unsigned)(3*HZ))	/* RFC 1122 initial RTO value	*/

Rather than making the TCP_RTO_MAX macro expand to a variable, reference
sysctl_tcp_rto_max directly.

> @@ -203,6 +205,7 @@ extern int sysctl_tcp_synack_retries;
>  extern int sysctl_tcp_retries1;
>  extern int sysctl_tcp_retries2;
>  extern int sysctl_tcp_orphan_retries;
> +extern int sysctl_tcp_rto_max;
>  extern int sysctl_tcp_syncookies;
>  extern int sysctl_tcp_retrans_collapse;
>  extern int sysctl_tcp_stdurg;
> diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> --- a/net/ipv4/sysctl_net_ipv4.c	2007-06-22 21:34:33.000000000 +0900
> +++ b/net/ipv4/sysctl_net_ipv4.c	2007-06-25 16:27:53.000000000 +0900
> @@ -186,6 +186,74 @@ static int strategy_allowed_congestion_c
>  
>  }
>  
> +static int proc_tcp_rto_max(ctl_table *ctl, int write, struct file *filp,
> +			    void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	int val = *(int *)ctl->data;
> +	int ret;
> +
> +	ret = proc_dointvec_jiffies(ctl, write, filp, buffer, lenp, ppos);
> +	if (ret)
> +		return ret;
> +
> +	if (write && *(int *)ctl->data != val) {
> +		if (*(int *)ctl->data < TCP_RTO_MIN) {
> +			*(int *)ctl->data = val;
> +			return -EINVAL;
> +		}
> +		TCP_ADD_STATS_USER(TCP_MIB_RTOMAX,
> +				   (*(int *)ctl->data - val) * 1000 / HZ);
> +	}
> +
> +	return 0;
> +}
> +
> +static int strategy_tcp_rto_max(ctl_table *table, int __user *name,
> +				int nlen, void __user *oldval,
> +				size_t __user *oldlenp,
> +				void __user *newval, size_t newlen)
> +{
> +	int *valp = table->data;
> +	int new;
> +
> +	if (!newval || !newlen)
> +		return 0;
> +
> +	if (newlen != sizeof(int))
> +		return -EINVAL;
> +
> +	if (get_user(new, (int __user *)newval))
> +		return -EFAULT;
> +
> +	if (new * HZ == *valp)
> +		return 0;
> +
> +	if (new * HZ < TCP_RTO_MIN)
> +		return -EINVAL;
> +
> +	if (oldval && oldlenp) {
> +		size_t len;
> +
> +		if (get_user(len, oldlenp))
> +			return -EFAULT;
> +
> +		if (len) {
> +			if (len > table->maxlen)
> +				len = table->maxlen;
> +			if (put_user(*valp / HZ, (int __user *)oldval))
> +				return -EFAULT;
> +			if (put_user(len, oldlenp))
> +				return -EFAULT;
> +		}
> +	}
> +
> +	TCP_ADD_STATS_USER(TCP_MIB_RTOMAX, (new * HZ - *valp) * 1000 / HZ);
> +
> +	*valp = new * HZ;
> +
> +	return 1;
> +}

Could sysctl_tcp_rto_max be unsigned instead of int, to avoid possible
sign wrap issues and having to cast it on each use?
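
To make the sign wrap concern concrete, here is a userspace sketch (not
kernel code).  HZ is assumed to be 1000 for the example, TCP_RTO_MIN
mirrors the definition quoted above, and both function names are
hypothetical.  Because TCP_RTO_MIN is unsigned, a comparison like the one
in proc_tcp_rto_max converts the int operand to unsigned first, so a
negative value would look very large and pass the lower-bound check.

```c
/*
 * Userspace illustration only, not kernel code.  HZ is assumed to be
 * 1000 here; TCP_RTO_MIN mirrors the quoted definition.  Both function
 * names are hypothetical.
 */
#define HZ		1000
#define TCP_RTO_MIN	((unsigned)(HZ / 5))

/*
 * Same outcome as "int < TCP_RTO_MIN" in the patch: the usual
 * arithmetic conversions turn the int into unsigned, written out
 * explicitly here.  Returns 1 if the value would be accepted.
 */
int rto_max_accepted_as_in_patch(int jiffies)
{
	return !((unsigned)jiffies < TCP_RTO_MIN);
}

/* One possible alternative: reject negative input before comparing. */
int rto_max_accepted_with_sign_check(int jiffies)
{
	return jiffies >= 0 && (unsigned)jiffies >= TCP_RTO_MIN;
}
```

In this sketch rto_max_accepted_as_in_patch(-1 * HZ) returns 1, while
rto_max_accepted_with_sign_check(-1 * HZ) returns 0; both accept 10 * HZ.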

>  ctl_table ipv4_table[] = {
>  	{
>  		.ctl_name	= NET_IPV4_TCP_TIMESTAMPS,
> @@ -363,6 +431,15 @@ ctl_table ipv4_table[] = {
>  		.proc_handler	= &proc_dointvec
>  	},
>  	{
> +		.ctl_name	= NET_TCP_RTO_MAX,
> +		.procname	= "tcp_rto_max",
> +		.data		= &sysctl_tcp_rto_max,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= &proc_tcp_rto_max,
> +		.strategy	= &strategy_tcp_rto_max
> +	},
> +	{
>  		.ctl_name	= NET_IPV4_TCP_FIN_TIMEOUT,
>  		.procname	= "tcp_fin_timeout",
>  		.data		= &sysctl_tcp_fin_timeout,
> diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> --- a/net/ipv4/tcp_timer.c	2007-06-22 21:34:33.000000000 +0900
> +++ b/net/ipv4/tcp_timer.c	2007-06-22 21:39:35.000000000 +0900
> @@ -31,6 +31,9 @@ int sysctl_tcp_keepalive_intvl __read_mo
>  int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
>  int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
>  int sysctl_tcp_orphan_retries __read_mostly;
> +int sysctl_tcp_rto_max __read_mostly = TCP_RTO_MAX_DEFAULT;
> +
> +EXPORT_SYMBOL(sysctl_tcp_rto_max);
>  
>  static void tcp_write_timer(unsigned long);
>  static void tcp_delack_timer(unsigned long);


-- 
Stephen Hemminger <shemminger@...ux-foundation.org>