Message-ID: <3f856638-52bc-6630-a3d2-2b50f1bf7e48@gmail.com>
Date:   Tue, 10 Jul 2018 05:38:55 -0700
From:   Eric Dumazet <eric.dumazet@...il.com>
To:     Jon Maxwell <jmaxwell37@...il.com>, davem@...emloft.net
Cc:     edumazet@...gle.com, eric.dumazet@...il.com, ncardwell@...gle.com,
        David.Laight@...lab.com, kuznet@....inr.ac.ru,
        yoshfuji@...ux-ipv6.org, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org, jmaxwell@...hat.com
Subject: Re: [net-next,v3] tcp: Improve setsockopt() TCP_USER_TIMEOUT accuracy



On 07/09/2018 11:51 PM, Jon Maxwell wrote:
> v3 contains the following suggestions by Neal Cardwell:
> 
> 1) Fix up units mismatch regarding msec/jiffies.
> 2) Address possibility of time_remaining being negative.
> 3) Add a helper routine tcp_clamp_rto_to_user_timeout() to do the rto 
> calculation.
> 4) Move start_ts logic into helper routine tcp_retransmit_stamp() to 
> validate tcp_sk(sk)->retrans_stamp.
> 5) Some u32 declaration and return refactoring.
> 6) Return 0 instead of false in tcp_retransmit_stamp(), since it's not a bool.
> 
> Suggestions by David Laight:
> 
> 1) Don't cache rto in tcp_clamp_rto_to_user_timeout().
> 2) Use conditional operator instead of min_t() in 
> tcp_clamp_rto_to_user_timeout().
> 
> Changes:
> 
> 1) Call tcp_clamp_rto_to_user_timeout(sk) as an argument to 
> inet_csk_reset_xmit_timer() to save on rto declaration.
> 
> Every time the TCP retransmission timer fires, it checks whether the user 
> timeout has expired before scheduling the next retransmit timer. The retransmit 
> interval between retransmissions increases exponentially. The issue is that, in 
> order for the timeout to be detected, the retransmit timer needs to fire again. 
> If the user timeout check happens after the 9th retransmit, for example, it 
> needs to wait for the 10th retransmit timer to fire in order to evaluate whether 
> a timeout has occurred. If that interval is large enough, the timeout will be 
> inaccurate.
> 
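
To put rough numbers on that overshoot, here is a quick userspace sketch (not
kernel code; the 200 ms initial RTO and pure doubling between firings are
simplifying assumptions):

#include <stdio.h>

int main(void)
{
	const unsigned int user_timeout_ms = 10000;	/* TCP_USER_TIMEOUT */
	unsigned int rto_ms = 200;			/* assumed initial RTO */
	unsigned int elapsed_ms = 0;

	/* The user-timeout check only runs when the retransmit timer fires,
	 * so the connection can only be torn down at one of these firing
	 * points, however far past the deadline that turns out to be.
	 */
	while (elapsed_ms < user_timeout_ms) {
		elapsed_ms += rto_ms;
		printf("timer fires at %u ms (interval was %u ms)\n",
		       elapsed_ms, rto_ms);
		rto_ms *= 2;
	}
	printf("timeout detected at %u ms, %u ms late\n",
	       elapsed_ms, elapsed_ms - user_timeout_ms);
	return 0;
}
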
> For example, with a TCP_USER_TIMEOUT of 10 seconds and without the patch:
> 
> 1st retransmit:
> 
> 22:25:18.973488 IP host1.49310 > host2.search-agent: Flags [.]
> 
> Last retransmit:
> 
> 22:25:26.205499 IP host1.49310 > host2.search-agent: Flags [.]
> 
> Timeout:
> 
> send: Connection timed out
> Sun Jul  1 22:25:34 EDT 2018
> 
> We can see that the last retransmit interval took ~7 seconds, which pushed the 
> total timeout to ~15 seconds instead of the expected 10 seconds. This gets more 
> inaccurate the larger the TCP_USER_TIMEOUT value, as the retransmit interval grows.
> 
> Add tcp_clamp_rto_to_user_timeout() to determine whether the user timeout has
> expired or whether the rto interval needs to be recalculated. Use the original
> rto interval if the user timeout is not set.
> 
> Test results with the patch show the expected 10 second timeout:
> 
> 1st retransmit:
> 
> 01:37:59.022555 IP host1.49310 > host2.search-agent: Flags [.]
> 
> Last retransmit:
> 
> 01:38:06.486558 IP host1.49310 > host2.search-agent: Flags [.]
> 
> Timeout:
> 
> send: Connection timed out
> Mon Jul  2 01:38:09 EDT 2018
> 
> Signed-off-by: Jon Maxwell <jmaxwell37@...il.com>
> ---
>  net/ipv4/tcp_timer.c | 49 +++++++++++++++++++++++++++++++++++++++----------
>  1 file changed, 39 insertions(+), 10 deletions(-)
> 
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 3b3611729928..93239e58776d 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -22,6 +22,38 @@
>  #include <linux/gfp.h>
>  #include <net/tcp.h>
>  
> +u32 tcp_retransmit_stamp(struct sock *sk)

const struct sock *sk;

(To clearly express the fact this helper does not touch the socket)
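
i.e. something like this for the signature (a sketch; tcp_sk() and
tcp_rtx_queue_head() already take a const socket, so the body should compile
unchanged):

u32 tcp_retransmit_stamp(const struct sock *sk)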

> +{
> +	u32 start_ts = tcp_sk(sk)->retrans_stamp;
> +
> +	if (unlikely(!start_ts)) {
> +		struct sk_buff *head = tcp_rtx_queue_head(sk);
> +
> +		if (!head)
> +			return 0;
> +		start_ts = tcp_skb_timestamp(head);
> +	}
> +	return start_ts;
> +}
> +
> +static __u32 tcp_clamp_rto_to_user_timeout(struct sock *sk)

const struct sock *sk

> +{
> +	struct inet_connection_sock *icsk = inet_csk(sk);
> +	__u32 elapsed, user_timeout;
> +	u32 start_ts;

Mixing __u32 and u32 in new code is confusing.

__u32 is needed in uapi include files, not in kernel C code.

	u32 elapsed, user_timeout, start_ts;

> +
> +	start_ts = tcp_retransmit_stamp(sk);
> +	if (!icsk->icsk_user_timeout || !start_ts)
> +		return icsk->icsk_rto;
> +	elapsed = tcp_time_stamp(tcp_sk(sk)) - start_ts;
> +	user_timeout = jiffies_to_msecs(icsk->icsk_user_timeout);

Note that if we always do jiffies_to_msecs(icsk->icsk_user_timeout) in TCP,
we also could change the convention and store msecs in this field instead of jiffies.

That would eliminate the msecs_to_jiffies() and jiffies_to_msecs() dance.

(That would be done in a patch of its own, of course)
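
Roughly, with msecs stored in icsk_user_timeout, the setsockopt() side would
keep the user value as-is and this helper would compare milliseconds directly
(a hypothetical sketch of that follow-up, not part of this patch):

	/* do_tcp_setsockopt(), TCP_USER_TIMEOUT: store msecs directly
	 * instead of icsk->icsk_user_timeout = msecs_to_jiffies(val);
	 */
	icsk->icsk_user_timeout = val;

	/* tcp_clamp_rto_to_user_timeout(): no jiffies_to_msecs() left;
	 * only the remaining time still needs converting to jiffies,
	 * because the timer itself runs on jiffies.
	 */
	elapsed = tcp_time_stamp(tcp_sk(sk)) - start_ts;
	if (elapsed >= icsk->icsk_user_timeout)
		return 1;
	return min_t(u32, icsk->icsk_rto,
		     msecs_to_jiffies(icsk->icsk_user_timeout - elapsed));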

> +	if (elapsed >= user_timeout)
> +		return 1; /* user timeout has passed; fire ASAP */
> +	else
> +		return (icsk->icsk_rto < msecs_to_jiffies(user_timeout - elapsed)) ?
> +			icsk->icsk_rto : msecs_to_jiffies(user_timeout - elapsed);


return min_t(u32, icsk->icsk_rto, msecs_to_jiffies(user_timeout - elapsed));

(u32 is fine here: the check above guarantees user_timeout > elapsed, so the
subtraction cannot wrap)

> +}
> +
>  /**
>   *  tcp_write_err() - close socket and save error info
>   *  @sk:  The socket the error has appeared on.
> @@ -161,19 +193,15 @@ static bool retransmits_timed_out(struct sock *sk,
>  				  unsigned int timeout)
>  {
>  	const unsigned int rto_base = TCP_RTO_MIN;
> -	unsigned int linear_backoff_thresh, start_ts;
> +	unsigned int linear_backoff_thresh;
> +	u32 start_ts;

This seems a gratuitous change, making your patch bigger than necessary.

u32 and unsigned int are essentially the same.

>  
>  	if (!inet_csk(sk)->icsk_retransmits)
>  		return false;
>  
> -	start_ts = tcp_sk(sk)->retrans_stamp;
> -	if (unlikely(!start_ts)) {
> -		struct sk_buff *head = tcp_rtx_queue_head(sk);
> -
> -		if (!head)
> -			return false;
> -		start_ts = tcp_skb_timestamp(head);
> -	}
> +	start_ts = tcp_retransmit_stamp(sk);
> +	if (!start_ts)
> +		return false;

One thing you could do is make this refactoring a patch of its own and provide
a patch series with small units, making the review easier.

>  
>  	if (likely(timeout == 0)) {
>  		linear_backoff_thresh = ilog2(TCP_RTO_MAX/rto_base);
> @@ -535,7 +563,8 @@ void tcp_retransmit_timer(struct sock *sk)
>  		/* Use normal (exponential) backoff */
>  		icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>  	}
> -	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
> +	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> +				  tcp_clamp_rto_to_user_timeout(sk), TCP_RTO_MAX);
>  	if (retransmits_timed_out(sk, net->ipv4.sysctl_tcp_retries1 + 1, 0))
>  		__sk_dst_reset(sk);
>  
> 

Thanks !
