netdev - Re: [PATCH v2 net-next] tcp: avoid expensive pskb_expand

Message-ID: <alpine.DEB.2.00.1204191401240.735@wel-95.cs.helsinki.fi>
Date:	Thu, 19 Apr 2012 14:10:38 +0300 (EEST)
From:	"Ilpo Järvinen" <ilpo.jarvinen@...sinki.fi>
To:	Eric Dumazet <eric.dumazet@...il.com>
cc:	Neal Cardwell <ncardwell@...gle.com>,
	David Miller <davem@...emloft.net>,
	netdev <netdev@...r.kernel.org>,
	Tom Herbert <therbert@...gle.com>,
	"Maciej Żenczykowski" <maze@...gle.com>,
	Yuchung Cheng <ycheng@...gle.com>
Subject: Re: [PATCH v2 net-next] tcp: avoid expensive pskb_expand_head()
 calls

On Wed, 18 Apr 2012, Eric Dumazet wrote:

> From: Eric Dumazet <edumazet@...gle.com>
> 
> While doing netperf sessions on 10Gb Intel nics (ixgbe), I noticed
> unexpected profiling results, with pskb_expand_head() being in the top.
> 
> After further analysis, I found we hit badly page refcounts,
> because when we transmit full size skb (64 KB), we can receive ACK for
> the first segments of the frame while skb was not completely sent by
> NIC.
> 
> It takes ~54 us to send a full TSO packet at 10Gb speed, but with a
> close peer, we can receive TCP ACK in less than 50 us rtt.
> 
> This is also true on 1Gb links but we were limited by wire speed, not
> cpu.
> 
> When we try to trim skb, tcp_trim_head() has to call pskb_expand_head(),
> because the skb clone we did for transmit is still alive in TX ring
> buffer.
> 
> pskb_expand_head() is really expensive : It has to make about 16+16
> atomic operations on page refcounts, not counting the skb head
> reallocation/copy. It increases chances of false sharing.
> 
> In fact, we don't really need to trim the skb. This costly operation can be
> delayed to the point where it is really needed: that is, when a retransmit
> must happen.
> 
> Most of the time, upcoming ACKS will ack the whole packet, and we can
> free it with minimal cost (since clone was already freed by TX
> completion)
> 
> Of course, this means we don't uncharge the acked part from socket limits
> until retransmit, but this is hardly a concern with current autotuning
> (around 4MB per socket).
> Even with a small cwnd limit, a single packet cannot hold more than half
> the window.
> 
> Performance results on my Q6600 cpu and 82599EB 10-Gigabit card :
> About 3% less cpu used for same workload (single netperf TCP_STREAM),
> bounded by x4 PCI-e slots (4660 Mbits).
> 
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> Cc: Tom Herbert <therbert@...gle.com>
> Cc: Neal Cardwell <ncardwell@...gle.com>
> Cc: Maciej Żenczykowski <maze@...gle.com>
> Cc: Yuchung Cheng <ycheng@...gle.com>
> ---
> v2 : added Neal suggestions
> 
>  include/net/tcp.h     |    6 ++++--
>  net/ipv4/tcp_input.c  |   22 +++++++++++-----------
>  net/ipv4/tcp_output.c |   25 +++++++++++++++++--------
>  3 files changed, 32 insertions(+), 21 deletions(-)
> 
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index d5984e3..0f57706 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -477,7 +477,8 @@ extern int tcp_retransmit_skb(struct sock *, struct sk_buff *);
>  extern void tcp_retransmit_timer(struct sock *sk);
>  extern void tcp_xmit_retransmit_queue(struct sock *);
>  extern void tcp_simple_retransmit(struct sock *);
> -extern int tcp_trim_head(struct sock *, struct sk_buff *, u32);
> +extern void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
> +				 unsigned int mss_now);
>  extern int tcp_fragment(struct sock *, struct sk_buff *, u32, unsigned int);
>  
>  extern void tcp_send_probe0(struct sock *);
> @@ -640,7 +641,8 @@ struct tcp_skb_cb {
>  #if IS_ENABLED(CONFIG_IPV6)
>  		struct inet6_skb_parm	h6;
>  #endif
> -	} header;	/* For incoming frames		*/
> +		unsigned int offset_ack; /* part of acked data in this skb */
> +	} header;
>  	__u32		seq;		/* Starting sequence number	*/
>  	__u32		end_seq;	/* SEQ + FIN + SYN + datalen	*/
>  	__u32		when;		/* used to compute rtt's	*/
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 99448f0..bdec2e6 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -3260,25 +3260,25 @@ static void tcp_rearm_rto(struct sock *sk)
>  	}
>  }
>  
> -/* If we get here, the whole TSO packet has not been acked. */
> +/* If we get here, the whole packet has not been acked.
> + * We used to call tcp_trim_head() to remove acked data from skb,
> + * but its expensive with TSO if our previous clone is still in flight.
> + * We thus maintain an offset_ack, and hope no pskb_expand_head()
> + * is needed until whole packet is acked by upcoming ACKs.
> + */
>  static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
>  {
>  	struct tcp_sock *tp = tcp_sk(sk);
> -	u32 packets_acked;
> +	u32 oldpcount = tcp_skb_pcount(skb);
>  
>  	BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una));
>  
> -	packets_acked = tcp_skb_pcount(skb);
> -	if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
> -		return 0;
> -	packets_acked -= tcp_skb_pcount(skb);
> +	TCP_SKB_CB(skb)->header.offset_ack = tp->snd_una - TCP_SKB_CB(skb)->seq;

Now that offset_ack can be non-zero, are the tcp_fragment() call sites 
still safe and working? I'm mostly worried about tcp_mark_head_lost(), which 
makes some assumptions about tp->snd_una and TCP_SKB_CB(skb)->seq. The 
other fragmenting paths do not preserve offset_ack properly either (which 
might not be the end of the world, though).
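
To make the concern concrete, here is a toy user-space model, not kernel
code. Everything named toy_* is hypothetical and only mirrors the seq,
end_seq and offset_ack fields discussed above. It assumes a fragmenting
helper that splits at a byte offset counted from seq and does not know
about offset_ack; whether tcp_fragment() really behaves like the naive
variant under this patch is exactly the open question.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the tcp_skb_cb fields discussed here (hypothetical). */
struct toy_skb {
	uint32_t seq;        /* first sequence number covered */
	uint32_t end_seq;    /* one past the last sequence number covered */
	uint32_t offset_ack; /* bytes at the front already acked, not trimmed */
};

/* Assumed invariant: the acked offset never exceeds the payload length. */
static int toy_skb_consistent(const struct toy_skb *skb)
{
	return skb->offset_ack <= skb->end_seq - skb->seq;
}

/*
 * Split at len bytes from skb->seq while ignoring offset_ack:
 * the head keeps its old value and the tail starts from zero.
 */
static struct toy_skb toy_fragment_naive(struct toy_skb *skb, uint32_t len)
{
	struct toy_skb tail;

	assert(len > 0 && len < skb->end_seq - skb->seq);
	tail.seq = skb->seq + len;
	tail.end_seq = skb->end_seq;
	tail.offset_ack = 0;
	skb->end_seq = tail.seq;
	return tail;
}

/*
 * Same split, but the acked bytes are distributed: if the acked region
 * reaches past the split point, the head is fully acked and the
 * remainder is carried over to the tail.
 */
static struct toy_skb toy_fragment_aware(struct toy_skb *skb, uint32_t len)
{
	uint32_t acked = skb->offset_ack;
	struct toy_skb tail = toy_fragment_naive(skb, len);

	if (acked > len) {
		skb->offset_ack = len;
		tail.offset_ack = acked - len;
	}
	return tail;
}
```

With seq = 1000, end_seq = 4000, offset_ack = 2000 and a split at 1448
bytes, the naive variant leaves a head that claims more acked bytes than
it holds, and a tail that has forgotten its 552 already-acked bytes. The
aware variant keeps both halves consistent.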

-- 
 i.