[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <1334778707.2472.333.camel@edumazet-glaptop>
Date: Wed, 18 Apr 2012 21:51:47 +0200
From: Eric Dumazet <eric.dumazet@...il.com>
To: Neal Cardwell <ncardwell@...gle.com>
Cc: David Miller <davem@...emloft.net>,
netdev <netdev@...r.kernel.org>,
Tom Herbert <therbert@...gle.com>,
Maciej Żenczykowski <maze@...gle.com>,
Yuchung Cheng <ycheng@...gle.com>
Subject: [PATCH v2 net-next] tcp: avoid expensive pskb_expand_head() calls
From: Eric Dumazet <edumazet@...gle.com>
While doing netperf sessions on 10Gb Intel nics (ixgbe), I noticed
unexpected profiling results, with pskb_expand_head() being in the top.
After further analysis, I found we hit badly page refcounts,
because when we transmit full size skb (64 KB), we can receive ACK for
the first segments of the frame while skb was not completely sent by
NIC.
It takes ~54 us to send a full TSO packet at 10Gb speed, but with a
close peer, we can receive TCP ACK in less than 50 us rtt.
This is also true on 1Gb links but we were limited by wire speed, not
cpu.
When we try to trim skb, tcp_trim_head() has to call pskb_expand_head(),
because the skb clone we did for transmit is still alive in TX ring
buffer.
pskb_expand_head() is really expensive : It has to make about 16+16
atomic operations on page refcounts, not counting the skb head
reallocation/copy. It increases chances of false sharing.
In fact, we dont really need to trim skb. This costly operation can be
delayed to the point it is really needed : Thats when a retransmit must
happen.
Most of the time, upcoming ACKS will ack the whole packet, and we can
free it with minimal cost (since clone was already freed by TX
completion)
Of course, this means we dont uncharge the acked part from socket limits
until retransmit, but this is hardly a concern with current autotuning
(around 4MB per socket)
Even with small cwnd limit, a single packet can not hold more than half
the window.
Performance results on my Q6600 cpu and 82599EB 10-Gigabit card :
About 3% less cpu used for same workload (single netperf TCP_STREAM),
bounded by x4 PCI-e slots (4660 Mbits).
Signed-off-by: Eric Dumazet <edumazet@...gle.com>
Cc: Tom Herbert <therbert@...gle.com>
Cc: Neal Cardwell <ncardwell@...gle.com>
Cc: Maciej Żenczykowski <maze@...gle.com>
Cc: Yuchung Cheng <ycheng@...gle.com>
---
v2 : added Neal suggestions
include/net/tcp.h | 6 ++++--
net/ipv4/tcp_input.c | 22 +++++++++++-----------
net/ipv4/tcp_output.c | 25 +++++++++++++++++--------
3 files changed, 32 insertions(+), 21 deletions(-)
diff --git a/include/net/tcp.h b/include/net/tcp.h
index d5984e3..0f57706 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -477,7 +477,8 @@ extern int tcp_retransmit_skb(struct sock *, struct sk_buff *);
extern void tcp_retransmit_timer(struct sock *sk);
extern void tcp_xmit_retransmit_queue(struct sock *);
extern void tcp_simple_retransmit(struct sock *);
-extern int tcp_trim_head(struct sock *, struct sk_buff *, u32);
+extern void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
+ unsigned int mss_now);
extern int tcp_fragment(struct sock *, struct sk_buff *, u32, unsigned int);
extern void tcp_send_probe0(struct sock *);
@@ -640,7 +641,8 @@ struct tcp_skb_cb {
#if IS_ENABLED(CONFIG_IPV6)
struct inet6_skb_parm h6;
#endif
- } header; /* For incoming frames */
+ unsigned int offset_ack; /* part of acked data in this skb */
+ } header;
__u32 seq; /* Starting sequence number */
__u32 end_seq; /* SEQ + FIN + SYN + datalen */
__u32 when; /* used to compute rtt's */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 99448f0..bdec2e6 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3260,25 +3260,25 @@ static void tcp_rearm_rto(struct sock *sk)
}
}
-/* If we get here, the whole TSO packet has not been acked. */
+/* If we get here, the whole packet has not been acked.
+ * We used to call tcp_trim_head() to remove acked data from skb,
+ * but its expensive with TSO if our previous clone is still in flight.
+ * We thus maintain an offset_ack, and hope no pskb_expand_head()
+ * is needed until whole packet is acked by upcoming ACKs.
+ */
static u32 tcp_tso_acked(struct sock *sk, struct sk_buff *skb)
{
struct tcp_sock *tp = tcp_sk(sk);
- u32 packets_acked;
+ u32 oldpcount = tcp_skb_pcount(skb);
BUG_ON(!after(TCP_SKB_CB(skb)->end_seq, tp->snd_una));
- packets_acked = tcp_skb_pcount(skb);
- if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
- return 0;
- packets_acked -= tcp_skb_pcount(skb);
+ TCP_SKB_CB(skb)->header.offset_ack = tp->snd_una - TCP_SKB_CB(skb)->seq;
- if (packets_acked) {
- BUG_ON(tcp_skb_pcount(skb) == 0);
- BUG_ON(!before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq));
- }
+ if (oldpcount > 1)
+ tcp_set_skb_tso_segs(sk, skb, tcp_skb_mss(skb));
- return packets_acked;
+ return oldpcount - tcp_skb_pcount(skb);
}
/* Remove acknowledged frames from the retransmission queue. If our packet
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index de8790c..f66df37 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -927,11 +927,15 @@ static void tcp_queue_skb(struct sock *sk, struct sk_buff *skb)
sk_mem_charge(sk, skb->truesize);
}
-/* Initialize TSO segments for a packet. */
-static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
- unsigned int mss_now)
+/* Initialize TSO segments for a packet.
+ * Part of skb (offset_ack) might have been acked.
+ */
+void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
+ unsigned int mss_now)
{
- if (skb->len <= mss_now || !sk_can_gso(sk) ||
+ unsigned int len = skb->len - TCP_SKB_CB(skb)->header.offset_ack;
+
+ if (len <= mss_now || !sk_can_gso(sk) ||
skb->ip_summed == CHECKSUM_NONE) {
/* Avoid the costly divide in the normal
* non-TSO case.
@@ -940,7 +944,7 @@ static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
skb_shinfo(skb)->gso_size = 0;
skb_shinfo(skb)->gso_type = 0;
} else {
- skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss_now);
+ skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(len, mss_now);
skb_shinfo(skb)->gso_size = mss_now;
skb_shinfo(skb)->gso_type = sk->sk_gso_type;
}
@@ -1125,15 +1129,20 @@ static void __pskb_trim_head(struct sk_buff *skb, int len)
skb->len = skb->data_len;
}
-/* Remove acked data from a packet in the transmit queue. */
-int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
+/* Remove acked data from a packet in the transmit queue.
+ * Ony called before retransmit.
+ */
+static int tcp_trim_head(struct sock *sk, struct sk_buff *skb)
{
+ u32 len = tcp_sk(sk)->snd_una - TCP_SKB_CB(skb)->seq;
+
if (skb_cloned(skb) && pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
return -ENOMEM;
__pskb_trim_head(skb, len);
TCP_SKB_CB(skb)->seq += len;
+ TCP_SKB_CB(skb)->header.offset_ack = 0;
skb->ip_summed = CHECKSUM_PARTIAL;
skb->truesize -= len;
@@ -2096,7 +2105,7 @@ int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
if (before(TCP_SKB_CB(skb)->seq, tp->snd_una)) {
if (before(TCP_SKB_CB(skb)->end_seq, tp->snd_una))
BUG();
- if (tcp_trim_head(sk, skb, tp->snd_una - TCP_SKB_CB(skb)->seq))
+ if (tcp_trim_head(sk, skb))
return -ENOMEM;
}
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists