Date:	Wed, 18 Apr 2012 14:40:37 -0400
From:	Neal Cardwell <ncardwell@...gle.com>
To:	Eric Dumazet <eric.dumazet@...il.com>
Cc:	David Miller <davem@...emloft.net>,
	netdev <netdev@...r.kernel.org>,
	Tom Herbert <therbert@...gle.com>,
	Maciej Żenczykowski <maze@...gle.com>,
	Yuchung Cheng <ycheng@...gle.com>
Subject: Re: [PATCH net-next] tcp: avoid expensive pskb_expand_head() calls

On Wed, Apr 18, 2012 at 11:49 AM, Eric Dumazet <eric.dumazet@...il.com> wrote:
> From: Eric Dumazet <edumazet@...gle.com>
>
> While doing netperf sessions on 10Gb Intel NICs (ixgbe), I noticed
> unexpected profiling results, with pskb_expand_head() showing up near
> the top.
>
> After further analysis, I found we were taking a heavy hit on page
> refcounts, because when we transmit a full-size skb (64 KB), we can
> receive an ACK for the first segments of the frame while the skb has
> not yet been completely sent by the NIC.
>
> It takes ~54 us to send a full TSO packet at 10Gb speed, but with a
> close peer, we can receive a TCP ACK in less than 50 us RTT.
>
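(A quick sanity check of those numbers, assuming the 64 KB TSO size
mentioned above: 65536 bytes * 8 = 524288 bits, and 524288 bits at
10 Gbit/s is roughly 52 us for the payload alone, a bit more once
per-segment headers are added, which matches the ~54 us figure. So a
sub-50-us RTT to a nearby peer can indeed deliver ACKs for the first
segments before the NIC has finished putting the frame on the wire.)
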
> This is also true on 1Gb links, but there we were limited by wire
> speed, not CPU.
>
> When we try to trim the skb, tcp_trim_head() has to call
> pskb_expand_head(), because the skb clone we did for transmit is still
> alive in the TX ring buffer.
>
> pskb_expand_head() is really expensive: it has to perform about 16+16
> atomic operations on page refcounts, not counting the skb head
> reallocation/copy, and it increases the chances of false sharing.
>
> In fact, we don't really need to trim the skb. This costly operation
> can be delayed until it is really needed: when a retransmit must
> happen.
>
> Most of the time, upcoming ACKs will ack the whole packet, and we can
> free it at minimal cost (since the clone was already freed by TX
> completion).
>
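As a reader's aid, here is a rough, hypothetical sketch of the
deferred-trim idea described above. It is not the actual patch: the
function names tcp_note_partial_ack() and tcp_prepare_retransmit() are
made up for illustration, and only tcp_trim_head(), TCP_SKB_CB() and
the offset_ack field quoted further down come from the real code.

static void tcp_note_partial_ack(struct sock *sk, struct sk_buff *skb)
{
        struct tcp_sock *tp = tcp_sk(sk);

        /* Instead of trimming the acked prefix now, while a clone of
         * this skb may still sit in the NIC TX ring, just remember how
         * much of it is already acked.
         */
        TCP_SKB_CB(skb)->header.offset_ack =
                tp->snd_una - TCP_SKB_CB(skb)->seq;
}

static int tcp_prepare_retransmit(struct sock *sk, struct sk_buff *skb)
{
        u32 offset = TCP_SKB_CB(skb)->header.offset_ack;

        /* Pay the pskb_expand_head()/trim cost only here, when we
         * really have to retransmit the un-acked remainder.
         */
        return offset ? tcp_trim_head(sk, skb, offset) : 0;
}

In the common case the whole skb gets acked and freed without ever
paying that cost.
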
> Of course, this means we don't uncharge the acked part from the socket
> limits until retransmit, but this is hardly a concern with current
> autotuning (around 4 MB per socket). Even with a small cwnd limit, a
> single packet cannot hold more than half the window.
>
> Performance results on my Q6600 CPU and 82599EB 10-Gigabit card:
> about 3% less CPU used for the same workload (a single netperf
> TCP_STREAM), bounded by the x4 PCI-e slot (4660 Mbit/s).
>
> Signed-off-by: Eric Dumazet <edumazet@...gle.com>
> Cc: Tom Herbert <therbert@...gle.com>
> Cc: Neal Cardwell <ncardwell@...gle.com>
> Cc: Maciej Żenczykowski <maze@...gle.com>
> Cc: Yuchung Cheng <ycheng@...gle.com>

Nice find!

> +       prev_packets_acked = tcp_skb_pcount(skb);

FWIW, I'd find old_pcount or prev_pcount a little easier to read than
prev_packets_acked here (I see "oldpcount" is used in tcp_output
in a similar context).

>        TCP_SKB_CB(skb)->seq += len;
> +       TCP_SKB_CB(skb)->header.offset_ack = 0;

If the caller decides to trim a prefix of the skb that does not extend
to snd_una, then setting offset_ack to 0 here will cause us to forget
that some prefix is ACKed when we should have remembered it. For
example, if seq is 1000 and snd_una is 3000, a caller that trims only
1000 bytes would leave 1000 already-acked bytes unrecorded once
offset_ack is reset to 0. However, the API for the function invites the
caller to chop off an arbitrary amount (and there's no comment to
dissuade the caller from trying this). This seems to risk bugs in the
future.

To attempt to make this API safer and simpler for future generations,
what do you think about calculating the len inside tcp_trim_head(),
something like:

static int tcp_trim_head(struct sock *sk, struct sk_buff *skb)
{
  u32 len = tp->snd_una - TCP_SKB_CB(skb)->seq;
...

...
 if (tcp_trim_head(sk, skb))
...
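
To make that concrete, a fuller (still hypothetical) version might look
like the following, with the tp declaration spelled out and the rest of
the existing body left unchanged; __pskb_trim_head() is the existing
helper in tcp_output.c:

static int tcp_trim_head(struct sock *sk, struct sk_buff *skb)
{
        struct tcp_sock *tp = tcp_sk(sk);
        /* Always trim exactly the already-acked prefix, so no caller
         * can pass an arbitrary length and leave offset_ack stale.
         */
        u32 len = tp->snd_una - TCP_SKB_CB(skb)->seq;

        if (skb_cloned(skb) &&
            pskb_expand_head(skb, 0, 0, GFP_ATOMIC))
                return -ENOMEM;

        __pskb_trim_head(skb, len);

        TCP_SKB_CB(skb)->seq += len;
        TCP_SKB_CB(skb)->header.offset_ack = 0;

        /* ... rest of the existing trim/accounting logic unchanged ... */
        return 0;
}

with the call sites reduced to plain "if (tcp_trim_head(sk, skb))"
checks as above.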

neal