Message-ID: <20121123130749.18764.25962.stgit@dragon>
Date: Fri, 23 Nov 2012 14:08:01 +0100
From: Jesper Dangaard Brouer <brouer@...hat.com>
To: Eric Dumazet <eric.dumazet@...il.com>,
"David S. Miller" <davem@...emloft.net>,
Florian Westphal <fw@...len.de>
Cc: Jesper Dangaard Brouer <brouer@...hat.com>, netdev@...r.kernel.org,
Pablo Neira Ayuso <pablo@...filter.org>,
Thomas Graf <tgraf@...g.ch>, Cong Wang <amwang@...hat.com>,
"Patrick McHardy" <kaber@...sh.net>,
"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
Herbert Xu <herbert@...dor.hengli.com.au>
Subject: [RFC net-next PATCH V1 0/9] net: fragmentation performance
scalability on NUMA/SMP systems
This patchset implements significant performance improvements for
fragmentation handling in the kernel, with a focus on NUMA and SMP
based systems.
Review:
Please review these patches. I have purposely added comments in the
code in the "//" comment style. These comments are to be removed
before applying. They serve as questions to you, the reviewer.
The fragmentation code today:
The fragmentation code "protects" kernel resources, by implementing
some memory resource limitation code. This is centered around a
global readers-writer lock, and (per network namespace) an atomic mem
counter and an LRU (Least-Recently-Used) list. (Note that separate
global variables and namespace resources are kept for IPv4, IPv6
and Netfilter reassembly.)
The code tries to keep the memory usage between a high and low
threshold (see: /proc/sys/net/ipv4/ipfrag_{high,low}_thresh). The
"evictor" code cleans up fragments when the high threshold is
exceeded, and only stops when the low threshold is reached.
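The high/low threshold behaviour described above can be illustrated with a small userspace model. This is only a sketch: struct frag_limits, evictor_run() and the array of queue sizes are hypothetical names made up for illustration, and the real evictor walks a kernel LRU list under locks rather than a plain array.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the two thresholds (not the kernel structs). */
struct frag_limits {
    size_t high_thresh;  /* eviction starts once usage exceeds this */
    size_t low_thresh;   /* eviction continues until usage is at or below this */
};

/*
 * lru_sizes[] holds the memory used by each frag queue, oldest first.
 * Returns the number of queues evicted and updates *mem.
 */
size_t evictor_run(const struct frag_limits *lim, size_t *mem,
                   const size_t *lru_sizes, size_t nqueues)
{
    size_t evicted = 0;

    if (*mem <= lim->high_thresh)
        return 0;                       /* not above the high mark: no work */

    while (*mem > lim->low_thresh && evicted < nqueues) {
        size_t sz = lru_sizes[evicted]; /* oldest queue is evicted first */

        *mem = (sz < *mem) ? *mem - sz : 0;
        evicted++;
    }
    return evicted;
}
```

Note the hysteresis: with high=256 and low=192, a usage of 250 triggers nothing, while a usage of 300 keeps evicting until the usage has dropped to 192 or below.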
The scalability problem:
Having a global/central variable for a resource limit is obviously a
scalability issue on SMP systems, and even amplified on a NUMA based
system.
When profiling the code, the scalability problem appeared to be the
readers-writer lock. But, surprise, the primary scalability issue
was caused by the global atomic mem limit counter, which, especially
on NUMA systems, would prolong the time spent inside the
readers-writer lock sections. It is not trivial to remove the
readers-writer lock, but it is possible to reduce the number of
writer lock sections.
Testlab:
My original big-testlab was based on four Intel based 10Gbit/s NICs
on two identical Sandy-Bridge-E NUMA systems. The testlab
used/available while rebasing to net-next was not as powerful.
It is based on a single Sandy-Bridge-E NUMA system with the same Intel
10G NICs, but the generator machine was an old Core-i7 920 with some
older NICs. This means that I have not been able to generate full 4x
10G wirespeed. I have chosen (mostly) to include 2x 10G test results
due to the generator machine (although the 4x 10G results from the
big system look more impressive).
The tests are performed with netperf -t UDP_STREAM (which by default
sends UDP packets of size 65507 bytes, which get fragmented). The
netserver instances are pinned with numactl, and smp_affinity is
aligned so that each physical NIC is served by the CPU socket of the
NUMA node it is connected to.
Performance results:
For the impressive 4x 10Gbit/s big-testlab results, performance goes
from (a collective) 496 Mbit/s to 38463 Mbit/s (per stream 9615 Mbit/s)
(at packet size 65507 bytes)
For the results to be fair/meaningful, I'll report the used packet
size, as (after the fixes) bigger UDP packets scale better, because
smaller packets will require/create more frag queues to handle.
I'll report packet size 65507 and three fragments 1472*3=4416 bytes.
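As a quick check of the two packet sizes, the number of IPv4 fragments per UDP datagram can be computed as below. This is a sketch under stated assumptions: an Ethernet MTU of 1500 bytes and a 20 byte IPv4 header without options (neither is stated in this mail), and udp_fragment_count() is a name made up for illustration.

```c
#include <assert.h>

enum {
    MODEL_MTU = 1500,      /* assumed Ethernet MTU */
    MODEL_IPV4_HDR = 20,   /* assumed IPv4 header without options */
    MODEL_UDP_HDR = 8      /* UDP header, carried in the first fragment */
};

/* Number of IP packets needed to carry one UDP datagram. */
unsigned int udp_fragment_count(unsigned int udp_payload)
{
    unsigned int ip_payload = udp_payload + MODEL_UDP_HDR;
    /* Fragment payloads must be a multiple of 8 bytes: 1480 here. */
    unsigned int per_frag = (MODEL_MTU - MODEL_IPV4_HDR) & ~7u;

    return (ip_payload + per_frag - 1) / per_frag;  /* round up */
}
```

Under these assumptions, size 4416 gives 3 fragments (4424 = 1480 + 1480 + 1464 bytes of IP payload) and size 65507 gives 45 fragments, so a single large datagram keeps one frag queue busy much longer than a small one.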
Disabled Ethernet Flow Control (via ethtool -A). To show the real
effect of the patches, the system needs to be in an "overload"
situation. When Ethernet Flow Control is enabled, the system will
make the generator back-off, and the code path will be less stressed.
Thus, I have disabled Ethernet Flow Control.
No patches:
-------
Results without any patches, and no flow control:
2x10G size(65507) result:(7+50) =57 Mbit/s (gen:9613+9473 Mbit/s)
2x10G size(4416) result:(3619+3772)=7391 Mbit/s (gen:8339+9105 Mbit/s)
The very poor result with large frames is a result of the "evictor"
code, which gets fixed in patch-01.
Patch-01: net: frag evictor, avoid killing warm frag queues
-------
The fragmentation evictor has a very unfortunate strategy for
killing fragments when the system is put under pressure. The
evictor code basically kills "warm" fragments too quickly, resulting
in a massive, DoS-like performance drop, as seen in the (no-patch)
results above with large packets.
The solution is to avoid killing "warm" fragments, and rather block
new incoming fragments in case the mem limit is exceeded. This is
solved by introducing a creation time-stamp, which is set to
"jiffies" in inet_frag_alloc().
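The idea of a creation time-stamp protecting "warm" queues can be sketched in userspace as follows. All names here (struct frag_queue_model, frag_queue_is_warm(), frag_queue_may_evict(), the warm_ticks parameter) are hypothetical; the actual warm period and field names used by the patch are not given in this mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a frag queue carrying a creation time-stamp. */
struct frag_queue_model {
    uint32_t creation_ts;  /* tick counter value when the queue was allocated */
};

/* A queue is "warm" while it is younger than warm_ticks ticks. */
bool frag_queue_is_warm(const struct frag_queue_model *q,
                        uint32_t now, uint32_t warm_ticks)
{
    /* Unsigned subtraction stays correct when the tick counter wraps. */
    uint32_t age = now - q->creation_ts;

    return age < warm_ticks;
}

/* The evictor would skip warm queues instead of killing them. */
bool frag_queue_may_evict(const struct frag_queue_model *q,
                          uint32_t now, uint32_t warm_ticks)
{
    return !frag_queue_is_warm(q, now, warm_ticks);
}
```

With warm_ticks set to 1, a queue is protected only during the tick in which it was created; larger values trade memory pressure for a better chance of completing reassembly.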
2x10G size(65507) result:(3011+2568)=5579 Mbit/s (gen:9613+9553 Mbit/s)
2x10G size(4416) result:(3716+3518)=7234 Mbit/s (gen:9037+8614 Mbit/s)
Patch-02: cache line adjust inet_frag_queue.net (netns)
-------
Avoid possible cache-line bounces in struct inet_frag_queue by
moving the net pointer (struct netns_frags), because it is placed on
the same write-often cache-line as e.g. refcnt and lock.
2x10G size(65507) result:(2960+2613)=5573 Mbit/s (gen:9614+9465 Mbit/s)
2x10G size(4416) result:(3858+3650)=7508 Mbit/s (gen:8076+7633 Mbit/s)
The performance benefit looks small. We can discuss if this patch is
needed or not.
Patch-03: move LRU list maintenance outside of rwlock
-------
Updating the fragmentation queues' LRU (Least-Recently-Used) list
required taking the hash writer lock. However, the LRU list isn't
tied to the hash at all, so we can use a separate lock for it.
This patch looks like a performance loss for big packets, but the LRU
locking changes are needed by later patches.
2x10G size(65507) result:(2533+2138)=4671 Mbit/s (gen:9612+9461 Mbit/s)
2x10G size(4416) result:(3952+3713)=7665 Mbit/s (gen:9168+8415 Mbit/s)
Patch-04: frag helper functions for mem limit tracking
-------
This patch is only meant as a preparation patch for the next
patch. The performance improvement comes from reducing the number
of atomic operations during freeing of a frag queue, by summing the
mem accounting beforehand and doing a single atomic dec.
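The "sum first, then one atomic operation" idea can be sketched in userspace with C11 atomics. The names struct frag_model and frag_queue_release_mem() are hypothetical, and the kernel uses its own atomic_t helpers rather than <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model of one fragment held by a frag queue. */
struct frag_model {
    size_t truesize;           /* memory accounted for this fragment */
    struct frag_model *next;
};

/*
 * Sum the accounting for all fragments (plus the queue's own overhead)
 * locally, then touch the shared counter once instead of once per
 * fragment. Returns the number of bytes released.
 */
size_t frag_queue_release_mem(atomic_long *mem,
                              const struct frag_model *head,
                              size_t queue_overhead)
{
    size_t sum = queue_overhead;
    const struct frag_model *f;

    for (f = head; f; f = f->next)
        sum += f->truesize;

    atomic_fetch_sub(mem, (long)sum);  /* the single atomic operation */
    return sum;
}
```

For a 45-fragment queue this replaces 45 or more atomic read-modify-write operations on a shared cache-line with one.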
2x10G size(65507) result:(2475+3101)=5576 Mbit/s (gen:9614+9439 Mbit/s)
2x10G size(4416) result:(3928+4129)=8057 Mbit/s (gen:7259+8131 Mbit/s)
Patch-05: per CPU mem limit and LRU list accounting
-------
The major performance bottleneck on NUMA systems is the mem limit
counter, which is based on an atomic counter. This patch removes the
cache-bouncing of the atomic counter by moving this accounting to be
bound to each CPU. The LRU list also needs to be done per CPU,
in order to keep the accounting straight.
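A userspace model of per CPU mem accounting might look like the sketch below. Everything here is an assumption made for illustration: the names, the fixed number of CPUs, the 64 byte cache-line size, and in particular the way the limit is applied per CPU. The kernel would use its per-CPU variable infrastructure instead of a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_NR_CPUS    4   /* assumed CPU count for the model */
#define MODEL_CACHE_LINE 64  /* assumed cache-line size in bytes */

/* One counter per CPU, each on its own cache-line to avoid bouncing. */
struct percpu_frag_mem {
    _Alignas(MODEL_CACHE_LINE) long mem;  /* only touched by the owning CPU */
};

struct frag_acct_model {
    struct percpu_frag_mem cpu[MODEL_NR_CPUS];
    long high_thresh_per_cpu;  /* hypothetical per CPU limit */
};

/* Charge bytes to the given CPU; refuse if its own limit would be exceeded. */
bool frag_mem_charge(struct frag_acct_model *a, unsigned int cpu, long bytes)
{
    struct percpu_frag_mem *p = &a->cpu[cpu % MODEL_NR_CPUS];

    if (p->mem + bytes > a->high_thresh_per_cpu)
        return false;
    p->mem += bytes;  /* no atomic needed: CPU-local data */
    return true;
}

/* Slow path: a total is only needed for reporting. */
long frag_mem_total(const struct frag_acct_model *a)
{
    long sum = 0;

    for (size_t i = 0; i < MODEL_NR_CPUS; i++)
        sum += a->cpu[i].mem;
    return sum;
}
```

The fast path reads and writes only CPU-local memory, which is why the LRU list has to be split the same way: a queue must be charged to and released from the same per CPU counter.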
2x10G size(65507) result:(9603+9458)=19061 Mbit/s (gen:9614+9458 Mbit/s)
2x10G size(4416) result:(4871+4848)=9719 Mbit/s (gen:9107+8378 Mbit/s)
To compare the benefit of the next patches, it is necessary to increase
the stress on the code, by doing 4x 10Gbit/s tests.
4x10G size(65507) result:(8631+9337+7534+6928)=32430 Mbit/s
(gen:8646+9613+7547+6937 =32743 Mbit/s)
4x10G size(4416) result:(2870+2990+2993+3016)=11869 Mbit/s
(gen:4819+7767+6893+5043 =24522 Mbit/s)
Patch-06: nqueues_under_LRU_lock
-------
This patch just moves the nqueues counter under the LRU lock (and
per CPU), instead of the write lock, to prepare for the next patch.
There is no need for performance testing this part.
Patch-07: hash_bucket_locking
-------
This patch implements per hash bucket locking for the frag queue
hash. This removes two write locks, and the only remaining write
lock is for protecting hash rebuild. This essentially reduces the
readers-writer lock to a rebuild lock.
UPDATE: This patch can result in an OOPS during hash rebuilding.
It needs more work before it is safe to apply.
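Per hash bucket locking, as described for this patch, can be sketched in userspace as below. This is only a model: the names are hypothetical, the lock is a simple C11 atomic_flag spinlock rather than a kernel spinlock, and the hash rebuild path (the part that still needs the rebuild lock) is deliberately left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define MODEL_HASH_SZ 64  /* hash size mentioned for this patch */

struct fq_model {
    unsigned int key;
    struct fq_model *next;
};

/* Each bucket carries its own lock, so different buckets never contend. */
struct bucket_model {
    atomic_flag lock;
    struct fq_model *chain;
};

struct frag_hash_model {
    struct bucket_model b[MODEL_HASH_SZ];
};

static void bucket_lock(struct bucket_model *b)
{
    while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
        ;  /* spin until the bucket is free */
}

static void bucket_unlock(struct bucket_model *b)
{
    atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

void frag_hash_init(struct frag_hash_model *h)
{
    for (size_t i = 0; i < MODEL_HASH_SZ; i++) {
        atomic_flag_clear(&h->b[i].lock);
        h->b[i].chain = NULL;
    }
}

void frag_hash_insert(struct frag_hash_model *h, struct fq_model *q)
{
    struct bucket_model *b = &h->b[q->key % MODEL_HASH_SZ];

    bucket_lock(b);  /* only this bucket is serialized */
    q->next = b->chain;
    b->chain = q;
    bucket_unlock(b);
}

struct fq_model *frag_hash_find(struct frag_hash_model *h, unsigned int key)
{
    struct bucket_model *b = &h->b[key % MODEL_HASH_SZ];
    struct fq_model *q;

    bucket_lock(b);
    for (q = b->chain; q; q = q->next)
        if (q->key == key)
            break;
    bucket_unlock(b);
    return q;
}
```

In this model, inserts and lookups on different buckets never touch the same lock; only a resize or rebuild of the whole table would need to exclude everyone.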
2x10G size(65507) result:(9602+9466)=19068 Mbit/s (gen:9613+9472 Mbit/s)
2x10G size(4416) result:(5024+4925)= 9949 Mbit/s (gen:8581+8957 Mbit/s)
To see the real benefit of this patch, we need to crank up the load
and stress on the code, with 4x 10Gbit/s at small packets. The
improvement at size(4416): before 11869 Mbit/s, now 17155 Mbit/s. Also
note the regression at size(65507): 32430 -> 31021 Mbit/s.
4x10G size(65507) result:(7618+8708+7381+7314)=31021 Mbit/s
(gen:7628+9501+8728+7321 =33178 Mbit/s)
4x10G size(4416) result:(4156+4714+4300+3985)=17155 Mbit/s
(gen:6614+5330+7745+5366 =25055 Mbit/s)
At 4x10G size(4416) I have seen 206 frag queues in use, and hash size is 64.
Patch-08: cache_align_hash_bucket
-------
Increase frag queue hash size and assure cache-line alignment to
avoid false sharing. Hash size is set to 256, because I have
observed 206 frag queues in use at 4x10G with packet size 4416 bytes.
2x10G size(65507) result:(9601+9414)=19015 Mbit/s (gen:9614+9434 Mbit/s)
2x10G size(4416) result:(5421+5268)=10689 Mbit/s (gen:8028+7457 Mbit/s)
This does introduce an improvement (although not as big as I
expected), but most importantly the regression seen in patch-07 4x10G
at size(65507) is gone (patch-05: 32430 Mbit/s -> 32676 Mbit/s).
4x10G size(65507) result:(7604+8307+9593+7172)=32676 Mbit/s
(gen:7615+8713+9606+7184 =33118 Mbit/s)
4x10G size(4416) result:(4890+4364+4139+4530)=17923 Mbit/s
(gen:5170+6873+5215+7632 =24890 Mbit/s)
After this patch it looks like the read lock is now the new
contention point.
Patch-09: Hack disable rebuild and remove rw_lock
-------
I've done a quick hack patch that removes the readers-writer lock by
disabling/breaking hash rebuilding, just to see how big the
performance gain would be.
2x10G size(4416) result:(6481+6764)=13245 Mbit/s (gen:7652+8077 Mbit/s)
4x10G size(4416) result:(5610+6283+5735+5238)=22866 Mbit/s
(gen: 6530+7860+5967+5238 =25595 Mbit/s)
And the results show that it's a big win. With 4x10G size(4416)
before: 17923 Mbit/s -> now: 22866 Mbit/s, an increase of 4943 Mbit/s.
With 2x10G size(4416) before: 10689 Mbit/s -> now: 13245 Mbit/s, an
increase of 2556 Mbit/s.
I'll work on a real solution for removing the rw_lock while still
supporting hash rebuilding. Suggestions and ideas are welcome.
This patchset is based upon:
Davem's net-next tree:
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
On top of:
commit ff33c0e1885cda44dd14c79f70df4706f83582a0
(net: Remove bogus dependencies on INET)
---
Jesper Dangaard Brouer (9):
      net: frag remove readers-writer lock (hack)
      net: increase frag queue hash size and cache-line
      net: frag queue locking per hash bucket
      net: frag, move nqueues counter under LRU lock protection
      net: frag per CPU mem limit and LRU list accounting
      net: frag helper functions for mem limit tracking
      net: frag, move LRU list maintenance outside of rwlock
      net: frag cache line adjust inet_frag_queue.net
      net: frag evictor, avoid killing warm frag queues

 include/net/inet_frag.h                 |  120 +++++++++++++++++++++++--
 include/net/ipv6.h                      |    4 -
 net/ipv4/inet_fragment.c                |  150 ++++++++++++++++++++++---------
 net/ipv4/ip_fragment.c                  |   43 +++++----
 net/ipv6/netfilter/nf_conntrack_reasm.c |   13 +--
 net/ipv6/reassembly.c                   |   16 ++-
 6 files changed, 259 insertions(+), 87 deletions(-)
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Sr. Network Kernel Developer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer