Date:	Tue, 27 Jan 2009 00:10:44 +0100
From:	Eric Dumazet <dada1@...mosbay.com>
To:	Rick Jones <rick.jones2@...com>
CC:	Linux Network Development list <netdev@...r.kernel.org>,
	Netfilter Developers <netfilter-devel@...r.kernel.org>,
	Stephen Hemminger <shemminger@...tta.com>,
	Patrick McHardy <kaber@...sh.net>
Subject: Re: 32 core net-next stack/netfilter "scaling"

Rick Jones wrote:
> Folks -
> 
> Under:
> 
> ftp://ftp.netperf.org/iptable_scaling
> 
> can be found netperf results and Caliper profiles for three scenarios on
> a 32-core, 1.6 GHz 'Montecito' rx8640 system.  An rx8640 is what HP call
> a "cell based" system in that it is comprised of "cell boards" on which
> reside CPU and memory resources.  In this case there are four cell
> boards, each with 4, dual-core Montecito processors and 1/4 of the
> overall RAM.  The system was configured with a mix of cell-local and
> global interleaved memory, where the global interleave is on a cacheline
> (128 byte) boundary (IIRC).  Total RAM in the system is 256 GB.  The
> cells are joined via cross-bar connections. (numactl --hardware output
> is available under the URL above)
> 
> There was an "I/O expander" connected to the system.  This meant there
> were as many distinct PCI-X domains as there were cells, and every cell
> had a "local" set of PCI-X slots.
> 
> Into those slots I placed four HP AD385A PCI-X 10Gbit Ethernet NICs -
> aka Neterion XFrame IIs.  These were then connected to an HP ProCurve
> 5806 switch, which was in turn connected to three, 4P/16C, 2.3 GHz HP
> DL585 G5s, each of which had a pair of HP AD386A PCIe 10Gbit Ethernet
> NICs (Aka Chelsio T3C-based).  They were running RHEL 5.2 I think.  Each
> NIC was in either a PCI-X 2.0 266 MHz slot (rx8640) or a PCIe 1.mumble
> x8 slot (DL585 G5)
> 
> The kernel is from DaveM's net-next tree, ca. last week, with multiq enabled.
> The s2io driver is Neterion's out-of-tree version 2.0.36.15914 to get
> multiq support.  It was loaded into the kernel via:
> 
> insmod ./s2io.ko tx_steering_type=3 tx_fifo_num=8
> 
> There were then 8 tx queues and 8 rx queues per interface in the
> rx8640.  The "setaffinity.txt" script was used to set the IRQ affinities
> to cores "closest" to the physical NIC. In all three tests all 32 cores
> went to 100% utilization. At least for all incense and porpoises. (there
> was some occasional idle reported by top on the full_iptables run)
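
[The affinity setup described above amounts to writing hex CPU masks to
/proc/irq/<n>/smp_affinity, one per MSI-X vector.  A minimal sketch of the
idea -- the IRQ numbers and FIRST_CORE below are hypothetical; the real
values come from /proc/interrupts and the setaffinity.txt script under the
URL above:]

```shell
#!/bin/sh
# Sketch: compute the per-IRQ affinity writes for 8 MSI-X vectors of one
# NIC, pinning each vector to one of 8 "close" cores.  The IRQ numbers
# (50..57) and FIRST_CORE are hypothetical placeholders.
FIRST_CORE=0
i=0
for irq in 50 51 52 53 54 55 56 57; do
    core=$((FIRST_CORE + i))
    mask=$(printf '%x' $((1 << core)))   # smp_affinity takes a hex CPU mask
    echo "echo $mask > /proc/irq/$irq/smp_affinity"
    i=$((i + 1))
done
```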
> 
> A set of 64 concurrent "burst mode" netperf omni RR tests (TCP) with a
> burst mode of 17 were run (i.e. 17 "transactions" outstanding on a
> connection at one time), with TCP_NODELAY set, and the results gathered
> along with a set of Caliper profiles.  The script used to launch these
> can be found in "runemomniagg2.sh.txt" under the URL above.
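
[For readers who don't want to dig through the script: a single instance of
such a test looks roughly like the line below.  The peer hostname is a
hypothetical placeholder; -b enables burst mode (netperf must be built with
--enable-burst) and -D sets TCP_NODELAY.]

```shell
# One of the 64 concurrent burst-mode omni request/response tests (sketch;
# "dl585-1" is a placeholder for one of the DL585 G5 peers).
netperf -t omni -H dl585-1 -l 60 -- -d rr -b 16 -D &
```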
> 
> I picked an "RR" test to maximize the trips up and down the stack while
> minimizing the bandwidth consumed.
> 
> I picked a burst size of 16 because that was sufficient to saturate a
> single core on the rx8640.
> 
> I picked 64 concurrent netperfs because I wanted to make sure I had
> enough concurrent connections to get spread across all the cores/queues
> by the algorithms in place.
> 
> I picked the combination of 64 and 16 rather than say 1024 and 0 (one
> tran at a time) because I didn't want to run a context switching
> benchmark :)
> 
> The rx8640 was picked because it was available and I was confident it
> was not going to have any hardware scaling issues getting in the way.  I
> wanted to see SW issues, not HW issues. I am ass-u-me-ing the rx8640 is
> a reasonable analog for any "decent or better scaling" 32 core hardware
> and that while there are ia64-specific routines present in the profiles,
> they are there for platform-independent reasons.
> 
> The no_iptables/ data was run after a fresh boot, with no iptables
> commands run and so no iptables related modules loaded into the kernel.
> 
> The empty_iptables/ data was run after an "iptables --list" command
> which loaded one or two modules into the kernel.
> 
> The full_iptables/ data was run after an "iptables-restore" command
> pointed at full_iptables/iptables.txt  which was created from what RH
> creates by default when one enables firewall via their installer, with a
> port range added by me to allow pretty much anything netperf would ask. 
> As such, while it does exercise netfilter functionality, I cannot make
> any claims as to its "real world" applicability.  (while the firewall
> settings came from an RH setup, FWIW, the base bits running on the
> rx8640 are Debian Lenny, with the net-next kernel on top)
> 
> The "cycles" profile is able to grab flat profile hits while interrupts
> are disabled so it can see stuff happening while interrupts are
> disabled.  The "scgprof" profile is an attempt to get some call graphs -
> it does not have visibility into code running with interrupts disabled. 
> The "cache" profile is a profile that looks to get some cache miss
> information.
> 
> So, having said all that, details can be found under the previously
> mentioned URL.  Some quick highlights:
> 
> no_iptables - ~22000 transactions/s/netperf.  Top of the cycles profile
> looks like:
> 
> Function Summary
> -----------------------------------------------------------------------
> % Total
>      IP  Cumulat             IP
> Samples    % of         Samples
>  (ETB)     Total         (ETB)   Function                          File
> -----------------------------------------------------------------------
>    5.70     5.70         37772   s2io.ko::tx_intr_handler
>    5.14    10.84         34012   vmlinux::__ia64_readq
>    4.88    15.72         32285   s2io.ko::s2io_msix_ring_handle
>    4.63    20.34         30625   s2io.ko::rx_intr_handler
>    4.60    24.94         30429   s2io.ko::s2io_xmit
>    3.85    28.79         25488   s2io.ko::s2io_poll_msix
>    2.87    31.65         18987   vmlinux::dev_queue_xmit
>    2.51    34.16         16620   vmlinux::tcp_sendmsg
>    2.51    36.67         16588   vmlinux::tcp_ack
>    2.15    38.82         14221   vmlinux::__inet_lookup_established
>    2.10    40.92         13937   vmlinux::ia64_spinlock_contention
> 
> empty_iptables - ~12000 transactions/s/netperf.  Top of the cycles
> profile looks like:
> 
> Function Summary
> -----------------------------------------------------------------------
> % Total
>      IP  Cumulat             IP
> Samples    % of         Samples
>  (ETB)     Total         (ETB)   Function                          File
> -----------------------------------------------------------------------
>   26.38    26.38        137458   vmlinux::_read_lock_bh
>   10.63    37.01         55388   vmlinux::local_bh_enable_ip
>    3.42    40.43         17812   s2io.ko::tx_intr_handler
>    3.01    43.44         15691   ip_tables.ko::ipt_do_table
>    2.90    46.34         15100   vmlinux::__ia64_readq
>    2.72    49.06         14179   s2io.ko::rx_intr_handler
>    2.55    51.61         13288   s2io.ko::s2io_xmit
>    1.98    53.59         10329   s2io.ko::s2io_msix_ring_handle
>    1.75    55.34          9104   vmlinux::dev_queue_xmit
>    1.64    56.98          8546   s2io.ko::s2io_poll_msix
>    1.52    58.50          7943   vmlinux::sock_wfree
>    1.40    59.91          7302   vmlinux::tcp_ack
> 
> full_iptables - some test instances didn't complete, I think they got
> starved. Of those which did complete, their performance ranged all the
> way from 330 to 3100 transactions/s/netperf.  Top of the cycles profile
> looks like:
> 
> Function Summary
> -----------------------------------------------------------------------
> % Total
>      IP  Cumulat             IP
> Samples    % of         Samples
>  (ETB)     Total         (ETB)   Function                          File
> -----------------------------------------------------------------------
>   64.71    64.71        582171   vmlinux::_write_lock_bh
>   18.43    83.14        165822   vmlinux::ia64_spinlock_contention
>    2.86    85.99         25709   nf_conntrack.ko::init_module
>    2.36    88.35         21194   nf_conntrack.ko::tcp_packet
>    1.78    90.13         16009   vmlinux::_spin_lock_bh
>    1.20    91.33         10810   nf_conntrack.ko::nf_conntrack_in
>    1.20    92.52         10755   vmlinux::nf_iterate
>    1.09    93.62          9833   vmlinux::default_idle
>    0.26    93.88          2331   vmlinux::__ia64_readq
>    0.25    94.12          2213   vmlinux::__interrupt
>    0.24    94.37          2203   s2io.ko::tx_intr_handler
> 
> Suggestions as to things to look at/with and/or patches to try are
> welcome.  I should have the HW available to me for at least a little
> while, but not indefinitely.
> 
> rick jones

Hi Rick, nice hardware you have :)

Stephen had a patch to nuke read_lock() from iptables, using RCU and seqlocks.
I hit this contention point even with low-cost hardware and a quite standard application.

I pinged him a few days ago to try to finish the job with him, but it seems Stephen
is busy at the moment.

Then conntrack (tcp sessions) is awful, since it uses a single rwlock_t, tcp_lock,
that must be write_locked() for basically every handled tcp frame...

How long is "not indefinitely"?

