netdev - Re: Achieved 10Gbit/s bidirectional routing

Message-Id: <20090715232253.91d9f264.billfink@mindspring.com>
Date:	Wed, 15 Jul 2009 23:22:53 -0400
From:	Bill Fink <billfink@...dspring.com>
To:	Jesper Dangaard Brouer <hawk@...x.dk>
Cc:	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"David S. Miller" <davem@...emloft.net>,
	Robert Olsson <Robert.Olsson@...a.slu.se>,
	"Waskiewicz Jr, Peter P" <peter.p.waskiewicz.jr@...el.com>,
	"Ronciak, John" <john.ronciak@...el.com>,
	jesse.brandeburg@...el.com,
	Stephen Hemminger <shemminger@...tta.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: Achieved 10Gbit/s bidirectional routing

On Wed, 15 Jul 2009, Jesper Dangaard Brouer wrote:

> I'm giving a talk at LinuxCon, about 10Gbit/s routing on standard
> hardware running Linux.
> 
>   http://linuxcon.linuxfoundation.org/meetings/1585
>   https://events.linuxfoundation.org/lc09o17
> 
> I'm getting some really good 10Gbit/s bidirectional routing results
> with Intel's latest 82599 chip. (I got two pre-release engineering
> samples directly from Intel, thanks Peter)
> 
> Using a Core i7-920, and tuning the memory according to the RAM's
> X.M.P. settings (DDR3-1600MHz); notice this also increases the QPI to
> 6.4GT/s.  (Motherboard P6T6 WS revolution)
> 
> With big 1514 bytes packets, I can basically do 10Gbit/s wirespeed
> bidirectional routing.
> 
> Notice bidirectional routing means that we actually have to move approx
> 40Gbit/s through memory and in-and-out of the interfaces.
> 
> Formatted quick view using 'ifstat -b'
> 
>   eth31-in   eth31-out   eth32-in  eth32-out
>     9.57  +    9.52  +     9.51 +     9.60  = 38.20 Gbit/s
>     9.60  +    9.55  +     9.52 +     9.62  = 38.29 Gbit/s
>     9.61  +    9.53  +     9.52 +     9.62  = 38.28 Gbit/s
>     9.61  +    9.53  +     9.54 +     9.62  = 38.30 Gbit/s
> 
> [Adding an extra NIC]
> 
> Another observation is that I'm hitting some kind of bottleneck on the
> PCI-express switch.  Adding an extra NIC in a PCIe slot connected to
> the same PCIe switch, does not scale beyond 40Gbit/s collective
> throughput.
> 
> But, I happened to have a special motherboard ASUS P6T6 WS revolution,
> which has an additional PCIe switch chip NVIDIA's NF200.
> 
> Connecting two dual port 10GbE NICs via two different PCI-express
> switch chips, makes things scale again!  I have achieved a collective
> throughput of 66.25 Gbit/s.  This result is also limited because my
> pktgen machines cannot keep up, and I'm getting closer to the memory
> bandwidth limits.
> 
> FYI: I found a really good reference explaining the PCI-express
> architecture, written by Intel:
> 
>  http://download.intel.com/design/intarch/papers/321071.pdf
> 
> I'm not sure how to explain the PCI-express chip bottleneck I'm
> seeing, but my guess is that I'm limited by the number of outstanding
> packets/DMA-transfers and the latency for the DMA operations.
> 
> Does anyone have datasheets on the X58 and NVIDIA's NF200 PCI-express
> chips, that can tell me the number of outstanding transfers they
> support?

We've achieved 70 Gbps aggregate unidirectional TCP performance from
one P6T6 based system to another.  We figured out that in our case
we were being limited by the interconnect between the Intel X58 and
Nvidia NF200 chips.  The first 2 PCIe 2.0 slots are directly off the
Intel X58 and get the full 40 Gbps throughput from the dual-port
Myricom 10-GigE NICs we have installed in them.  But the other
3 PCIe 2.0 slots are on the Nvidia NF200 chip, and I discovered
through googling that the link between the X58 and NF200 chips
only operates at PCIe x16 _1.0_ speed, which limits the possible
aggregate throughput of the last 3 PCIe 2.0 slots to only 32 Gbps.
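
For anyone wanting to check the 32 Gbps figure, here is a rough sketch
of the arithmetic in Python.  It assumes the standard per-lane
signalling rates (2.5 GT/s for PCIe 1.x, 5.0 GT/s for PCIe 2.0) and
8b/10b line encoding, and it ignores packet (TLP) and flow-control
overhead, so real throughput will come in somewhat lower.  The function
and table names are only illustrative.

```python
# Per-direction PCIe data rate, before protocol overhead.
# Assumed per-lane signalling rates in GT/s (not taken from the thread).
GT_PER_LANE = {"1.0": 2.5, "2.0": 5.0}


def pcie_gbps(gen, lanes):
    """Return the per-direction data rate in Gbit/s for a PCIe link.

    PCIe 1.x and 2.0 use 8b/10b encoding, so only 8 of every
    10 bits on the wire carry data.
    """
    return GT_PER_LANE[gen] * lanes * 8 / 10


uplink_gen1 = pcie_gbps("1.0", 16)  # link running at x16 1.0 speed
uplink_gen2 = pcie_gbps("2.0", 16)  # same width at 2.0 speed, for comparison

print("x16 PCIe 1.0: %.1f Gbit/s per direction" % uplink_gen1)
print("x16 PCIe 2.0: %.1f Gbit/s per direction" % uplink_gen2)
```

At x16 1.0 speed the link tops out at 32 Gbit/s in each direction,
half of what the same width would carry at 2.0 speed.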

This was clearly seen in our nuttcp testing:

[root@...aid-1 ~]# ./nuttcp-6.2.6 -In2 -xc0/0 -p5001 192.168.1.11 & \
    ./nuttcp-6.2.6 -In3 -xc0/0 -p5002 192.168.2.11 & \
    ./nuttcp-6.2.6 -In4 -xc1/1 -p5003 192.168.3.11 & \
    ./nuttcp-6.2.6 -In5 -xc1/1 -p5004 192.168.4.11 & \
    ./nuttcp-6.2.6 -In6 -xc2/2 -p5005 192.168.5.11 & \
    ./nuttcp-6.2.6 -In7 -xc2/2 -p5006 192.168.6.11 & \
    ./nuttcp-6.2.6 -In8 -xc3/3 -p5007 192.168.7.11 & \
    ./nuttcp-6.2.6 -In9 -xc3/3 -p5008 192.168.8.11
n2: 11505.2648 MB /  10.09 sec = 9566.2298 Mbps 37 %TX 55 %RX 0 retrans 0.10 msRTT
n3: 11727.4489 MB /  10.02 sec = 9815.7570 Mbps 39 %TX 44 %RX 0 retrans 0.10 msRTT
n4: 11770.1250 MB /  10.07 sec = 9803.9901 Mbps 39 %TX 51 %RX 0 retrans 0.10 msRTT
n5: 11837.9320 MB /  10.05 sec = 9876.5725 Mbps 39 %TX 47 %RX 0 retrans 0.10 msRTT
n6:  9096.8125 MB /  10.09 sec = 7559.3310 Mbps 30 %TX 32 %RX 0 retrans 0.10 msRTT
n7:  9100.1211 MB /  10.10 sec = 7559.7790 Mbps 30 %TX 44 %RX 0 retrans 0.10 msRTT
n8:  9095.6179 MB /  10.10 sec = 7557.9983 Mbps 31 %TX 33 %RX 0 retrans 0.10 msRTT
n9:  9075.5472 MB /  10.08 sec = 7551.0234 Mbps 31 %TX 33 %RX 0 retrans 0.11 msRTT
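
To make the split easier to see, here is a small Python sketch that
parses the nuttcp lines above and sums them in two groups.  The
grouping is an assumption based on the rates alone: streams n2-n5
(about 9.6-9.9 Gbps each) are taken to be the NICs in the X58 slots,
and n6-n9 (about 7.56 Gbps each) the NICs behind the second chip.  The
regular expression and variable names are only illustrative.

```python
import re

# nuttcp result lines copied from the run above.
RESULTS = """\
n2: 11505.2648 MB /  10.09 sec = 9566.2298 Mbps 37 %TX 55 %RX 0 retrans 0.10 msRTT
n3: 11727.4489 MB /  10.02 sec = 9815.7570 Mbps 39 %TX 44 %RX 0 retrans 0.10 msRTT
n4: 11770.1250 MB /  10.07 sec = 9803.9901 Mbps 39 %TX 51 %RX 0 retrans 0.10 msRTT
n5: 11837.9320 MB /  10.05 sec = 9876.5725 Mbps 39 %TX 47 %RX 0 retrans 0.10 msRTT
n6:  9096.8125 MB /  10.09 sec = 7559.3310 Mbps 30 %TX 32 %RX 0 retrans 0.10 msRTT
n7:  9100.1211 MB /  10.10 sec = 7559.7790 Mbps 30 %TX 44 %RX 0 retrans 0.10 msRTT
n8:  9095.6179 MB /  10.10 sec = 7557.9983 Mbps 31 %TX 33 %RX 0 retrans 0.10 msRTT
n9:  9075.5472 MB /  10.08 sec = 7551.0234 Mbps 31 %TX 33 %RX 0 retrans 0.11 msRTT
"""

# Capture the stream tag (n2..n9) and the Mbps figure from each line.
LINE_RE = re.compile(r"^(n\d+):\s+[\d.]+ MB /\s+[\d.]+ sec = ([\d.]+) Mbps")


def parse_mbps(text):
    """Map each stream tag to its reported throughput in Mbps."""
    rates = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            rates[match.group(1)] = float(match.group(2))
    return rates


rates = parse_mbps(RESULTS)

# Assumed grouping (inferred from the rates, not stated in the output).
fast_gbps = sum(rates["n%d" % i] for i in range(2, 6)) / 1000.0
slow_gbps = sum(rates["n%d" % i] for i in range(6, 10)) / 1000.0
total_gbps = fast_gbps + slow_gbps

print("n2-n5: %.2f Gbps" % fast_gbps)
print("n6-n9: %.2f Gbps" % slow_gbps)
print("total: %.2f Gbps" % total_gbps)
```

The second group sums to a little over 30 Gbps, just under the 32 Gbps
ceiling of an x16 1.0 link, while the first group comes close to the
full 40 Gbps.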

This used 4 dual-port Myricom 10-GigE NICs.  We also tested with
a fifth dual-port 10-GigE NIC, but the aggregate throughput stayed
at about 70 Gbps, due to the performance bottleneck between the
X58 and NF200 chips.

						-Bill