Date:	Thu, 16 Jul 2009 11:39:04 +0200
From:	Jesper Dangaard Brouer <hawk@...x.dk>
To:	Bill Fink <billfink@...dspring.com>
Cc:	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"David S. Miller" <davem@...emloft.net>,
	Robert Olsson <Robert.Olsson@...a.slu.se>,
	"Waskiewicz Jr, Peter P" <peter.p.waskiewicz.jr@...el.com>,
	"Ronciak, John" <john.ronciak@...el.com>,
	jesse.brandeburg@...el.com,
	Stephen Hemminger <shemminger@...tta.com>,
	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: Achieved 10Gbit/s bidirectional routing

On Wed, 2009-07-15 at 23:22 -0400, Bill Fink wrote:
> On Wed, 15 Jul 2009, Jesper Dangaard Brouer wrote:
> 
> > I'm giving a talk at LinuxCon, about 10Gbit/s routing on standard
> > hardware running Linux.
> > 
> >   http://linuxcon.linuxfoundation.org/meetings/1585
> >   https://events.linuxfoundation.org/lc09o17
> > 
> > I'm getting some really good 10Gbit/s bidirectional routing results
> > with Intel's latest 82599 chip.  (I got two pre-release engineering
> > samples directly from Intel; thanks, Peter.)
> > 
> > Using a Core i7-920, with the memory tuned according to the RAM's
> > X.M.P. settings (DDR3-1600MHz); note this also increases the QPI to
> > 6.4 GT/s.  (Motherboard: ASUS P6T6 WS Revolution)
> > 
> > With big 1514-byte packets, I can basically do 10Gbit/s wirespeed
> > bidirectional routing.
> > 
> > Notice that bidirectional routing means we actually have to move
> > approx. 40Gbit/s through memory and in and out of the interfaces.
> > 
> > Formatted quick view using 'ifstat -b'
> > 
> >   eth31-in   eth31-out   eth32-in  eth32-out
> >     9.57  +    9.52  +     9.51 +     9.60  = 38.20 Gbit/s
> >     9.60  +    9.55  +     9.52 +     9.62  = 38.29 Gbit/s
> >     9.61  +    9.53  +     9.52 +     9.62  = 38.28 Gbit/s
> >     9.61  +    9.53  +     9.54 +     9.62  = 38.30 Gbit/s
> > 
> > [Adding an extra NIC]
> > 
> > Another observation is that I'm hitting some kind of bottleneck on the
> > PCI-express switch.  Adding an extra NIC in a PCIe slot connected to
> > the same PCIe switch does not scale beyond 40Gbit/s collective
> > throughput.

Correcting myself, based on Bill's info below.

It does not scale when adding an extra NIC to the same NVIDIA NF200 PCIe
switch chip (the reason is explained below by Bill).

> > But I happened to have a special motherboard, the ASUS P6T6 WS
> > Revolution, which has an additional PCIe switch chip, NVIDIA's NF200.
> > 
> > Connecting two dual-port 10GbE NICs via two different PCI-express
> > switch chips makes things scale again!  I have achieved a collective
> > throughput of 66.25 Gbit/s.  This result is also limited by my pktgen
> > machines, which cannot keep up, and by getting closer to the memory
> > bandwidth limits.
> > 
> > FYI: I found a really good reference explaining the PCI-express
> > architecture, written by Intel:
> > 
> >  http://download.intel.com/design/intarch/papers/321071.pdf
> > 
> > I'm not sure how to explain the PCI-express chip bottleneck I'm
> > seeing, but my guess is that I'm limited by the number of outstanding
> > packets/DMA-transfers and the latency for the DMA operations.
> > 
> > Does any one have datasheets on the X58 and NVIDIA's NF200 PCI-express
> > chips, that can tell me the number of outstanding transfers they
> > support?
> 
> We've achieved 70 Gbps aggregate unidirectional TCP performance from
> one P6T6 based system to another.  We figured out in our case that
> we were being limited by the interconnect between the Intel X58 and
> Nvidia N200 chips.  The first 2 PCIe 2.0 slots are directly off the
> Intel X58 and get the full 40 Gbps throughput from the dual-port
> Myricom 10-GigE NICs we have installed in them.  But the other
> 3 PCIe 2.0 slots are on the Nvidia N200 chip, and I discovered
> through googling that the link between the X58 and N200 chips
> only operates at PCIe x16 _1.0_ speed, which limits the possible
> aggregate throughput of the last 3 PCIe 2.0 slots to only 32 Gbps.

This definitely explains the bottlenecks I have seen!  Thanks!
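
For reference, that 32 Gbps figure matches the raw PCIe arithmetic: PCIe 1.0
signals at 2.5 GT/s per lane with 8b/10b encoding (8 data bits carried per
10 bits on the wire), so an x16 link tops out at 16 * 2.5 * 0.8 = 32 Gbit/s
per direction, while PCIe 2.0 doubles the per-lane rate to 5 GT/s.  A minimal
sketch of the calculation:

```python
# PCIe link bandwidth, one direction, ignoring protocol (TLP/DLLP) overhead.
# 8b/10b encoding carries 8 data bits per 10 wire bits: a 0.8 factor.
def pcie_gbps(lanes, gt_per_s, encoding=0.8):
    """Raw data bandwidth of a PCIe link in Gbit/s."""
    return lanes * gt_per_s * encoding

# x16 at PCIe 1.0 speed -- the X58 <-> NF200 interconnect:
print(pcie_gbps(16, 2.5))   # 32.0 Gbit/s
# The same x16 link at full PCIe 2.0 speed would give:
print(pcie_gbps(16, 5.0))   # 64.0 Gbit/s
```

(Achievable throughput is somewhat lower still, since TLP headers and flow
control eat into the raw 32 Gbps.)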

Yes, it seems to scale when installing the two NICs in the first two
slots, both connected to the X58.  By overclocking the RAM and CPU a
bit, I can match my pktgen machines' speed, which gives a collective
throughput of 67.95 Gbit/s.

   eth33          eth34          eth31         eth32
 in     out     in     out     in    out     in    out 
7.54 + 9.58  + 9.56 + 7.56  + 7.33 + 9.53 + 9.50 + 7.35  = 67.95 Gbit/s

Now I just need a faster generator machine, to find the next bottleneck ;-)


> This was clearly seen in our nuttcp testing:
> 
> [root@...aid-1 ~]# ./nuttcp-6.2.6 -In2 -xc0/0 -p5001 192.168.1.11 &
>   ./nuttcp-6.2.6 -In3 -xc0/0 -p5002 192.168.2.11 &
>   ./nuttcp-6.2.6 -In4 -xc1/1 -p5003 192.168.3.11 &
>   ./nuttcp-6.2.6 -In5 -xc1/1 -p5004 192.168.4.11 &
>   ./nuttcp-6.2.6 -In6 -xc2/2 -p5005 192.168.5.11 &
>   ./nuttcp-6.2.6 -In7 -xc2/2 -p5006 192.168.6.11 &
>   ./nuttcp-6.2.6 -In8 -xc3/3 -p5007 192.168.7.11 &
>   ./nuttcp-6.2.6 -In9 -xc3/3 -p5008 192.168.8.11
> n2: 11505.2648 MB /  10.09 sec = 9566.2298 Mbps 37 %TX 55 %RX 0 retrans 0.10 msRTT
> n3: 11727.4489 MB /  10.02 sec = 9815.7570 Mbps 39 %TX 44 %RX 0 retrans 0.10 msRTT
> n4: 11770.1250 MB /  10.07 sec = 9803.9901 Mbps 39 %TX 51 %RX 0 retrans 0.10 msRTT
> n5: 11837.9320 MB /  10.05 sec = 9876.5725 Mbps 39 %TX 47 %RX 0 retrans 0.10 msRTT
> n6:  9096.8125 MB /  10.09 sec = 7559.3310 Mbps 30 %TX 32 %RX 0 retrans 0.10 msRTT
> n7:  9100.1211 MB /  10.10 sec = 7559.7790 Mbps 30 %TX 44 %RX 0 retrans 0.10 msRTT
> n8:  9095.6179 MB /  10.10 sec = 7557.9983 Mbps 31 %TX 33 %RX 0 retrans 0.10 msRTT
> n9:  9075.5472 MB /  10.08 sec = 7551.0234 Mbps 31 %TX 33 %RX 0 retrans 0.11 msRTT
> 
> This used 4 dual-port Myricom 10-GigE NICs.  We also tested with
> a fifth dual-port 10-GigE NIC, but the aggregate throughput stayed
> at about 70 Gbps, due to the performance bottleneck between the
> X58 and N200 chips.

These are also excellent results!
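
As a sanity check, summing the eight per-stream rates from Bill's nuttcp run
above reproduces the "about 70 Gbps" aggregate; note also that the four
slower streams (~7.55 Gbps each, about 30.2 Gbps combined) sit just under the
32 Gbps X58-to-N200 link limit:

```python
# Per-stream throughputs (Mbps) reported by nuttcp instances n2..n9 above.
streams_mbps = [9566.2298, 9815.7570, 9803.9901, 9876.5725,   # X58 slots
                7559.3310, 7559.7790, 7557.9983, 7551.0234]   # N200 slots

aggregate_gbps = sum(streams_mbps) / 1000.0
print(f"aggregate: {aggregate_gbps:.2f} Gbps")   # ~69.29 Gbps
nf200_gbps = sum(streams_mbps[4:]) / 1000.0
print(f"N200-attached: {nf200_gbps:.2f} Gbps")   # ~30.23 Gbps, under the cap
```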

Thanks a lot Bill !!!
-- 
Med venlig hilsen / Best regards
  Jesper Brouer
  ComX Networks A/S
  Linux Network developer
  Cand. Scient Datalog / MSc.
  Author of http://adsl-optimizer.dk
  LinkedIn: http://www.linkedin.com/in/brouer

