Message-ID: <20090807221211.GA16874@localhost.localdomain>
Date: Fri, 7 Aug 2009 18:12:11 -0400
From: Neil Horman <nhorman@...driver.com>
To: Bill Fink <billfink@...dspring.com>
Cc: Linux Network Developers <netdev@...r.kernel.org>, brice@...i.com,
gallatin@...i.com
Subject: Re: Receive side performance issue with multi-10-GigE and NUMA
On Fri, Aug 07, 2009 at 05:06:00PM -0400, Bill Fink wrote:
> I've run into a major receive side performance issue with multi-10-GigE
> on a NUMA system. The system is using a SuperMicro X8DAH+-F motherboard
> with 2 3.2 GHz quad-core Intel Xeon 5580 processors and 12 GB of
> 1333 MHz DDR3 memory. It is a Fedora 10 system but using the latest
> 2.6.29.6 kernel from Fedora 11 (originally tried the 2.6.27.29 kernel
> from Fedora 10).
>
> The test setup is:
>
> i7test1----(6)----xeontest1----(6)----i7test2
>          10-GigE             10-GigE
>
> So xeontest1 has 6 dual-port Myricom 10-GigE NICs for a total
> of 12 10-GigE interfaces. eth2 through eth7 (which are on the
> second Intel 5520 I/O Hub) are connected to i7test1 while
> eth8 through eth13 (which are on the first Intel 5520 I/O Hub)
> are connected to i7test2.
>
> Previous direct testing between i7test1 and i7test2 (which use an
> Asus P6T6 WS Revolution motherboard) demonstrated that they could
> achieve ~70 Gbps performance for either transmit or receive using
> 8 10-GigE interfaces.
>
> The transmit side performance of xeontest1 is fantastic:
>
> [root@...ntest1 ~]# numactl --membind=2 nuttcp -In2 -xc1/0 -p5001 192.168.1.10 & \
>   numactl --membind=2 nuttcp -In3 -xc3/0 -p5002 192.168.2.10 & \
>   numactl --membind=2 nuttcp -In4 -xc5/1 -p5003 192.168.3.10 & \
>   numactl --membind=2 nuttcp -In5 -xc7/1 -p5004 192.168.4.10 & \
>   nuttcp -In8 -xc0/0 -p5007 192.168.7.11 & \
>   nuttcp -In9 -xc2/0 -p5008 192.168.8.11 & \
>   nuttcp -In10 -xc4/1 -p5009 192.168.9.11 & \
>   nuttcp -In11 -xc6/1 -p5010 192.168.10.11 & \
>   numactl --membind=2 nuttcp -In6 -xc5/2 -p5005 192.168.5.10 & \
>   numactl --membind=2 nuttcp -In7 -xc7/3 -p5006 192.168.6.10 & \
>   nuttcp -In12 -xc4/2 -p5011 192.168.11.11 & \
>   nuttcp -In13 -xc6/3 -p5012 192.168.12.11 &
> n12: 9648.0522 MB / 10.00 sec = 8091.4066 Mbps 49 %TX 26 %RX 0 retrans 0.18 msRTT
> n9: 11130.5320 MB / 10.01 sec = 9328.3224 Mbps 47 %TX 37 %RX 0 retrans 0.19 msRTT
> n11: 9418.1250 MB / 10.00 sec = 7897.5848 Mbps 50 %TX 30 %RX 0 retrans 0.18 msRTT
> n10: 9279.4758 MB / 10.01 sec = 7778.7146 Mbps 49 %TX 28 %RX 0 retrans 0.12 msRTT
> n8: 11142.6574 MB / 10.01 sec = 9340.3789 Mbps 47 %TX 35 %RX 0 retrans 0.18 msRTT
> n13: 9422.1492 MB / 10.01 sec = 7897.4115 Mbps 49 %TX 25 %RX 0 retrans 0.17 msRTT
> n3: 11471.2500 MB / 10.01 sec = 9613.9477 Mbps 49 %TX 32 %RX 0 retrans 0.15 msRTT
> n6: 9339.6354 MB / 10.01 sec = 7828.5345 Mbps 50 %TX 25 %RX 0 retrans 0.19 msRTT
> n4: 9093.2500 MB / 10.01 sec = 7624.1589 Mbps 49 %TX 28 %RX 0 retrans 0.15 msRTT
> n5: 9121.8367 MB / 10.01 sec = 7646.8646 Mbps 50 %TX 29 %RX 0 retrans 0.17 msRTT
> n7: 9292.2500 MB / 10.01 sec = 7789.1574 Mbps 49 %TX 26 %RX 0 retrans 0.17 msRTT
> n2: 11487.1150 MB / 10.01 sec = 9627.2690 Mbps 49 %TX 46 %RX 0 retrans 0.19 msRTT
>
> Aggregate performance: 100.4637 Gbps
>
> The problem is with the receive side performance.
>
> [root@...ntest1 ~]# numactl --membind=2 nuttcp -In2 -r -xc1/0 -p5001 192.168.1.10 & \
>   numactl --membind=2 nuttcp -In3 -r -xc3/0 -p5002 192.168.2.10 & \
>   numactl --membind=2 nuttcp -In4 -r -xc5/1 -p5003 192.168.3.10 & \
>   numactl --membind=2 nuttcp -In5 -r -xc7/1 -p5004 192.168.4.10 & \
>   nuttcp -In8 -r -xc0/0 -p5007 192.168.7.11 & \
>   nuttcp -In9 -r -xc2/0 -p5008 192.168.8.11 & \
>   nuttcp -In10 -r -xc4/1 -p5009 192.168.9.11 & \
>   nuttcp -In11 -r -xc6/1 -p5010 192.168.10.11 & \
>   numactl --membind=2 nuttcp -In6 -r -xc5/2 -p5005 192.168.5.10 & \
>   numactl --membind=2 nuttcp -In7 -r -xc7/3 -p5006 192.168.6.10 & \
>   nuttcp -In12 -r -xc4/2 -p5011 192.168.11.11 & \
>   nuttcp -In13 -r -xc6/3 -p5012 192.168.12.11 &
> n11: 6983.6359 MB / 10.09 sec = 5803.2293 Mbps 13 %TX 26 %RX 0 retrans 0.11 msRTT
> n10: 7000.1557 MB / 10.11 sec = 5807.5978 Mbps 13 %TX 26 %RX 0 retrans 0.12 msRTT
> n9: 2451.7206 MB / 10.21 sec = 2014.8397 Mbps 4 %TX 13 %RX 0 retrans 0.11 msRTT
> n13: 2453.0887 MB / 10.20 sec = 2016.8751 Mbps 3 %TX 11 %RX 0 retrans 0.10 msRTT
> n12: 2446.5303 MB / 10.24 sec = 2004.4638 Mbps 4 %TX 11 %RX 0 retrans 0.10 msRTT
> n8: 2462.5890 MB / 10.26 sec = 2014.0272 Mbps 3 %TX 11 %RX 0 retrans 0.12 msRTT
> n4: 2763.5091 MB / 10.26 sec = 2258.4871 Mbps 4 %TX 14 %RX 0 retrans 0.10 msRTT
> n5: 2770.0887 MB / 10.28 sec = 2261.2562 Mbps 4 %TX 15 %RX 0 retrans 0.10 msRTT
> n2: 1777.7277 MB / 10.32 sec = 1444.9054 Mbps 2 %TX 11 %RX 0 retrans 0.11 msRTT
> n6: 1772.7962 MB / 10.31 sec = 1442.0346 Mbps 3 %TX 10 %RX 0 retrans 0.11 msRTT
> n3: 1779.4535 MB / 10.32 sec = 1446.0090 Mbps 2 %TX 11 %RX 0 retrans 0.15 msRTT
> n7: 1770.8359 MB / 10.35 sec = 1435.4757 Mbps 2 %TX 11 %RX 0 retrans 0.12 msRTT
>
> Aggregate performance: 29.9492 Gbps
>
> I suspected that this was because the memory allocated by the
> myri10ge driver was not being allocated on the optimal NUMA node.
> BTW, the NUMA nodes on the system are 0 and 2 rather than the 0 and
> 1 I would have expected, but this is my first experience with a
> NUMA system.
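>
> (FWIW, the node ids can be seen with "numactl --hardware" or
> "ls /sys/devices/system/node", which here shows node0 and node2.)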
>
> Based upon a patch by Peter Zijlstra that I found via a Google
> search, I tried patching the myri10ge driver to change its page
> allocation from alloc_pages() to alloc_pages_node(), specifying the
> NUMA node of the parent device of the Myricom 10-GigE device, which
> IIUC should be the PCIe switch. This didn't help.
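>
> Roughly, the change amounted to the following (a minimal sketch, not
> the exact patch; the names are the ones I recall from the driver's
> receive page allocation path, so take them as illustrative):
>
>         /* Before: the page lands on whichever node the allocating
>          * CPU happens to be on. */
>         page = alloc_pages(GFP_ATOMIC | __GFP_COMP, MYRI10GE_ALLOC_ORDER);
>
>         /* After: request the page on the NIC's home node, as
>          * reported by its parent device. */
>         page = alloc_pages_node(dev_to_node(&mgp->pdev->dev),
>                                 GFP_ATOMIC | __GFP_COMP,
>                                 MYRI10GE_ALLOC_ORDER);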
>
> This could be because of something else I discovered: running
>
> find /sys -name numa_node -exec grep . {} /dev/null \;
>
> showed that the numa_node attribute of every PCI device was 0,
> whereas IIUC some of the PCI devices should have been associated
> with NUMA node 2. Perhaps that is why all the memory pages
> allocated by the myri10ge driver end up on NUMA node 0, causing
> the major performance issue.
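>
> For example, for eth2 (which is on the second I/O hub, so I would
> have expected node 2 here):
>
> [root@...ntest1 ~]# cat /sys/class/net/eth2/device/numa_node
> 0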
>
> To kludge around this, I made a different patch to the myri10ge
> driver. This time I hardcoded the NUMA node in the call to
> alloc_pages_node(): node 2 for devices with an IRQ between 113 and
> 118 (eth2 through eth7), and node 0 for devices with an IRQ between
> 119 and 124 (eth8 through eth13). This is of course entirely
> specific to our system (its NUMA node ids and Myricom 10-GigE
> device IRQs) and not something that would be generically
> applicable, but it was useful as a test, and it did improve the
> receive side performance substantially!
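>
> The kludge itself looked roughly like this (illustrative only; it
> bakes in our node ids and IRQ ranges, and mgp->pdev->irq is the
> NIC's IRQ):
>
>         /* KLUDGE: NICs on the second I/O hub (eth2-eth7, IRQs
>          * 113-118) get node 2; NICs on the first I/O hub
>          * (eth8-eth13, IRQs 119-124) get node 0.  Entirely
>          * machine-specific, for testing only. */
>         int node = (mgp->pdev->irq >= 113 && mgp->pdev->irq <= 118) ? 2 : 0;
>
>         page = alloc_pages_node(node, GFP_ATOMIC | __GFP_COMP,
>                                 MYRI10GE_ALLOC_ORDER);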
>
> [root@...ntest1 ~]# numactl --membind=2 nuttcp -In2 -r -xc1/0 -p5001 192.168.1.10 & \
>   numactl --membind=2 nuttcp -In3 -r -xc3/0 -p5002 192.168.2.10 & \
>   numactl --membind=2 nuttcp -In4 -r -xc5/1 -p5003 192.168.3.10 & \
>   numactl --membind=2 nuttcp -In5 -r -xc7/1 -p5004 192.168.4.10 & \
>   nuttcp -In8 -r -xc0/0 -p5007 192.168.7.11 & \
>   nuttcp -In9 -r -xc2/0 -p5008 192.168.8.11 & \
>   nuttcp -In10 -r -xc4/1 -p5009 192.168.9.11 & \
>   nuttcp -In11 -r -xc6/1 -p5010 192.168.10.11 & \
>   numactl --membind=2 nuttcp -In6 -r -xc5/2 -p5005 192.168.5.10 & \
>   numactl --membind=2 nuttcp -In7 -r -xc7/3 -p5006 192.168.6.10 & \
>   nuttcp -In12 -r -xc4/2 -p5011 192.168.11.11 & \
>   nuttcp -In13 -r -xc6/3 -p5012 192.168.12.11 &
> n5: 8221.2911 MB / 10.09 sec = 6836.0343 Mbps 17 %TX 31 %RX 0 retrans 0.12 msRTT
> n4: 8237.9524 MB / 10.10 sec = 6840.2379 Mbps 16 %TX 31 %RX 0 retrans 0.11 msRTT
> n11: 7935.3750 MB / 10.11 sec = 6586.2476 Mbps 15 %TX 29 %RX 0 retrans 0.16 msRTT
> n2: 4543.1621 MB / 10.13 sec = 3763.0669 Mbps 9 %TX 21 %RX 0 retrans 0.12 msRTT
> n10: 7916.3925 MB / 10.13 sec = 6555.5210 Mbps 15 %TX 28 %RX 0 retrans 0.13 msRTT
> n7: 4558.4817 MB / 10.14 sec = 3771.6557 Mbps 7 %TX 22 %RX 0 retrans 0.10 msRTT
> n13: 4390.1875 MB / 10.14 sec = 3633.6421 Mbps 6 %TX 21 %RX 0 retrans 0.12 msRTT
> n3: 4572.6478 MB / 10.15 sec = 3778.2596 Mbps 9 %TX 21 %RX 0 retrans 0.14 msRTT
> n6: 4564.4776 MB / 10.14 sec = 3774.4373 Mbps 9 %TX 21 %RX 0 retrans 0.11 msRTT
> n8: 4409.8551 MB / 10.16 sec = 3642.1920 Mbps 8 %TX 19 %RX 0 retrans 0.12 msRTT
> n9: 4412.7836 MB / 10.16 sec = 3643.7788 Mbps 8 %TX 20 %RX 0 retrans 0.14 msRTT
> n12: 4413.4061 MB / 10.16 sec = 3645.2544 Mbps 8 %TX 21 %RX 0 retrans 0.11 msRTT
>
> Aggregate performance: 56.4703 Gbps
>
> This was basically double the previous receive side performance
> without the patch.
>
> I don't know if this is fundamentally a myri10ge driver issue or
> some underlying Linux kernel issue, so it's not clear to me what
> a proper fix would be.
>
> Finally, while this is definitely a major improvement, I think it
> should be possible to do even better, since we achieved 70 Gbps in
> the i7-to-i7 tests, and probably could have done 80 Gbps except for
> an Asus motherboard restriction on the interconnect between the
> Intel X58 and Nvidia NF200 chips. If this issue can be resolved
> properly, it would definitely be a big step in the right direction.
>
> Any help greatly appreciated in advance.
>
> -Thanks
>
> -Bill

Your timing is impeccable! I just posted a patch for an ftrace module to help
detect just this kind of condition:
http://marc.info/?l=linux-netdev&m=124967650218846&w=2
Hope that helps you out.
Neil