Message-Id: <20090812003049.185cd52a.billfink@mindspring.com>
Date: Wed, 12 Aug 2009 00:30:49 -0400
From: Bill Fink <billfink@...dspring.com>
To: Andi Kleen <andi@...stfloor.org>
Cc: Neil Horman <nhorman@...driver.com>,
Andrew Gallatin <gallatin@...i.com>,
Brice Goglin <Brice.Goglin@...ia.fr>,
Linux Network Developers <netdev@...r.kernel.org>,
Yinghai Lu <yhlu.kernel@...il.com>
Subject: Re: Receive side performance issue with multi-10-GigE and NUMA
On Wed, 12 Aug 2009, Andi Kleen wrote:
> Bill Fink <billfink@...dspring.com> writes:
> >
> > I originally tried to just use alloc_pages_node() instead of alloc_pages(),
> > but it didn't help. As mentioned in an earlier e-mail, that seems to
> > be because I discovered that doing:
> >
> > find /sys -name numa_node -exec grep . {} /dev/null \;
> >
> > revealed that the NUMA node associated with _all_ the PCI devices was
> > always 0, when at least some of them should have been associated with
> > NUMA node 2, including 6 of the 12 Myricom 10-GigE devices.
>
> > I discovered today that the NUMA node cpulist/cpumap is also wrong.
> > A cat of /sys/devices/system/node/node0/cpulist returns "0-7" (with a
> > cpumask of 00000000,000000ff), while the cpulist for node2 is empty
> > (with a cpumask of 00000000,00000000). The distance is correct,
> > with "10 20" for node 0 and "20 10" for node2.
>
> When the CPU nodes are not correct, the device nodes are unlikely to
> be correct either.  In fact your system likely has no node 1 configured,
> right?
That was right. There was no node 1, only nodes 0 and 2.
> This information comes from the BIOS. So either your BIOS is broken
> or you simply didn't enable NUMA mode in the BIOS, but configured
> memory interleaving.
>
> If you post dmesg output somewhere I can take a look.
I did have NUMA enabled, and memory was configured as independent
rather than interleaved.
Based on all the discussions, it seemed a good possibility that the
BIOS was broken.  Today a colleague checked the SuperMicro site, found
a newer version of the BIOS, and installed it.  Things seem better now,
but still not totally correct.
There are now NUMA nodes 0 and 1 instead of 0 and 2, and the CPUs
for node 0 are 0 through 3 while the CPUs for node 1 are 4 through 7
(previously the even CPUs were on the first Xeon 5580 processor while
the odd CPUs were on the second processor).
[root@...ntest1 ~]# numastat
                   node0           node1
numa_hit        28087735        27195340
numa_miss              0               0
numa_foreign           0               0
interleave_hit     12065           11978
local_node      28081559        27182572
other_node          6176           12768
[root@...ntest1 ~]# grep 'physical id' /proc/cpuinfo
physical id : 0
physical id : 0
physical id : 0
physical id : 0
physical id : 1
physical id : 1
physical id : 1
physical id : 1
[root@...ntest1 ~]# cat /sys/devices/system/node/node0/cpulist
0-3
[root@...ntest1 ~]# cat /sys/devices/system/node/node1/cpulist
4-7
But _all_ the PCI devices are still just on node 0.
[root@...ntest1 ~]# find /sys -name numa_node -exec grep . {} /dev/null \;
shows numa_node is always 0.
[root@...ntest1 ~]# find /sys -name local_cpulist -exec grep . {} /dev/null \;
shows local_cpulist is always 0-3.
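For what it's worth, as I understand it the sysfs numa_node attribute
just reflects dev_to_node() on the struct device, so a driver sees the
same (currently bogus) value at probe time.  A minimal sketch, with the
function name made up for illustration:

#include <linux/pci.h>
#include <linux/device.h>

static int nic_numa_node(struct pci_dev *pdev)
{
	/* same value that /sys/.../numa_node shows; -1 means no hint */
	return dev_to_node(&pdev->dev);
}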
I can now get basically the same aggregate receive-side performance
(55 Gbps) without my patch that I could previously get only with my
hacked workaround in the myri10ge driver.  But this still seems well
below what I believe the system should be capable of.
BTW when I first booted the test system after upgrading the BIOS, I got
a kernel oops because it was still running my hacked myri10ge driver,
which apparently didn't like being told to use the now-nonexistent
node 2 (I was checking for success of the alloc_pages_node() call and
falling back to the original alloc_pages() call on failure).  Or it
could have been the __alloc_skb() call, where I had a similar hack for
the skb allocation.
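For reference, the allocation fallback in the hack was roughly the
sketch below (not the actual patch; "nic_node" stands in for whatever
node the driver picked, e.g. from dev_to_node()).  Checking that the
node is actually online, which the hack didn't do, would presumably
have avoided the oops once the node numbering changed:

#include <linux/gfp.h>
#include <linux/nodemask.h>

static struct page *rx_alloc_page(int nic_node, gfp_t gfp, unsigned int order)
{
	struct page *page = NULL;

	/* only trust the node hint if that node actually exists (-1 = none) */
	if (nic_node >= 0 && node_online(nic_node))
		page = alloc_pages_node(nic_node, gfp, order);

	/* fall back to the default allocator, as the hack did on failure */
	if (!page)
		page = alloc_pages(gfp, order);

	return page;
}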
Are you still interested in me posting the dmesg output?
-Thanks
-Bill
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html