netdev - Re: Receive side performance issue with multi-10-GigE and NUMA


Message-Id: <20090812003049.185cd52a.billfink@mindspring.com>
Date:	Wed, 12 Aug 2009 00:30:49 -0400
From:	Bill Fink <billfink@...dspring.com>
To:	Andi Kleen <andi@...stfloor.org>
Cc:	Neil Horman <nhorman@...driver.com>,
	Andrew Gallatin <gallatin@...i.com>,
	Brice Goglin <Brice.Goglin@...ia.fr>,
	Linux Network Developers <netdev@...r.kernel.org>,
	Yinghai Lu <yhlu.kernel@...il.com>
Subject: Re: Receive side performance issue with multi-10-GigE and NUMA

On Wed, 12 Aug 2009, Andi Kleen wrote:

> Bill Fink <billfink@...dspring.com> writes:
> >
> > I originally tried to just use alloc_pages_node() instead of alloc_pages(),
> > but it didn't help.  As mentioned in an earlier e-mail, that seems to
> > be because I discovered that doing:
> >
> > 	find /sys -name numa_node -exec grep . {} /dev/null \;
> >
> > revealed that the NUMA node associated with _all_ the PCI devices was
> > always 0, when at least some of them should have been associated with
> > NUMA node 2, including 6 of the 12 Myricom 10-GigE devices.
> 
> > I discovered today that the NUMA node cpulist/cpumap is also wrong.
> > A cat of /sys/devices/system/node/node0/cpulist returns "0-7" (with a
> > cpumask of 00000000,000000ff), while the cpulist for node2 is empty
> > (with a cpumask of 00000000,00000000).  The distance is correct,
> > with "10 20" for node 0 and "20 10" for node2.
> 
> When the CPU nodes are not correct the device nodes are unlikely
> to be correct either. In fact your system likely has no node 1 configured,
> right?

That was right.  There was no node 1, only nodes 0 and 2.
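
As a cross-check on the cpumask/cpulist pairs quoted above (a mask of
00000000,000000ff for CPUs 0-7, and an all-zero mask for the empty node),
here is a minimal sketch of decoding such a mask by hand. The function
name cpumask_to_cpus is made up for illustration, and the sketch assumes
the usual sysfs layout of comma-separated 32-bit hex words with the most
significant word first.

```shell
# Decode a sysfs-style hex cpumask (comma-separated 32-bit words, most
# significant word first) into a space-separated list of CPU numbers.
# cpumask_to_cpus is a hypothetical helper name, not an existing tool.
# The body runs in a subshell "( ... )" so its variables stay local.
cpumask_to_cpus() (
    mask=$(printf '%s' "$1" | tr -d ',')
    cpu=0
    cpus=""
    # Consume hex digits from the least significant (rightmost) end.
    while [ -n "$mask" ]; do
        digit=${mask#"${mask%?}"}   # last character of the mask
        mask=${mask%?}              # everything except that character
        value=$(( 0x$digit ))
        for bit in 0 1 2 3; do
            if [ $(( (value >> bit) & 1 )) -eq 1 ]; then
                cpus="${cpus:+$cpus }$cpu"
            fi
            cpu=$(( cpu + 1 ))
        done
    done
    printf '%s\n' "$cpus"
)
```

Calling cpumask_to_cpus 00000000,000000ff lists CPUs 0 through 7, which
matches the "0-7" cpulist reported for node 0, while the all-zero mask
for node 2 decodes to an empty list.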

> This information comes from the BIOS. So either your BIOS is broken
> or you simply didn't enable NUMA mode in the BIOS, but configured
> memory interleaving.
> 
> If you post dmesg output somewhere I can take a look.

I did have NUMA enabled, and memory was configured as independent
rather than interleaved.

Based on all the discussions, it seemed a good possibility that the
BIOS was broken.  Today a colleague checked the SuperMicro site, and
discovered and installed a newer version of the BIOS.  Things seem
better now, but not totally correct.

There are now NUMA nodes 0 and 1 instead of 0 and 2, and the CPUs
for node 0 are 0 through 3 while the CPUs for node 1 are 4 through 7
(previously the even CPUs were on the first Xeon 5580 processor while
the odd CPUs were on the second processor).

[root@...ntest1 ~]# numastat
                           node0           node1
numa_hit                28087735        27195340
numa_miss                      0               0
numa_foreign                   0               0
interleave_hit             12065           11978
local_node              28081559        27182572
other_node                  6176           12768

[root@...ntest1 ~]# grep 'physical id' /proc/cpuinfo
physical id     : 0
physical id     : 0
physical id     : 0
physical id     : 0
physical id     : 1
physical id     : 1
physical id     : 1
physical id     : 1

[root@...ntest1 ~]# cat /sys/devices/system/node/node0/cpulist
0-3
[root@...ntest1 ~]# cat /sys/devices/system/node/node1/cpulist
4-7
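
The per-node cpulist checks above can be repeated for every node in one
pass. This is only a sketch: the function name list_node_cpus and its
optional directory argument are assumptions made for illustration, with
the argument defaulting to the /sys/devices/system/node path used above.

```shell
# Print "nodeN: <cpulist>" for each NUMA node directory found.
# list_node_cpus is a hypothetical helper; the optional first argument
# lets it be pointed at a copy of the sysfs tree instead of the live one.
list_node_cpus() (
    root=${1:-/sys/devices/system/node}
    for dir in "$root"/node[0-9]*; do
        # Skip anything without a readable cpulist (including an
        # unmatched glob, which stays as the literal pattern).
        [ -r "$dir/cpulist" ] || continue
        cpus=$(cat "$dir/cpulist")
        # A node with no CPUs has an empty cpulist; make that visible.
        printf '%s: %s\n' "${dir##*/}" "${cpus:-<no CPUs>}"
    done
)
```

With the old BIOS this would have flagged node2 as having no CPUs at
all, which is the symptom described earlier in the thread.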

But _all_ the PCI devices are still just on node 0.

[root@...ntest1 ~]# find /sys -name numa_node -exec grep . {} /dev/null \;

shows numa_node is always 0.

[root@...ntest1 ~]# find /sys -name local_cpulist -exec grep . {} /dev/null \;

shows local_cpulist is always 0-3.
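
To see at a glance how the devices are spread across nodes, the
numa_node values can be tallied instead of read one by one. The sketch
below is an illustration only: count_devices_by_node is a made-up name,
and it assumes the devices of interest appear under
/sys/bus/pci/devices, each with a numa_node file (which reads -1 when
the kernel has no locality information for the device).

```shell
# Tally PCI devices by the NUMA node reported in their numa_node file.
# count_devices_by_node is a hypothetical helper; the optional first
# argument overrides the directory that holds the per-device entries.
count_devices_by_node() (
    root=${1:-/sys/bus/pci/devices}
    for f in "$root"/*/numa_node; do
        [ -r "$f" ] && cat "$f"
    done |
        awk '{ n[$1]++ }
             END { for (k in n) print "node " k ": " n[k] " device(s)" }' |
        LC_ALL=C sort
)
```

On the system described here, every device would land in the "node 0"
line, whereas a correct setup should show some of the Myricom 10-GigE
devices on the second node.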

I can now get essentially the same level of aggregate receive-side
performance (55 Gbps) without my patch that I could previously get
only with my hacked workaround in the myri10ge driver.  But this
still seems significantly below what I believe the system should be
capable of.

BTW, when I first booted the test system after upgrading the BIOS,
I got a kernel oops because it was still using my hacked myri10ge
driver, and apparently it didn't like that I was telling it to
use a by-then nonexistent node 2 (I was checking for success of the
alloc_pages_node() call and falling back to the original alloc_pages()
call on failure).  Or it could have been on the __alloc_skb() call
where I had a similar hack for the skb allocation.
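
A user-space sanity check before hard-coding a node number into a test
driver might look like the sketch below. It is not the kernel-side fix
and says nothing about what alloc_pages_node() itself does; it only
checks a node id against the list in /sys/devices/system/node/online.
The function name node_is_online and its optional file argument are
assumptions made for illustration.

```shell
# Succeed (exit status 0) if the given node id appears in the kernel's
# list of online NUMA nodes, e.g. "0-1" or "0,2".
# node_is_online is a hypothetical helper; the optional second argument
# overrides the file that holds the list.
node_is_online() (
    node=$1
    online_file=${2:-/sys/devices/system/node/online}
    [ -r "$online_file" ] || exit 2
    list=$(cat "$online_file")
    # Split on commas; each item is a single id "N" or a range "A-B".
    IFS=,
    for item in $list; do
        lo=${item%-*}
        hi=${item#*-}
        if [ "$node" -ge "$lo" ] && [ "$node" -le "$hi" ]; then
            exit 0
        fi
    done
    exit 1
)
```

For example, "node_is_online 2 || echo 'node 2 is gone'" would have
warned about the renumbering from nodes 0 and 2 to nodes 0 and 1 after
the BIOS upgrade.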

Are you still interested in me posting the dmesg output?

						-Thanks

						-Bill