Message-ID: <4B0C19A2.9040906@gmail.com>
Date: Tue, 24 Nov 2009 18:36:34 +0100
From: Eric Dumazet <eric.dumazet@...il.com>
To: Tom Herbert <therbert@...gle.com>
CC: Linux Netdev List <netdev@...r.kernel.org>
Subject: Re: NUMA and multiQ interaction
Tom Herbert wrote:
> This is a question about the expected interaction between NUMA and
> receive multiqueue. Our test setup is a 16-core AMD system with 4
> sockets, one NUMA node per socket, and a bnx2x NIC. The test runs
> 500 netperf RR streams with a one-byte request/response, using
> net-next-2.6.
>
> The highest throughput we are seeing is with 4 queues (1 queue processed
> per socket), giving 361862 tps at 67% CPU. 16 queues (1 queue per
> CPU) gives 226722 tps at 30.43% CPU.
>
> However, with a modified kernel that does RX skb allocations from the
> local node rather than the device's NUMA node, I'm getting 923422 tps
> at 100% CPU. This is much higher tps and better CPU utilization than
> the case where allocations come from the device NUMA node. It
> appears that cross-node allocations are causing a significant
> performance hit. For a 2.5x performance improvement I'm kind of
> motivated to revert netdev_alloc_skb to when it did not pay attention
> to the NUMA node :-)
>
> What is the expected interaction here, and would these results be
> typical? If so, would this warrant associating each RX queue with a
> NUMA node, instead of just the device?
>
I believe you answered your own question, Tom.
I have always had doubts about forcing RX buffers to be 'close to the device'.
RPS clearly shows that the hard work is done on a CPU close to the application,
so it would make sense to allocate RX buffers on the local node of the CPU handling
the RX queue, i.e. not necessarily on the device NUMA node.
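As a rough illustration only (a minimal sketch, assuming the net-next-2.6-era
__alloc_skb(size, gfp_mask, fclone, node) signature; __netdev_alloc_skb_node()
and rx_refill_alloc() are hypothetical helpers, not existing kernel functions),
the change Tom describes amounts to passing the local node instead of the
device node when allocating the RX skb:

#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/topology.h>

/*
 * Sketch only: allocate an RX skb on an explicit NUMA node instead of the
 * device node.  __netdev_alloc_skb_node() is a hypothetical helper.
 */
static struct sk_buff *__netdev_alloc_skb_node(struct net_device *dev,
					       unsigned int length,
					       gfp_t gfp_mask, int node)
{
	struct sk_buff *skb;

	/* __alloc_skb() in this era takes a node hint as its last argument */
	skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, node);
	if (likely(skb)) {
		skb_reserve(skb, NET_SKB_PAD);
		skb->dev = dev;
	}
	return skb;
}

/* Driver RX refill path: allocate on the node of the CPU doing the refill. */
static struct sk_buff *rx_refill_alloc(struct net_device *dev, unsigned int bufsz)
{
	return __netdev_alloc_skb_node(dev, bufsz, GFP_ATOMIC, numa_node_id());
}
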
When a packet is transferred from the NIC to memory, the NUMA distance is paid
only once per cache line, with no CPU cache involved.
Then, when the packet is processed by the host CPUs, we might need many transfers,
because of TCP coalescing and copying to user space.
SLUB/SLAB also pays an extra cost when cross-node allocations/deallocations are performed.
Another point to look at is the vmalloc() that various drivers use at NIC initialization.
Module loading is performed with poor NUMA properties (all allocations are forced onto a single node).
For example, on this IXGBE adapter, all the working space (RX queue rings) is allocated
on node 0 on my dual-node machine (a per-node allocation sketch follows the grep output):
# grep ixgbe_setup_rx_resources /proc/vmallocinfo
0xffffc90006c67000-0xffffc90006c72000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006c73000-0xffffc90006c7e000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d02000-0xffffc90006d0d000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d0e000-0xffffc90006d19000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d1a000-0xffffc90006d25000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d26000-0xffffc90006d31000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d32000-0xffffc90006d3d000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d3e000-0xffffc90006d49000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d4a000-0xffffc90006d55000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d56000-0xffffc90006d61000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d62000-0xffffc90006d6d000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d6e000-0xffffc90006d79000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d7a000-0xffffc90006d85000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d86000-0xffffc90006d91000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d92000-0xffffc90006d9d000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006d9e000-0xffffc90006da9000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006e4a000-0xffffc90006e55000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006e56000-0xffffc90006e61000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006e62000-0xffffc90006e6d000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006e6e000-0xffffc90006e79000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006e7a000-0xffffc90006e85000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006e86000-0xffffc90006e91000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006e92000-0xffffc90006e9d000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006e9e000-0xffffc90006ea9000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006eaa000-0xffffc90006eb5000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006eb6000-0xffffc90006ec1000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006ec2000-0xffffc90006ecd000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006ece000-0xffffc90006ed9000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006eda000-0xffffc90006ee5000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006ee6000-0xffffc90006ef1000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006ef2000-0xffffc90006efd000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
0xffffc90006efe000-0xffffc90006f09000 45056 ixgbe_setup_rx_resources+0x45/0x1e0 [ixgbe] pages=10 vmalloc N0=10
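As a sketch only (assuming the driver can pick a per-queue node, e.g. the node of
the CPU that will service the queue's interrupt; alloc_ring_buffer_info() is a
hypothetical helper, not how ixgbe was written at the time), the buffer-info
array could be placed per node with vmalloc_node() instead of plain vmalloc():

#include <linux/vmalloc.h>
#include <linux/string.h>
#include <linux/topology.h>

/*
 * Sketch only: allocate a ring's buffer-info array on a chosen NUMA node.
 * Plain vmalloc() would instead land on the node doing module init.
 */
static void *alloc_ring_buffer_info(size_t count, size_t entry_size, int node)
{
	void *p = vmalloc_node(count * entry_size, node);

	if (p)
		memset(p, 0, count * entry_size);
	return p;
}
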
alloc_large_system_hash(), by contrast, has better NUMA properties, spreading large hash tables across all nodes.