Message-ID: <4E4422EB.7060508@hp.com>
Date: Thu, 11 Aug 2011 11:43:55 -0700
From: Rick Jones <rick.jones2@...com>
To: "J.Hwan Kim" <frog1120@...il.com>
CC: netdev@...r.kernel.org
Subject: Re: Intel 82599 ixgbe driver performance
On 08/10/2011 06:57 PM, J.Hwan Kim wrote:
> On 2011-08-11 05:58, Rick Jones wrote:
>> On 08/09/2011 11:19 PM, J.Hwan Kim wrote:
>>> Hi, everyone
>>>
>>> I'm testing our network card, which is based on the Intel 82599 and
>>> the ixgbe driver. I wonder what the Rx performance of the 82599 is
>>> with 64-byte frames when the network stack is bypassed. Our driver
>>> reads packets directly from the DMA packet buffer and pushes them to
>>> the application without passing through the Linux kernel stack. It
>>> seems that the Intel 82599 cannot push 64B frames to the DMA area at
>>> 10G line rate. Is that right?
>>
>> Does your driver perform a copy of that 64B frame to user space?
> Our driver and the user application share the packet memory.
>
>> Is this a single-threaded test?
> Now 4 cores are running and 4 RX queues are used, each with interrupt
> affinity set, but the result is worse than with a single queue.
>> What does a lat_mem_rd -t (-t for random stride) test from lmbench
>> give for your system's memory latency? (Perhaps using numactl to
>> ensure local or remote memory access, as you desire.)
> ./lat_mem_rd -t 128
> "stride=64
>
> 0.00049 1.003
> 0.00098 1.003
> 0.00195 1.003
> 0.00293 1.003
> 0.00391 1.003
> 0.00586 1.003
> 0.00781 1.003
> 0.01172 1.003
> 0.01562 1.003
> 0.02344 1.003
> 0.03125 1.003
> 0.04688 5.293
> 0.06250 5.307
> 0.09375 5.571
> 0.12500 5.683
> 0.18750 5.683
> 0.25000 5.683
> 0.37500 16.394
> 0.50000 42.394
Unless the chip you are using has a rather tiny (by today's standards)
data cache, you need to go much farther there - I suspect that at 0.5 MB
you have not yet gotten beyond the size of the last level of data cache
on the chip.
I would suggest:
(from a system that is not otherwise idle...)
./lat_mem_rd -t 512 256
"stride=256
0.00049 1.237
0.00098 1.239
0.00195 1.228
0.00293 1.238
0.00391 1.243
0.00586 1.238
0.00781 1.250
0.01172 1.249
0.01562 1.251
0.02344 1.247
0.03125 1.247
0.04688 3.125
0.06250 3.153
0.09375 3.158
0.12500 3.177
0.18750 6.636
0.25000 8.729
0.37500 16.167
0.50000 16.901
0.75000 16.953
1.00000 17.362
1.50000 18.781
2.00000 20.243
3.00000 23.434
4.00000 24.965
6.00000 35.951
8.00000 56.026
12.00000 76.169
16.00000 80.741
24.00000 83.237
32.00000 84.043
48.00000 84.132
64.00000 83.775
96.00000 83.298
128.00000 83.039
192.00000 82.659
256.00000 82.464
384.00000 82.280
512.00000 82.092
You can see the large jump starting at 8MB - that is where the last
level cache runs out on the chip I'm using - an Intel W3550.
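If you want to confirm where your own chip's last level cache ends,
something along these lines should show it on a reasonably recent
Linux (the sysfs index assumed below for the last level may differ):

  # per-level cache sizes as the kernel reports them
  lscpu | grep -i cache
  # or read the last-level size directly (index3 assumed here to be L3)
  cat /sys/devices/system/cpu/cpu0/cache/index3/size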
Now, as run, that will include TLB miss overhead once the area of memory
being accessed is larger than can be mapped by the chip's TLB at the
page size being used. You can use libhugetlbfs to mitigate that through
the use of hugepages.
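For example, a rough, untested sketch - assuming libhugetlbfs is
installed and the kernel has hugepage support - would be something
like:

  # reserve some 2MB hugepages (512 of them, i.e. 1GB, enough for a 512MB run)
  echo 512 > /proc/sys/vm/nr_hugepages
  # re-run the latency test with its heap backed by hugepages
  hugectl --heap ./lat_mem_rd -t 512 256

or, without the hugectl wrapper, by preloading the library directly:

  LD_PRELOAD=libhugetlbfs.so HUGETLB_MORECORE=yes ./lat_mem_rd -t 512 256

That should take much of the TLB miss component out of the latencies
for the larger sizes.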
rick jones