Message-ID: <50044F1D.6000703@hp.com>
Date: Mon, 16 Jul 2012 10:27:57 -0700
From: Rick Jones <rick.jones2@...com>
To: Thadeu Lima de Souza Cascardo <cascardo@...ux.vnet.ibm.com>
CC: "davem@...emloft.net" <davem@...emloft.net>,
"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
"yevgenyp@...lanox.co.il" <yevgenyp@...lanox.co.il>,
"ogerlitz@...lanox.com" <ogerlitz@...lanox.com>,
"amirv@...lanox.com" <amirv@...lanox.com>,
"brking@...ux.vnet.ibm.com" <brking@...ux.vnet.ibm.com>,
"leitao@...ux.vnet.ibm.com" <leitao@...ux.vnet.ibm.com>,
"klebers@...ux.vnet.ibm.com" <klebers@...ux.vnet.ibm.com>
Subject: Re: [PATCH] mlx4_en: map entire pages to increase throughput
On 07/16/2012 10:01 AM, Thadeu Lima de Souza Cascardo wrote:
> In its receive path, the mlx4_en driver maps each page chunk that it pushes
> to the hardware and unmaps it when pushing it up the stack. This limits
> throughput to about 3Gbps on a Power7 8-core machine.
That seems rather extraordinarily low - Power7 is supposed to be a
rather high performance CPU. The last time I noticed O(3Gbit/s) on 10G
for bulk transfer was before the advent of LRO/GRO - that was in the x86
space though. Is mapping really that expensive with Power7?
> One solution is to map the entire allocated page at once. However, this
> requires that we keep track of every page fragment we give to a
> descriptor. We also need to work with the discipline that all fragments will
> be released (in the sense that they will not be reused by the driver
> anymore) in the order they were allocated to the driver.
>
> This requires that we don't reuse any fragments; every single one of
> them must be reallocated. We do that by releasing all the fragments that
> are processed, and only after we have finished processing the descriptors
> do we start the refill.
>
> We also must somehow guarantee that we either refill all fragments in a
> descriptor or none at all, without resorting to giving up a page
> fragment that we would have already given. Otherwise, we would break the
> discipline of only releasing the fragments in the order they were
> allocated.
>
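The discipline described above can be sketched as a small userspace model. This is only an illustration under stated assumptions, not the driver's actual code: the names frag_pool, refill_desc, release_frag and pages_needed are hypothetical, the 4096-byte page and 1024-byte fragment sizes are made up, and the maps/unmaps counters merely stand in for the real DMA map/unmap calls. The all-or-nothing refill is modelled here by checking the page budget before taking any fragment, which is one possible way to meet the requirement rather than necessarily what the patch does.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_BYTES     4096u  /* assumed page size */
#define FRAG_BYTES     1024u  /* assumed fragment size (hypothetical) */
#define FRAGS_PER_PAGE (PAGE_BYTES / FRAG_BYTES)

/* Fragments are numbered in allocation order; page = id / FRAGS_PER_PAGE. */
struct frag_pool {
    unsigned long alloc_seq;   /* fragments handed to descriptors so far */
    unsigned long release_seq; /* fragments released so far, in order */
    unsigned page_budget;      /* pages that may still be allocated */
    unsigned maps;             /* whole-page mappings performed */
    unsigned unmaps;           /* whole-page unmappings performed */
};

/* How many new pages are needed to hand out nfrags more fragments. */
static unsigned pages_needed(const struct frag_pool *p, unsigned nfrags)
{
    unsigned used = (unsigned)(p->alloc_seq % FRAGS_PER_PAGE);
    unsigned room = used ? FRAGS_PER_PAGE - used : 0;

    if (nfrags <= room)
        return 0;
    return (nfrags - room + FRAGS_PER_PAGE - 1) / FRAGS_PER_PAGE;
}

/*
 * Refill one descriptor with nfrags fragments, all or nothing.
 * On failure the pool is left untouched, so no fragment that was
 * already handed out ever has to be given up out of order.
 */
static bool refill_desc(struct frag_pool *p, unsigned nfrags,
                        unsigned long *first_id)
{
    unsigned need = pages_needed(p, nfrags);

    if (need > p->page_budget)
        return false;          /* nothing was taken */
    p->page_budget -= need;
    p->maps += need;           /* each page is mapped once, as a whole */
    *first_id = p->alloc_seq;
    p->alloc_seq += nfrags;
    return true;
}

/* Release a fragment; the caller must release in allocation order. */
static void release_frag(struct frag_pool *p, unsigned long id)
{
    assert(p->release_seq < p->alloc_seq);
    assert(id == p->release_seq);
    p->release_seq++;
    /* The last fragment of a page is gone: unmap the whole page. */
    if (p->release_seq % FRAGS_PER_PAGE == 0)
        p->unmaps++;
}
```

With in-order release, a single counter per ring is enough to know when a page has no outstanding fragments, so one map and one unmap cover FRAGS_PER_PAGE fragments instead of one pair per fragment. The model does not cover teardown of a partially used page.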
> This has passed page allocation fault injections (restricted to the
> driver by using required-start and required-end) and device hotplug
> while 16 TCP streams were able to deliver more than 9Gbps.
What is the effect on packet-per-second performance? (e.g., aggregate,
burst-mode netperf TCP_RR with TCP_NODELAY set, or perhaps UDP_RR)
rick jones