Date:	Fri, 8 Apr 2016 09:12:26 -0700
From:	Alexander Duyck <alexander.duyck@...il.com>
To:	Jesper Dangaard Brouer <brouer@...hat.com>
Cc:	"Waskiewicz, PJ" <PJ.Waskiewicz@...app.com>,
	"lsf@...ts.linux-foundation.org" <lsf@...ts.linux-foundation.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>,
	"netdev@...r.kernel.org" <netdev@...r.kernel.org>,
	"bblanco@...mgrid.com" <bblanco@...mgrid.com>,
	"alexei.starovoitov@...il.com" <alexei.starovoitov@...il.com>,
	"James.Bottomley@...senPartnership.com" 
	<James.Bottomley@...senpartnership.com>,
	"tom@...bertland.com" <tom@...bertland.com>,
	"lsf-pc@...ts.linux-foundation.org" 
	<lsf-pc@...ts.linux-foundation.org>
Subject: Re: [Lsf] [LSF/MM TOPIC] Generic page-pool recycle facility?

On Thu, Apr 7, 2016 at 1:38 PM, Jesper Dangaard Brouer
<brouer@...hat.com> wrote:
> On Thu, 7 Apr 2016 19:48:50 +0000
> "Waskiewicz, PJ" <PJ.Waskiewicz@...app.com> wrote:
>
>> On Thu, 2016-04-07 at 16:17 +0200, Jesper Dangaard Brouer wrote:
>> > (Topic proposal for MM-summit)
>> >
>> > Network Interface Card (NIC) drivers, and increasing link speeds,
>> > stress the page allocator (and DMA APIs).  A number of driver-specific
>> > open-coded approaches exist that work around these bottlenecks in the
>> > page allocator and DMA APIs, e.g. open-coded recycle mechanisms, and
>> > allocating larger pages and handing out page "fragments".
>> >
>> > I'm proposing a generic page-pool recycle facility that can cover the
>> > driver use-cases, increase performance, and open the door to zero-copy
>> > RX.
>>
>> Is this based on the page recycle stuff from ixgbe that used to be in
>> the driver?  If so I'd really like to be part of the discussion.
>
> Okay, so it is not part of the driver any longer?  I've studied the
> current ixgbe driver (and other NIC drivers) closely.  Do you have some
> code pointers to this older code?

No, it is still in the driver.  I think when PJ said "used to" he was
referring to the fact that the code was present in the driver back
when he was working on it at Intel.

You have to realize that the page reuse code has been in the Intel
drivers for a long time.  I think I introduced it originally on igb in
July of 2008 as page recycling, commit bf36c1a0040c ("igb: add page
recycling support"), and it was copied over to ixgbe in September,
commit 762f4c571058 ("ixgbe: recycle pages in packet split mode").

> The likely-fastest recycle code I've seen is in the bnx2x driver.  If
> you are interested see bnx2x_reuse_rx_data().  Again, it is a bit of an
> open-coded producer/consumer ring queue (which would also be nice to
> clean up).

Yeah, that is essentially the same kind of code we have in
ixgbe_reuse_rx_page().  From what I can tell, though, the bnx2x doesn't
actually reuse the buffers in the common case.  That function is only
called in the copy-break and error cases to recycle the buffer so that
it doesn't have to be freed.
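
To make the comparison concrete, the flip-and-reuse idea is roughly
the following (a very simplified sketch, not the actual ixgbe code;
struct rx_buf and the helper name are made up for illustration):

/* Simplified sketch of the half-page flip/reuse trick (not the actual
 * driver code).  Each RX buffer covers half of a page; once the
 * received half has been handed up to the stack we flip the offset to
 * the other half and give the same page back to the hardware, as long
 * as no previously handed-out half is still alive in the stack.
 */
struct rx_buf {
	struct page *page;
	unsigned int page_offset;	/* 0 or PAGE_SIZE / 2 */
};

static bool try_reuse_rx_page(struct rx_buf *buf)
{
	struct page *page = buf->page;

	/* If the stack still holds a reference to the page from a
	 * previously handed-out half we cannot flip yet.
	 */
	if (page_count(page) != 1)
		return false;

	/* Flip to the other half and take a fresh reference; the one
	 * we held travels with the data we just handed up the stack.
	 */
	buf->page_offset ^= PAGE_SIZE / 2;
	get_page(page);
	return true;
}

In the common case the same page just ping-pongs between its two
halves, so the ring never has to go back to the page allocator at all.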

> To amortize the cost of allocating a single page, most other drivers
> use the trick of allocating a larger (compound) page and partitioning
> this page into smaller "fragments".  This also amortizes the cost of
> dma_map/unmap (important on non-x86).

Right.  The only reason why I went the reuse route instead of the
compound page route is that I had speculated that you could still
bottleneck yourself since the issue I was trying to avoid was the
dma_map call hitting a global lock on IOMMU-enabled systems.  With the
larger page route I could at best reduce the number of map calls to
1/16 or 1/32 of what it was.  By doing the page reuse I actually bring
it down to something approaching 0 as long as the buffers are being
freed in a reasonable timeframe.  This way the code would scale so I
wouldn't have to worry about how many rings were active at the same
time.
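
Put another way, what the reuse path buys you is that dma_map_page()
drops out of the hot path almost entirely; something along these lines
(again just a sketch with made-up names, building on the helper above):

/* Rough sketch of the refill path (not actual driver code).  The page
 * is DMA-mapped once when it first enters the ring; as long as it
 * keeps getting reused we only pay a dma_sync per packet and never
 * another dma_map_page(), which is the call that can hit the IOMMU's
 * global lock.
 */
static int refill_rx_slot(struct device *dev, struct rx_buf *buf,
			  dma_addr_t *dma)
{
	/* Common case: same page, same mapping, no map call at all;
	 * the caller keeps using the existing DMA address.
	 */
	if (try_reuse_rx_page(buf))
		return 0;

	/* Slow path: the stack kept the page, so allocate and map a
	 * fresh one.
	 */
	buf->page = alloc_page(GFP_ATOMIC);
	if (!buf->page)
		return -ENOMEM;

	buf->page_offset = 0;
	*dma = dma_map_page(dev, buf->page, 0, PAGE_SIZE, DMA_FROM_DEVICE);
	if (dma_mapping_error(dev, *dma)) {
		__free_page(buf->page);
		buf->page = NULL;
		return -ENOMEM;
	}
	return 0;
}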

As PJ can attest, we even saw bugs where the page reuse was actually
too effective in some cases, leading to us carrying memory from one
node to another when the interrupt was migrated.  That was why we had
to add the code to force us to free the page if it came from another
node.
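
The check itself is tiny, roughly along these lines (simplified, not
the literal driver code):

/* Only recycle pages that are local to the node we are currently
 * running on; anything else gets freed so the next allocation comes
 * from the local node again (simplified sketch).
 */
static bool rx_page_is_local(struct page *page)
{
	return page_to_nid(page) == numa_mem_id();
}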

> This is actually problematic performance-wise, because packet data
> (in these page fragments) only gets DMA_sync'ed, and is thus considered
> "read-only".  As the netstack needs to write packet headers, yet another
> (writable) memory area is allocated per packet (plus the SKB meta-data
> struct).

Have you done any actual testing with build_skb recently that shows
how much of a gain there is to be had?  I'm just curious as I know I
saw a gain back in the day, but back when I ran that test we didn't
have things like napi_alloc_skb running around, which should be a
pretty big win.  It might be useful to hack a driver such as ixgbe to
use build_skb and see if it is even worth the trouble to do it
properly.

Here is a patch I had generated back in 2013 to convert ixgbe over to
using build_skb, https://patchwork.ozlabs.org/patch/236044/.  You
might be able to update it to make it work against current ixgbe and
then come back to us with data on what the actual gain is.  My
thought is the gain should have significantly decreased since back in
the day as we optimized napi_alloc_skb to the point where I think the
only real difference is probably the memcpy to pull the headers from
the page.
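
For reference, the difference being measured boils down to roughly the
following two RX paths (simplified sketch, not driver code; 'hlen',
the truesize, and the reserve handling are placeholders, and the page
reference accounting is left out):

/* Path 1: napi_alloc_skb() plus copying the headers out of the page,
 * then attaching the rest of the page as a frag.  The memcpy here is
 * what I suspect is the only real cost left after the napi_alloc_skb
 * optimizations.
 */
static struct sk_buff *rx_copy_headers(struct napi_struct *napi,
				       struct page *page,
				       unsigned int offset,
				       unsigned int len)
{
	void *va = page_address(page) + offset;
	unsigned int hlen = min_t(unsigned int, len, 256);
	struct sk_buff *skb;

	skb = napi_alloc_skb(napi, hlen);
	if (!skb)
		return NULL;

	memcpy(skb_put(skb, hlen), va, hlen);
	if (len > hlen)
		skb_add_rx_frag(skb, 0, page, offset + hlen,
				len - hlen, PAGE_SIZE / 2);
	return skb;
}

/* Path 2: build_skb() wraps the skb directly around the buffer, so
 * there is no header copy, but the buffer has to be writable and must
 * leave room for NET_SKB_PAD up front plus skb_shared_info at the end
 * (assumed to be accounted for in the buffer layout here).
 */
static struct sk_buff *rx_build_skb(struct page *page,
				    unsigned int offset,
				    unsigned int len)
{
	void *va = page_address(page) + offset;
	struct sk_buff *skb;

	skb = build_skb(va, PAGE_SIZE / 2);
	if (!skb)
		return NULL;

	/* Assumes the hardware wrote the frame at va + NET_SKB_PAD. */
	skb_reserve(skb, NET_SKB_PAD);
	__skb_put(skb, len);
	return skb;
}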

- Alex
