Message-ID: <20160405220647.GA95458@ast-mbp.thefacebook.com>
Date: Tue, 5 Apr 2016 15:06:49 -0700
From: Alexei Starovoitov <alexei.starovoitov@...il.com>
To: Jesper Dangaard Brouer <brouer@...hat.com>
Cc: Brenden Blanco <bblanco@...mgrid.com>,
John Fastabend <john.fastabend@...il.com>,
Tom Herbert <tom@...bertland.com>,
Daniel Borkmann <daniel@...earbox.net>,
"David S. Miller" <davem@...emloft.net>,
Linux Kernel Network Developers <netdev@...r.kernel.org>,
ogerlitz@...lanox.com
Subject: Re: [RFC PATCH 1/5] bpf: add PHYS_DEV prog type for early driver filter
On Tue, Apr 05, 2016 at 11:29:05AM +0200, Jesper Dangaard Brouer wrote:
> >
> > Of course, there are other pieces to accelerate:
> >  12.71%  ksoftirqd/1  [mlx4_en]         [k] mlx4_en_alloc_frags
> >   6.87%  ksoftirqd/1  [mlx4_en]         [k] mlx4_en_free_frag
> >   4.20%  ksoftirqd/1  [kernel.vmlinux]  [k] get_page_from_freelist
> >   4.09%  swapper      [mlx4_en]         [k] mlx4_en_process_rx_cq
> > and I think Jesper's work on batch allocation is going help that a lot.
>
> Actually, it looks like all of this "overhead" comes from the page
> alloc/free (+ dma unmap/map). We would need a page-pool recycle
> mechanism to solve/remove this overhead. For the early drop case we
> might be able to hack up recycling of the page directly in the driver
> (and also avoid the dma_unmap/map cycle).
Exactly. A cache of allocated and mapped pages will help a lot in both the
drop and redirect use cases. After tx completion we can recycle the
still-mapped page into the cache (we need to make sure to map pages
PCI_DMA_BIDIRECTIONAL), and rx can refill the ring from it. In the load
balancer steady state we won't make any calls to the page allocator or
the DMA API at all.
Being able to keep a cheap percpu pool like this is a huge advantage
that no kernel bypass can have. I'm pretty sure it will be
possible to avoid local_cmpxchg as well.