Message-ID: <20201110120017.362f5ca5@carbon>
Date: Tue, 10 Nov 2020 12:00:17 +0100
From: Jesper Dangaard Brouer <brouer@...hat.com>
To: Lorenzo Bianconi <lorenzo@...nel.org>
Cc: netdev@...r.kernel.org, bpf@...r.kernel.org,
lorenzo.bianconi@...hat.com, davem@...emloft.net, kuba@...nel.org,
ilias.apalodimas@...aro.org, brouer@...hat.com
Subject: Re: [PATCH v4 net-next 2/5] net: page_pool: add bulk support for
ptr_ring
On Fri, 6 Nov 2020 19:19:08 +0100
Lorenzo Bianconi <lorenzo@...nel.org> wrote:
> Introduce the capability to batch page_pool ptr_ring refill since it is
> usually run inside the driver NAPI tx completion loop.
>
> Suggested-by: Jesper Dangaard Brouer <brouer@...hat.com>
> Signed-off-by: Lorenzo Bianconi <lorenzo@...nel.org>
> ---
> include/net/page_pool.h | 26 ++++++++++++++++
> net/core/page_pool.c | 66 ++++++++++++++++++++++++++++++++++-------
> net/core/xdp.c | 9 ++----
> 3 files changed, 84 insertions(+), 17 deletions(-)
My own suggestion to create __page_pool_put_page() for code sharing
causes a performance regression, because the compiler chooses not to
inline the function. This results in unnecessary code for the
in_serving_softirq() check plus the function call overhead.
I'm currently testing with __always_inline and likely/unlikely
annotations to get the compiler to lay out the code better for the
I-cache. This shows improvements in my benchmarks. Details in [1].
[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/xdp_bulk_return01.org
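
To make that concrete, this is roughly the direction I'm testing on top
of this patch (a sketch only, not a final patch: the function bodies
follow the hunks quoted below, while the __always_inline and the extra
likely()/unlikely() hints are my experiment):

static __always_inline struct page *
__page_pool_put_page(struct page_pool *pool, struct page *page,
		     unsigned int dma_sync_size, bool allow_direct)
{
	/* refcnt == 1 means page_pool owns page, and can recycle it */
	if (likely(page_ref_count(page) == 1 &&
		   pool_page_reusable(pool, page))) {
		if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
			page_pool_dma_sync_for_device(pool, page,
						      dma_sync_size);

		if (allow_direct && in_serving_softirq() &&
		    page_pool_recycle_in_cache(page, pool))
			return NULL;

		/* Page is a candidate for bulking */
		return page;
	}

	/* Fallback/non-XDP mode: API user has an elevated refcnt */
	page_pool_release_page(pool, page);
	put_page(page);
	return NULL;
}

void page_pool_put_page(struct page_pool *pool, struct page *page,
			unsigned int dma_sync_size, bool allow_direct)
{
	page = __page_pool_put_page(pool, page, dma_sync_size, allow_direct);
	if (page && unlikely(!page_pool_recycle_in_ring(pool, page))) {
		/* ptr_ring full: fallback, free the page */
		page_pool_return_page(pool, page);
	}
}
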
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index ef98372facf6..31dac2ad4a1f 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
[...]
> @@ -362,8 +364,9 @@ static bool pool_page_reusable(struct page_pool *pool, struct page *page)
> * If the page refcnt != 1, then the page will be returned to memory
> * subsystem.
> */
> -void page_pool_put_page(struct page_pool *pool, struct page *page,
> - unsigned int dma_sync_size, bool allow_direct)
> +static struct page *
> +__page_pool_put_page(struct page_pool *pool, struct page *page,
> + unsigned int dma_sync_size, bool allow_direct)
> {
> /* This allocator is optimized for the XDP mode that uses
> * one-frame-per-page, but have fallbacks that act like the
> @@ -379,15 +382,12 @@ void page_pool_put_page(struct page_pool *pool, struct page *page,
> page_pool_dma_sync_for_device(pool, page,
> dma_sync_size);
>
> - if (allow_direct && in_serving_softirq())
> - if (page_pool_recycle_in_cache(page, pool))
> - return;
> + if (allow_direct && in_serving_softirq() &&
> + page_pool_recycle_in_cache(page, pool))
> + return NULL;
>
> - if (!page_pool_recycle_in_ring(pool, page)) {
> - /* Cache full, fallback to free pages */
> - page_pool_return_page(pool, page);
> - }
> - return;
> + /* page is candidate for bulking */
> + return page;
> }
> /* Fallback/non-XDP mode: API user have elevated refcnt.
> *
> @@ -405,9 +405,55 @@ void page_pool_put_page(struct page_pool *pool, struct page *page,
> /* Do not replace this with page_pool_return_page() */
> page_pool_release_page(pool, page);
> put_page(page);
> +
> + return NULL;
> +}
> +
> +void page_pool_put_page(struct page_pool *pool, struct page *page,
> + unsigned int dma_sync_size, bool allow_direct)
> +{
> + page = __page_pool_put_page(pool, page, dma_sync_size, allow_direct);
> + if (page && !page_pool_recycle_in_ring(pool, page)) {
> + /* Cache full, fallback to free pages */
> + page_pool_return_page(pool, page);
> + }
> }
> EXPORT_SYMBOL(page_pool_put_page);
>
> +/* Caller must not use data area after call, as this function overwrites it */
> +void page_pool_put_page_bulk(struct page_pool *pool, void **data,
> + int count)
> +{
> + int i, bulk_len = 0, pa_len = 0;
> +
> + for (i = 0; i < count; i++) {
> + struct page *page = virt_to_head_page(data[i]);
> +
> + page = __page_pool_put_page(pool, page, -1, false);
> + /* Approved for bulk recycling in ptr_ring cache */
> + if (page)
> + data[bulk_len++] = page;
> + }
> +
> + if (!bulk_len)
> + return;
> +
> + /* Bulk producer into ptr_ring page_pool cache */
> + page_pool_ring_lock(pool);
> + for (i = 0; i < bulk_len; i++) {
> + if (__ptr_ring_produce(&pool->ring, data[i]))
> + data[pa_len++] = data[i];
> + }
> + page_pool_ring_unlock(pool);
> +
> + /* ptr_ring cache full, free pages outside producer lock since
> + * put_page() with refcnt == 1 can be an expensive operation
> + */
> + for (i = 0; i < pa_len; i++)
> + page_pool_return_page(pool, data[i]);
> +}
> +EXPORT_SYMBOL(page_pool_put_page_bulk);
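
For drivers wiring this up: the intended calling pattern is the NAPI TX
completion loop collecting the data pointers of completed frames and
flushing them with one page_pool_put_page_bulk() call, so the ptr_ring
producer lock is taken once per batch instead of once per page. A
hypothetical sketch (my_tx_clean() and MY_BULK_SIZE are made-up names
for illustration, not something in this series):

#define MY_BULK_SIZE	16

/* Hypothetical TX-completion path: 'frames' holds the data pointers of
 * the completed descriptors.  Note the bulk array contents may be
 * overwritten by page_pool_put_page_bulk().
 */
static void my_tx_clean(struct page_pool *pool, void **frames, int n)
{
	void *bulk[MY_BULK_SIZE];
	int i, bulk_len = 0;

	for (i = 0; i < n; i++) {
		bulk[bulk_len++] = frames[i];

		if (bulk_len == MY_BULK_SIZE) {
			page_pool_put_page_bulk(pool, bulk, bulk_len);
			bulk_len = 0;
		}
	}

	if (bulk_len)
		page_pool_put_page_bulk(pool, bulk, bulk_len);
}

The in-tree user added by this patch is the net/core/xdp.c change in
the diffstat above.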
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer