Message-ID: <20190614110156.GA6754@apalos>
Date: Fri, 14 Jun 2019 14:01:56 +0300
From: Ilias Apalodimas <ilias.apalodimas@...aro.org>
To: Jesper Dangaard Brouer <brouer@...hat.com>
Cc: netdev@...r.kernel.org,
Toke Høiland-Jørgensen <toke@...e.dk>,
Tariq Toukan <tariqt@...lanox.com>, toshiaki.makita1@...il.com,
grygorii.strashko@...com, ivan.khoronzhuk@...aro.org,
mcroce@...hat.com
Subject: Re: [PATCH net-next v1 08/11] xdp: tracking page_pool resources and
safe removal
Hi Jesper,
Minor nitpicks mostly,
On Thu, Jun 13, 2019 at 08:28:42PM +0200, Jesper Dangaard Brouer wrote:
> This patch is needed before we can allow drivers to use page_pool for
> DMA-mappings. Today, with page_pool and the XDP return API, it is possible to
> remove the page_pool object (from the rhashtable) while there are still
> in-flight packet-pages. This is safely handled via RCU, and failed lookups in
> __xdp_return() fall back to calling put_page() when the page_pool object is
> gone. In case the page is still DMA mapped, this results in the page not
> getting correctly DMA unmapped.
>
> To solve this, the page_pool is extended to track in-flight pages, and the
> XDP disconnect system queries the page_pool and waits, via a workqueue, for
> all in-flight pages to be returned.
>
> To avoid killing performance when tracking in-flight pages, the
> implementation uses two (unsigned) counters, placed on different cache-lines,
> from which the number of in-flight packets can be deduced. This is done by
> mapping the unsigned "sequence" counters onto signed two's complement
> arithmetic operations. This is e.g. used by the kernel's time_after macros,
> described in kernel commits 1ba3aab3033b and 5a581b367b5, and also explained
> in RFC1982.
>
> The trick is that these two incrementing counters only need to be read and
> compared when checking whether it is safe to free the page_pool structure,
> which will only happen once the driver has disconnected the RX/alloc side.
> Thus, this is not on a fast-path.
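
For future readers, a quick stand-alone illustration (not part of the patch,
just a userspace sketch) of how the u32 wrap-around stays correct with the
s32 cast used by the _distance() macro further down in this patch:

#include <stdio.h>
#include <stdint.h>

/* Same trick as the _distance() macro in page_pool.c */
#define _distance(a, b) ((int32_t)((a) - (b)))

int main(void)
{
	uint32_t hold_cnt = 5;              /* has wrapped past UINT32_MAX */
	uint32_t release_cnt = 4294967290u; /* still just below the wrap */

	/* 5 - 4294967290 wraps to 11 as unsigned; the s32 cast keeps it
	 * positive, i.e. 11 pages are still in flight.
	 */
	printf("inflight = %d\n", _distance(hold_cnt, release_cnt));
	return 0;
}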
>
> It is chosen that page_pool tracking is also enabled for the non-DMA
> use-case, as this can be used for statistics later.
>
> After this patch, using page_pool requires more strict resource "release",
> e.g. via page_pool_release_page() that was introduced in this patchset, and
> previous patches implement/fix this more strict requirement.
>
> Drivers no longer call page_pool_destroy(). Drivers already call
> xdp_rxq_info_unreg(), which calls xdp_rxq_info_unreg_mem_model(), which will
> attempt to disconnect the mem id and, if the attempt fails, schedule the
> disconnect for later via a delayed workqueue.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@...hat.com>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 3 -
> include/net/page_pool.h | 41 ++++++++++---
> net/core/page_pool.c | 62 +++++++++++++++-----
> net/core/xdp.c | 65 +++++++++++++++++++--
> 4 files changed, 136 insertions(+), 35 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 2f647be292b6..6c9d4d7defbc 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -643,9 +643,6 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
> }
>
> xdp_rxq_info_unreg(&rq->xdp_rxq);
> - if (rq->page_pool)
> - page_pool_destroy(rq->page_pool);
> -
> mlx5_wq_destroy(&rq->wq_ctrl);
> }
>
> diff --git a/include/net/page_pool.h b/include/net/page_pool.h
> index 754d980700df..f09b3f1994e6 100644
> --- a/include/net/page_pool.h
> +++ b/include/net/page_pool.h
> @@ -16,14 +16,16 @@
> * page_pool_alloc_pages() call. Drivers should likely use
> * page_pool_dev_alloc_pages() replacing dev_alloc_pages().
> *
> - * If page_pool handles DMA mapping (use page->private), then API user
> - * is responsible for invoking page_pool_put_page() once. In-case of
> - * elevated refcnt, the DMA state is released, assuming other users of
> - * the page will eventually call put_page().
> + * The API keeps track of in-flight pages, in order to let the API
> + * user know when it is safe to deallocate the page_pool object. Thus,
> + * API users must make sure to call page_pool_release_page() when a
> + * page is "leaving" the page_pool, or call page_pool_put_page() where
> + * appropriate, in order to maintain correct accounting.
> *
> - * If no DMA mapping is done, then it can act as shim-layer that
> - * fall-through to alloc_page. As no state is kept on the page, the
> - * regular put_page() call is sufficient.
> + * API user must only call page_pool_put_page() once on a page, as it
> + * will either recycle the page, or in case of elevated refcnt, it
> + * will release the DMA mapping and in-flight state accounting. We
> + * hope to lift this requirement in the future.
> */
> #ifndef _NET_PAGE_POOL_H
> #define _NET_PAGE_POOL_H
> @@ -66,9 +68,10 @@ struct page_pool_params {
> };
>
> struct page_pool {
> - struct rcu_head rcu;
> struct page_pool_params p;
>
> + u32 pages_state_hold_cnt;
Maybe mention, for future editors, the different cache-line placement of
pages_state_hold_cnt and pages_state_release_cnt, as described in the commit
message?
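Purely as an illustration of the kind of comment/placement I mean (a hedged
sketch, not the actual struct from this patch; the names and the explicit
alignment annotation are made up):

#include <linux/cache.h>
#include <linux/atomic.h>

/* Hypothetical stand-alone example of documenting the counter split */
struct pp_inflight_counters {
	/* Written only from the RX/alloc side (slow path), so it can
	 * live together with the other alloc-side fields.
	 */
	u32 pages_state_hold_cnt;

	/* Incremented from the free side, potentially from any CPU
	 * returning pages, hence atomic and on its own cache-line.
	 */
	atomic_t pages_state_release_cnt ____cacheline_aligned_in_smp;
};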
> +
> /*
> * Data structure for allocation side
> *
> @@ -96,6 +99,8 @@ struct page_pool {
> * TODO: Implement bulk return pages into this structure.
> */
> struct ptr_ring ring;
> +
> + atomic_t pages_state_release_cnt;
As we discussed, this can change to per-cpu release counters if the atomics
end up hurting performance. I am fine with leaving this as-is and optimizing
if we spot any bottlenecks.
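Just to write down what I meant, in case we ever need it: an untested sketch
of per-cpu release counters (names made up, not something this patch needs):

#include <linux/percpu.h>
#include <linux/cpumask.h>

/* Hypothetical alternative: per-cpu release counters instead of a
 * single atomic. The free side only bumps its local counter; the
 * (slow) shutdown path sums them when computing in-flight pages.
 */
struct pp_release_pcpu {
	u32 __percpu *cnt;	/* from alloc_percpu(u32) at pool init */
};

static inline void pp_release_inc(struct pp_release_pcpu *r)
{
	this_cpu_inc(*r->cnt);
}

static u32 pp_release_sum(struct pp_release_pcpu *r)
{
	u32 sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(r->cnt, cpu);
	return sum;
}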
> };
>
> struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp);
> @@ -109,8 +114,6 @@ static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
>
> struct page_pool *page_pool_create(const struct page_pool_params *params);
>
> -void page_pool_destroy(struct page_pool *pool);
> -
> void __page_pool_free(struct page_pool *pool);
> static inline void page_pool_free(struct page_pool *pool)
> {
> @@ -143,6 +146,24 @@ static inline void page_pool_recycle_direct(struct page_pool *pool,
> __page_pool_put_page(pool, page, true);
> }
>
> +/* API user MUST have disconnected alloc-side (not allowed to call
> + * page_pool_alloc_pages()) before calling this. The free-side can
> + * still run concurrently, to handle in-flight packet-pages.
> + *
> + * A request to shutdown can fail (with false) if there are still
> + * in-flight packet-pages.
> + */
> +bool __page_pool_request_shutdown(struct page_pool *pool);
> +static inline bool page_pool_request_shutdown(struct page_pool *pool)
> +{
> + /* When page_pool isn't compiled-in, net/core/xdp.c doesn't
> + * allow registering MEM_TYPE_PAGE_POOL, but shield linker.
> + */
> +#ifdef CONFIG_PAGE_POOL
> + return __page_pool_request_shutdown(pool);
> +#endif
> +}
> +
> /* Disconnects a page (from a page_pool). API users can have a need
> * to disconnect a page (from a page_pool), to allow it to be used as
> * a regular page (that will eventually be returned to the normal
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 41391b5dc14c..8679e24fd665 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -43,6 +43,8 @@ static int page_pool_init(struct page_pool *pool,
> if (ptr_ring_init(&pool->ring, ring_qsize, GFP_KERNEL) < 0)
> return -ENOMEM;
>
> + atomic_set(&pool->pages_state_release_cnt, 0);
> +
> return 0;
> }
>
> @@ -151,6 +153,9 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> page->dma_addr = dma;
>
> skip_dma_map:
> + /* Track how many pages are held 'in-flight' */
> + pool->pages_state_hold_cnt++;
> +
> /* When page just alloc'ed is should/must have refcnt 1. */
> return page;
> }
> @@ -173,6 +178,33 @@ struct page *page_pool_alloc_pages(struct page_pool *pool, gfp_t gfp)
> }
> EXPORT_SYMBOL(page_pool_alloc_pages);
>
> +/* Calculate distance between two u32 values, valid if distance is below 2^(31)
> + * https://en.wikipedia.org/wiki/Serial_number_arithmetic#General_Solution
> + */
> +#define _distance(a, b) (s32)((a) - (b))
> +
> +static s32 page_pool_inflight(struct page_pool *pool)
> +{
> + u32 release_cnt = atomic_read(&pool->pages_state_release_cnt);
> + u32 hold_cnt = READ_ONCE(pool->pages_state_hold_cnt);
> + s32 distance;
> +
> + distance = _distance(hold_cnt, release_cnt);
> +
> + /* TODO: Add tracepoint here */
> + return distance;
> +}
> +
> +static bool __page_pool_safe_to_destroy(struct page_pool *pool)
> +{
> + s32 inflight = page_pool_inflight(pool);
> +
> + /* The distance should not be able to become negative */
> + WARN(inflight < 0, "Negative(%d) inflight packet-pages", inflight);
> +
> + return (inflight == 0);
> +}
> +
> /* Cleanup page_pool state from page */
> static void __page_pool_clean_page(struct page_pool *pool,
> struct page *page)
> @@ -180,7 +212,7 @@ static void __page_pool_clean_page(struct page_pool *pool,
> dma_addr_t dma;
>
> if (!(pool->p.flags & PP_FLAG_DMA_MAP))
> - return;
> + goto skip_dma_unmap;
>
> dma = page->dma_addr;
> /* DMA unmap */
> @@ -188,11 +220,16 @@ static void __page_pool_clean_page(struct page_pool *pool,
> PAGE_SIZE << pool->p.order, pool->p.dma_dir,
> DMA_ATTR_SKIP_CPU_SYNC);
> page->dma_addr = 0;
> +skip_dma_unmap:
> + atomic_inc(&pool->pages_state_release_cnt);
> }
>
> /* unmap the page and clean our state */
> void page_pool_unmap_page(struct page_pool *pool, struct page *page)
> {
> + /* When page is unmapped, this implies page will not be
> + * returned to page_pool.
> + */
> __page_pool_clean_page(pool, page);
> }
> EXPORT_SYMBOL(page_pool_unmap_page);
> @@ -201,6 +238,7 @@ EXPORT_SYMBOL(page_pool_unmap_page);
> static void __page_pool_return_page(struct page_pool *pool, struct page *page)
> {
> __page_pool_clean_page(pool, page);
> +
> put_page(page);
> /* An optimization would be to call __free_pages(page, pool->p.order)
> * knowing page is not part of page-cache (thus avoiding a
> @@ -296,24 +334,17 @@ void __page_pool_free(struct page_pool *pool)
> {
> WARN(pool->alloc.count, "API usage violation");
> WARN(!ptr_ring_empty(&pool->ring), "ptr_ring is not empty");
> + WARN(!__page_pool_safe_to_destroy(pool), "still in-flight pages");
>
> ptr_ring_cleanup(&pool->ring, NULL);
> kfree(pool);
> }
> EXPORT_SYMBOL(__page_pool_free);
>
> -static void __page_pool_destroy_rcu(struct rcu_head *rcu)
> -{
> - struct page_pool *pool;
> -
> - pool = container_of(rcu, struct page_pool, rcu);
> -
> - __page_pool_empty_ring(pool);
> - __page_pool_free(pool);
> -}
> -
> -/* Cleanup and release resources */
> -void page_pool_destroy(struct page_pool *pool)
> +/* Request to shutdown: release pages cached by page_pool, and check
> + * for in-flight pages
> + */
> +bool __page_pool_request_shutdown(struct page_pool *pool)
> {
> struct page *page;
>
> @@ -331,7 +362,6 @@ void page_pool_destroy(struct page_pool *pool)
> */
> __page_pool_empty_ring(pool);
>
> - /* An xdp_mem_allocator can still ref page_pool pointer */
> - call_rcu(&pool->rcu, __page_pool_destroy_rcu);
> + return __page_pool_safe_to_destroy(pool);
> }
> -EXPORT_SYMBOL(page_pool_destroy);
> +EXPORT_SYMBOL(__page_pool_request_shutdown);
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 179d90570afe..2b7bad227030 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -38,6 +38,7 @@ struct xdp_mem_allocator {
> };
> struct rhash_head node;
> struct rcu_head rcu;
> + struct delayed_work defer_wq;
> };
>
> static u32 xdp_mem_id_hashfn(const void *data, u32 len, u32 seed)
> @@ -79,13 +80,13 @@ static void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
>
> xa = container_of(rcu, struct xdp_mem_allocator, rcu);
>
> + /* Allocator has indicated it is safe to remove before this is called */
> + if (xa->mem.type == MEM_TYPE_PAGE_POOL)
> + page_pool_free(xa->page_pool);
> +
> /* Allow this ID to be reused */
> ida_simple_remove(&mem_id_pool, xa->mem.id);
>
> - /* Notice, driver is expected to free the *allocator,
> - * e.g. page_pool, and MUST also use RCU free.
> - */
> -
> /* Poison memory */
> xa->mem.id = 0xFFFF;
> xa->mem.type = 0xF0F0;
> @@ -94,6 +95,46 @@ static void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
> kfree(xa);
> }
>
> +bool __mem_id_disconnect(int id)
> +{
> + struct xdp_mem_allocator *xa;
> + bool safe_to_remove = true;
> +
> + mutex_lock(&mem_id_lock);
> +
> + xa = rhashtable_lookup_fast(mem_id_ht, &id, mem_id_rht_params);
> + if (!xa) {
> + mutex_unlock(&mem_id_lock);
> + WARN(1, "Request remove non-existing id(%d), driver bug?", id);
> + return true;
> + }
> +
> + /* Detects in-flight packet-pages for page_pool */
> + if (xa->mem.type == MEM_TYPE_PAGE_POOL)
> + safe_to_remove = page_pool_request_shutdown(xa->page_pool);
> +
> + if (safe_to_remove &&
> + !rhashtable_remove_fast(mem_id_ht, &xa->node, mem_id_rht_params))
> + call_rcu(&xa->rcu, __xdp_mem_allocator_rcu_free);
> +
> + mutex_unlock(&mem_id_lock);
> + return safe_to_remove;
> +}
> +
> +#define DEFER_TIME (msecs_to_jiffies(1000))
> +
> +static void mem_id_disconnect_defer_retry(struct work_struct *wq)
> +{
> + struct delayed_work *dwq = to_delayed_work(wq);
> + struct xdp_mem_allocator *xa = container_of(dwq, typeof(*xa), defer_wq);
> +
> + if (__mem_id_disconnect(xa->mem.id))
> + return;
> +
> + /* Still not ready to be disconnected, retry later */
> + schedule_delayed_work(&xa->defer_wq, DEFER_TIME);
> +}
> +
> void xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq)
Again, as we already discussed, the naming for destroying a registered
page_pool is mixed with xdp_ prefixes. That's fine, but we'll need to document
it properly once real drivers start using this (see the teardown sketch at the
end of this mail).
> {
> struct xdp_mem_allocator *xa;
> @@ -112,16 +153,28 @@ void xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq)
> if (id == 0)
> return;
>
> + if (__mem_id_disconnect(id))
> + return;
> +
> + /* Could not disconnect, defer new disconnect attempt to later */
> mutex_lock(&mem_id_lock);
>
> xa = rhashtable_lookup_fast(mem_id_ht, &id, mem_id_rht_params);
> - if (xa && !rhashtable_remove_fast(mem_id_ht, &xa->node, mem_id_rht_params))
> - call_rcu(&xa->rcu, __xdp_mem_allocator_rcu_free);
> + if (!xa) {
> + mutex_unlock(&mem_id_lock);
> + return;
> + }
>
> + INIT_DELAYED_WORK(&xa->defer_wq, mem_id_disconnect_defer_retry);
> mutex_unlock(&mem_id_lock);
> + schedule_delayed_work(&xa->defer_wq, DEFER_TIME);
> }
> EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg_mem_model);
>
> +/* This unregister operation will also cleanup and destroy the
> + * allocator. The page_pool_free() operation is first called when it's
> + * safe to remove, possibly deferred to a workqueue.
> + */
> void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
> {
> /* Simplify driver cleanup code paths, allow unreg "unused" */
>
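On the documentation point above (destroying a registered page_pool via the
xdp_ prefixed calls): a hedged sketch of what I imagine the documented driver
teardown would look like after this patch. The driver and struct names are
made up, modelled on the mlx5 hunk:

#include <net/xdp.h>
#include <net/page_pool.h>

/* Hypothetical driver RX ring, just enough for the example */
struct mydrv_rx_ring {
	struct xdp_rxq_info xdp_rxq;
	struct page_pool *page_pool;
};

static void mydrv_free_rx_ring(struct mydrv_rx_ring *ring)
{
	/* Ends up in xdp_rxq_info_unreg_mem_model(): the mem id is
	 * disconnected, and if in-flight pages remain the actual
	 * page_pool_free() is deferred to the delayed workqueue.
	 */
	xdp_rxq_info_unreg(&ring->xdp_rxq);

	/* Note: no page_pool_destroy()/page_pool_free() call here */
}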
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@...aro.org>