netdev - Re: [PATCH RFC] net/mlx5e: avoid page pool frag counter underflow

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date: Fri, 29 Sep 2023 09:06:12 +0000
From: Dragos Tatulea <dtatulea@...dia.com>
To: "dw@...idwei.uk" <dw@...idwei.uk>, "netdev@...r.kernel.org"
	<netdev@...r.kernel.org>, "clm@...com" <clm@...com>, Saeed Mahameed
	<saeedm@...dia.com>, "kuba@...nel.org" <kuba@...nel.org>
CC: Tariq Toukan <tariqt@...dia.com>
Subject: Re: [PATCH RFC] net/mlx5e: avoid page pool frag counter underflow

Hi Chris,

On Thu, 2023-09-28 at 16:47 -0700, Chris Mason wrote:
> [ This is just an RFC because I've wandered pretty far from home and
> really don't know the code at hand.  The errors are real though, ENOMEM during
> mlx5e_refill_rx_wqes() leads to underflows and system instability ]
> 
> mlx5e_refill_rx_wqes() has roughly the following flow:
> 
> 1) mlx5e_free_rx_wqes()
> 2) mlx5e_alloc_rx_wqes()
> 
> We're doing bulk frees before refilling the frags in bulk, and under
> normal conditions this is all well balanced.  Every time we try
> to refill_rx_wqes, the first thing we do is free the existing ones.
> 
> But, if we get an ENOMEM from mlx5e_get_rx_frag(), we will have called
> mlx5e_free_rx_wqes() on a bunch of frags without refilling the pages for
> them.
> 
> mlx5e_page_release_fragmented() doesn't take any steps to remember that
> a given frag has been put through page_pool_defrag_page(), and so in the
> ENOMEM case, repeated calls to free_rx_wqes without corresponding
> allocations end up underflowing in page_pool_defrag_page()
> 
>         ret = atomic_long_sub_return(nr, &page->pp_frag_count);
>         WARN_ON(ret < 0);
> 
> Reproducing this just needs a memory hog driving the system into OOM and
> a heavy network rx load.
> 
Yeap, this is a problem. Thanks for finding this!

> My guess at a fix is to update our frag to make sure we don't send it
> through defrag more than once.  I've only lightly tested this, but it doesn't
> immediately crash on OOM anymore and doesn't seem to leak.
> 
> Fixes: 6f5742846053c7 ("net/mlx5e: RX, Enable skb page recycling through the
> page_pool")
> Signed-off-by: Chris Mason <clm@...com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 3fd11b0761e0..9a7b10f0bba9 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -298,6 +298,16 @@ static void mlx5e_page_release_fragmented(struct mlx5e_rq
> *rq,
>         u16 drain_count = MLX5E_PAGECNT_BIAS_MAX - frag_page->frags;
>         struct page *page = frag_page->page;
>  
> +       if (!page)
> +               return;
> +
Ideally we'd like to avoid this kind of broad check as it can hide other issues.

> +       /*
> +        * we're dropping all of our counts on this page, make sure we
> +        * don't do it again the next time we process this frag
> +        */
> +       frag_page->frags = 0;
> +       frag_page->page = NULL;
> +
>         if (page_pool_defrag_page(page, drain_count) == 0)
>                 page_pool_put_defragged_page(rq->page_pool, page, -1, true);
>  }

We already have a mechanism to avoid double releases: setting the
MLX5E_WQE_FRAG_SKIP_RELEASE bit on the mlx5e_wqe_frag_info flags parameter. When
mlx5e_alloc_rx_wqes fails we should set that bit on the remaining frag_pages.
This is for legacy rq mode, multi-packet wqe rq mode has to be handled as well
in a similar way.

If I send a patch later, would you be able to test it?

Thanks,
Dragos