Message-ID:
<MN6PR16MB54501D6839EF07C7C33FD3EAB7E8A@MN6PR16MB5450.namprd16.prod.outlook.com>
Date: Wed, 15 Oct 2025 21:02:16 +0800
From: Mingrui Cui <mingruic@...look.com>
To: dtatulea@...dia.com
Cc: andrew+netdev@...n.ch,
davem@...emloft.net,
edumazet@...gle.com,
kuba@...nel.org,
leon@...nel.org,
linux-kernel@...r.kernel.org,
linux-rdma@...r.kernel.org,
mbloch@...dia.com,
mingruic@...look.com,
netdev@...r.kernel.org,
pabeni@...hat.com,
saeedm@...dia.com,
tariqt@...dia.com
Subject: Re: [PATCH net v2] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size
Hi Dragos,
Thanks again for the feedback.
On Wed, 1 Oct 2025 13:02:02 +0000, Dragos Tatulea wrote:
> On Tue, Sep 30, 2025 at 07:33:11PM +0800, Mingrui Cui wrote:
> > When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3
> > fragments per WQE, odd-indexed WQEs always share the same page with
> > their subsequent WQE, while WQEs consisting of 4 fragments do not.
> > However, this relationship does not hold for page sizes larger than 8K.
> > In this case, wqe_index_mask cannot guarantee that newly allocated WQEs
> > won't share the same page with old WQEs.
> >
> > If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a
> > page with its subsequent WQE, allocating a page for that WQE will
> > overwrite mlx5e_frag_page, preventing the original page from being
> > recycled. When the next WQE is processed, the newly allocated page will
> > be immediately recycled. In the next round, if these two WQEs are
> > handled in the same bulk, page_pool_defrag_page() will be called again
> > on the page, causing pp_frag_count to become negative[1].
> >
> > Moreover, this can also lead to memory corruption, as the page may have
> > already been returned to the page pool and re-allocated to another WQE.
> > And since skb_shared_info is stored at the end of the first fragment,
> > its frags->bv_page pointer can be overwritten, leading to an invalid
> > memory access when processing the skb[2].
> >
> > For example, on 8K page size systems (e.g. DEC Alpha) with a ConnectX-4
> > Lx MT27710 (MCX4121A-ACA_Ax) NIC setting MTU to 7657 or higher, heavy
> > network loads (e.g. iperf) will first trigger a series of WARNINGs[1]
> > and eventually crash[2].
> >
> > Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page
> > size.
> >
> > [1]
> > WARNING: CPU: 9 PID: 0 at include/net/page_pool/helpers.h:130
> > mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
> > CPU: 9 PID: 0 Comm: swapper/9 Tainted: G W 6.6.0
> > walk_stackframe+0x0/0x190
> > show_stack+0x70/0x94
> > dump_stack_lvl+0x98/0xd8
> > dump_stack+0x2c/0x48
> > __warn+0x1c8/0x220
> > warn_slowpath_fmt+0x20c/0x230
> > mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
> > mlx5e_free_rx_wqes+0xcc/0x120 [mlx5_core]
> > mlx5e_post_rx_wqes+0x1f4/0x4e0 [mlx5_core]
> > mlx5e_napi_poll+0x1c0/0x8d0 [mlx5_core]
> > __napi_poll+0x58/0x2e0
> > net_rx_action+0x1a8/0x340
> > __do_softirq+0x2b8/0x480
> > [...]
> >
> > [2]
> > Unable to handle kernel paging request at virtual address 393837363534333a
> > Oops [#1]
> > CPU: 72 PID: 0 Comm: swapper/72 Tainted: G W 6.6.0
> > Trace:
> > walk_stackframe+0x0/0x190
> > show_stack+0x70/0x94
> > die+0x1d4/0x350
> > do_page_fault+0x630/0x690
> > entMM+0x120/0x130
> > napi_pp_put_page+0x30/0x160
> > skb_release_data+0x164/0x250
> > kfree_skb_list_reason+0xd0/0x2f0
> > skb_release_data+0x1f0/0x250
> > napi_consume_skb+0xa0/0x220
> > net_rx_action+0x158/0x340
> > __do_softirq+0x2b8/0x480
> > irq_exit+0xd4/0x120
> > do_entInt+0x164/0x520
> > entInt+0x114/0x120
> > [...]
> >
> > Fixes: 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme")
> > Signed-off-by: Mingrui Cui <mingruic@...look.com>
> > ---
> > Changes in v2:
> > - Add Fixes tag and more details to commit message.
> > - Target 'net' branch.
> > - Remove the obsolete WARN_ON() and update related comments.
> > Link to v1: https://lore.kernel.org/all/MN6PR16MB5450CAF432AE40B2AFA58F61B706A@MN6PR16MB5450.namprd16.prod.outlook.com/
> >
> Thanks for your changes Mingrui.
>
> > .../net/ethernet/mellanox/mlx5/core/en/params.c | 15 +++++----------
> > 1 file changed, 5 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> > index 3cca06a74cf9..00b44da23e00 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> > @@ -666,7 +666,7 @@ static void mlx5e_rx_compute_wqe_bulk_params(struct mlx5e_params *params,
> > info->refill_unit = DIV_ROUND_UP(info->wqe_bulk, split_factor);
> > }
> >
> > -#define DEFAULT_FRAG_SIZE (2048)
> > +#define DEFAULT_FRAG_SIZE (PAGE_SIZE / 2)
> >
> Reasoning my way through this code as I can't test it on 8K page size:
> - 4K page size: nothing changes so looks good.
> - 8K page size:
> - Smaller MTUs will be using linear SKB, so frags are not used.
> Looks good.
> - Larger MTUs will have a frag size of 4K. The number of frags is
> still below MLX5E_MAX_RX_FRAGS. This is what this patch fixes and
> it looks good.
> - 16K page size or larger: as max HW MTU is somewhere in the 10-12K range
> all MTUs will result in linear SKBs. So looks good. But see below
> comment about keeping the warning.
>
> For non-linear XDP, frag_size_max will always be PAGE_SIZE. So this
> looks safe in all cases.
>
> XSK with smaller chunk sizes is still an open question. For 16K page
> size you could still get in non-linear mode.
>
> One thing to note is that mlx5e_max_nonlinear_mtu() will now depend on
> PAGE_SIZE as frag_size_max and first_frag_size_max are now MTU
> dependent. It wasn't the case previously. I think this is currently not
> an issue as this function is used only by mlx5e_build_rq_frags_info().
>
> > static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
> > struct mlx5e_params *params,
> > @@ -756,18 +756,13 @@ static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
> > /* No WQE can start in the middle of a page. */
> > info->wqe_index_mask = 0;
> > } else {
> > - /* PAGE_SIZEs starting from 8192 don't use 2K-sized fragments,
> > - * because there would be more than MLX5E_MAX_RX_FRAGS of them.
> > - */
> > - WARN_ON(PAGE_SIZE != 2 * DEFAULT_FRAG_SIZE);
> > -
> I would still keep a warning when reaching this area with a page size
> above 8K just because it was not tested.
Understood. I will add WARN_ON(PAGE_SIZE > 8192) in the next patch as you suggested.
> > /* Odd number of fragments allows to pack the last fragment of
> > * the previous WQE and the first fragment of the next WQE into
> > * the same page.
> > - * As long as DEFAULT_FRAG_SIZE is 2048, and MLX5E_MAX_RX_FRAGS
> > - * is 4, the last fragment can be bigger than the rest only if
> > - * it's the fourth one, so WQEs consisting of 3 fragments will
> > - * always share a page.
> > + * As long as DEFAULT_FRAG_SIZE is (PAGE_SIZE / 2), and
> > + * MLX5E_MAX_RX_FRAGS is 4, the last fragment can be bigger than
> > + * the rest only if it's the fourth one, so WQEs consisting of 3
> > + * fragments will always share a page.
> > * When a page is shared, WQE bulk size is 2, otherwise just 1.
> > */
> > info->wqe_index_mask = info->num_frags % 2;
>
> Have you tried testing this with KASAN on? As your platform is not very
> common, we want to make sure that there are no other issues.
Unfortunately, the CPU architecture of my test platform does not support KASAN,
so I'm unable to run KASAN tests.
However, I ran a continuous iperf test on the patched kernel for one week.
During this long-duration test, no crashes, new warnings, or other instabilities
were observed. Are there any other tests you'd like me to run to confirm further?