Message-ID: <cyashxbmwfkcbqscgyudij5yzkso2twpswqcf7ev2h7a64n67k@zvl3ht2pxc2b>
Date: Wed, 1 Oct 2025 13:02:02 +0000
From: Dragos Tatulea <dtatulea@...dia.com>
To: Mingrui Cui <mingruic@...look.com>
Cc: andrew+netdev@...n.ch, davem@...emloft.net, edumazet@...gle.com, 
	kuba@...nel.org, leon@...nel.org, linux-kernel@...r.kernel.org, 
	linux-rdma@...r.kernel.org, mbloch@...dia.com, netdev@...r.kernel.org, pabeni@...hat.com, 
	saeedm@...dia.com, tariqt@...dia.com
Subject: Re: [PATCH net v2] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to
 page size

On Tue, Sep 30, 2025 at 07:33:11PM +0800, Mingrui Cui wrote:
> When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3
> fragments per WQE, odd-indexed WQEs always share the same page with
> their subsequent WQE, while WQEs consisting of 4 fragments do not.
> However, this relationship does not hold for page sizes of 8K or
> larger. In that case, wqe_index_mask cannot guarantee that newly
> allocated WQEs won't share a page with old WQEs.
> 
> If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a
> page with its subsequent WQE, allocating a page for that WQE will
> overwrite mlx5e_frag_page, preventing the original page from being
> recycled. When the next WQE is processed, the newly allocated page will
> be immediately recycled. In the next round, if these two WQEs are
> handled in the same bulk, page_pool_defrag_page() will be called again
> on the page, causing pp_frag_count to become negative[1].
> 
> Moreover, this can also lead to memory corruption, as the page may have
> already been returned to the page pool and re-allocated to another WQE.
> And since skb_shared_info is stored at the end of the first fragment,
> its frags->bv_page pointer can be overwritten, leading to an invalid
> memory access when processing the skb[2].
> 
> For example, on 8K page size systems (e.g. DEC Alpha) with a ConnectX-4
> Lx MT27710 (MCX4121A-ACA_Ax) NIC setting MTU to 7657 or higher, heavy
> network loads (e.g. iperf) will first trigger a series of WARNINGs[1]
> and eventually crash[2].
> 
> Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page
> size.
> 
> [1]
> WARNING: CPU: 9 PID: 0 at include/net/page_pool/helpers.h:130
> mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
> CPU: 9 PID: 0 Comm: swapper/9 Tainted: G        W          6.6.0
>  walk_stackframe+0x0/0x190
>  show_stack+0x70/0x94
>  dump_stack_lvl+0x98/0xd8
>  dump_stack+0x2c/0x48
>  __warn+0x1c8/0x220
>  warn_slowpath_fmt+0x20c/0x230
>  mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
>  mlx5e_free_rx_wqes+0xcc/0x120 [mlx5_core]
>  mlx5e_post_rx_wqes+0x1f4/0x4e0 [mlx5_core]
>  mlx5e_napi_poll+0x1c0/0x8d0 [mlx5_core]
>  __napi_poll+0x58/0x2e0
>  net_rx_action+0x1a8/0x340
>  __do_softirq+0x2b8/0x480
>  [...]
> 
> [2]
> Unable to handle kernel paging request at virtual address 393837363534333a
> Oops [#1]
> CPU: 72 PID: 0 Comm: swapper/72 Tainted: G        W          6.6.0
> Trace:
>  walk_stackframe+0x0/0x190
>  show_stack+0x70/0x94
>  die+0x1d4/0x350
>  do_page_fault+0x630/0x690
>  entMM+0x120/0x130
>  napi_pp_put_page+0x30/0x160
>  skb_release_data+0x164/0x250
>  kfree_skb_list_reason+0xd0/0x2f0
>  skb_release_data+0x1f0/0x250
>  napi_consume_skb+0xa0/0x220
>  net_rx_action+0x158/0x340
>  __do_softirq+0x2b8/0x480
>  irq_exit+0xd4/0x120
>  do_entInt+0x164/0x520
>  entInt+0x114/0x120
>  [...]
> 
> Fixes: 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme")
> Signed-off-by: Mingrui Cui <mingruic@...look.com>
> ---
> Changes in v2:
>   - Add Fixes tag and more details to commit message.
>   - Target 'net' branch.
>   - Remove the obsolete WARN_ON() and update related comments.
> Link to v1: https://lore.kernel.org/all/MN6PR16MB5450CAF432AE40B2AFA58F61B706A@MN6PR16MB5450.namprd16.prod.outlook.com/
> 
Thanks for your changes, Mingrui.

>  .../net/ethernet/mellanox/mlx5/core/en/params.c   | 15 +++++----------
>  1 file changed, 5 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> index 3cca06a74cf9..00b44da23e00 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
> @@ -666,7 +666,7 @@ static void mlx5e_rx_compute_wqe_bulk_params(struct mlx5e_params *params,
>  	info->refill_unit = DIV_ROUND_UP(info->wqe_bulk, split_factor);
>  }
>  
> -#define DEFAULT_FRAG_SIZE (2048)
> +#define DEFAULT_FRAG_SIZE (PAGE_SIZE / 2)
>
Reasoning my way through this code as I can't test it on 8K page size:
- 4K page size: nothing changes so looks good.
- 8K page size:
    - Smaller MTUs will be using linear SKB, so frags are not used.
      Looks good.
    - Larger MTUs will have a frag size of 4K. The number of frags is
      still below MLX5E_MAX_RX_FRAGS. This is what this patch fixes and
      it looks good.
- 16K page size or larger: as max HW MTU is somewhere in the 10-12K range
  all MTUs will result in linear SKBs. So looks good. But see below
  comment about keeping the warning.

For non-linear XDP, frag_size_max will always be PAGE_SIZE. So this
looks safe in all cases.

XSK with smaller chunk sizes is still an open question. For 16K page
size you could still get into non-linear mode.

One thing to note is that mlx5e_max_nonlinear_mtu() will now depend on
PAGE_SIZE, as frag_size_max and first_frag_size_max are now PAGE_SIZE
dependent. That wasn't the case previously. I think this is currently
not an issue, as this function is only used by
mlx5e_build_rq_frags_info().

>  static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
>  				     struct mlx5e_params *params,
> @@ -756,18 +756,13 @@ static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
>  		/* No WQE can start in the middle of a page. */
>  		info->wqe_index_mask = 0;
>  	} else {
> -		/* PAGE_SIZEs starting from 8192 don't use 2K-sized fragments,
> -		 * because there would be more than MLX5E_MAX_RX_FRAGS of them.
> -		 */
> -		WARN_ON(PAGE_SIZE != 2 * DEFAULT_FRAG_SIZE);
> -
I would still keep a warning when reaching this area with a page size
above 8K, just because that case was not tested.

>  		/* Odd number of fragments allows to pack the last fragment of
>  		 * the previous WQE and the first fragment of the next WQE into
>  		 * the same page.
> -		 * As long as DEFAULT_FRAG_SIZE is 2048, and MLX5E_MAX_RX_FRAGS
> -		 * is 4, the last fragment can be bigger than the rest only if
> -		 * it's the fourth one, so WQEs consisting of 3 fragments will
> -		 * always share a page.
> +		 * As long as DEFAULT_FRAG_SIZE is (PAGE_SIZE / 2), and
> +		 * MLX5E_MAX_RX_FRAGS is 4, the last fragment can be bigger than
> +		 * the rest only if it's the fourth one, so WQEs consisting of 3
> +		 * fragments will always share a page.
>  		 * When a page is shared, WQE bulk size is 2, otherwise just 1.
>  		 */
>  		info->wqe_index_mask = info->num_frags % 2;

Have you tried testing this with KASAN on? As your platform is not very
common, we want to make sure that there are no other issues.

Thanks,
Dragos
