linux-kernel - [PATCH net v2] net/mlx5e: Make DEFAULT_FRAG

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID:
 <MN6PR16MB5450C5EC9A1B2E2E78E8B241B71AA@MN6PR16MB5450.namprd16.prod.outlook.com>
Date: Tue, 30 Sep 2025 19:33:11 +0800
From: Mingrui Cui <mingruic@...look.com>
To: dtatulea@...dia.com
Cc: andrew+netdev@...n.ch,
	davem@...emloft.net,
	edumazet@...gle.com,
	kuba@...nel.org,
	leon@...nel.org,
	linux-kernel@...r.kernel.org,
	linux-rdma@...r.kernel.org,
	mbloch@...dia.com,
	mingruic@...look.com,
	netdev@...r.kernel.org,
	pabeni@...hat.com,
	saeedm@...dia.com,
	tariqt@...dia.com
Subject: [PATCH net v2] net/mlx5e: Make DEFAULT_FRAG_SIZE relative to page size

When page size is 4K, DEFAULT_FRAG_SIZE of 2048 ensures that with 3
fragments per WQE, odd-indexed WQEs always share the same page with
their subsequent WQE, while WQEs consisting of 4 fragments does not.
However, this relationship does not hold for page sizes larger than 8K.
In this case, wqe_index_mask cannot guarantee that newly allocated WQEs
won't share the same page with old WQEs.

If the last WQE in a bulk processed by mlx5e_post_rx_wqes() shares a
page with its subsequent WQE, allocating a page for that WQE will
overwrite mlx5e_frag_page, preventing the original page from being
recycled. When the next WQE is processed, the newly allocated page will
be immediately recycled. In the next round, if these two WQEs are
handled in the same bulk, page_pool_defrag_page() will be called again
on the page, causing pp_frag_count to become negative[1].

Moreover, this can also lead to memory corruption, as the page may have
already been returned to the page pool and re-allocated to another WQE.
And since skb_shared_info is stored at the end of the first fragment,
its frags->bv_page pointer can be overwritten, leading to an invalid
memory access when processing the skb[2].

For example, on 8K page size systems (e.g. DEC Alpha) with a ConnectX-4
Lx MT27710 (MCX4121A-ACA_Ax) NIC setting MTU to 7657 or higher, heavy
network loads (e.g. iperf) will first trigger a series of WARNINGs[1]
and eventually crash[2].

Fix this by making DEFAULT_FRAG_SIZE always equal to half of the page
size.

[1]
WARNING: CPU: 9 PID: 0 at include/net/page_pool/helpers.h:130
mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
CPU: 9 PID: 0 Comm: swapper/9 Tainted: G        W          6.6.0
 walk_stackframe+0x0/0x190
 show_stack+0x70/0x94
 dump_stack_lvl+0x98/0xd8
 dump_stack+0x2c/0x48
 __warn+0x1c8/0x220
 warn_slowpath_fmt+0x20c/0x230
 mlx5e_page_release_fragmented.isra.0+0xdc/0xf0 [mlx5_core]
 mlx5e_free_rx_wqes+0xcc/0x120 [mlx5_core]
 mlx5e_post_rx_wqes+0x1f4/0x4e0 [mlx5_core]
 mlx5e_napi_poll+0x1c0/0x8d0 [mlx5_core]
 __napi_poll+0x58/0x2e0
 net_rx_action+0x1a8/0x340
 __do_softirq+0x2b8/0x480
 [...]

[2]
Unable to handle kernel paging request at virtual address 393837363534333a
Oops [#1]
CPU: 72 PID: 0 Comm: swapper/72 Tainted: G        W          6.6.0
Trace:
 walk_stackframe+0x0/0x190
 show_stack+0x70/0x94
 die+0x1d4/0x350
 do_page_fault+0x630/0x690
 entMM+0x120/0x130
 napi_pp_put_page+0x30/0x160
 skb_release_data+0x164/0x250
 kfree_skb_list_reason+0xd0/0x2f0
 skb_release_data+0x1f0/0x250
 napi_consume_skb+0xa0/0x220
 net_rx_action+0x158/0x340
 __do_softirq+0x2b8/0x480
 irq_exit+0xd4/0x120
 do_entInt+0x164/0x520
 entInt+0x114/0x120
 [...]

Fixes: 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme")
Signed-off-by: Mingrui Cui <mingruic@...look.com>
---
Changes in v2:
  - Add Fixes tag and more details to commit message.
  - Target 'net' branch.
  - Remove the obsolete WARN_ON() and update related comments.
Link to v1: https://lore.kernel.org/all/MN6PR16MB5450CAF432AE40B2AFA58F61B706A@MN6PR16MB5450.namprd16.prod.outlook.com/

 .../net/ethernet/mellanox/mlx5/core/en/params.c   | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
index 3cca06a74cf9..00b44da23e00 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/params.c
@@ -666,7 +666,7 @@ static void mlx5e_rx_compute_wqe_bulk_params(struct mlx5e_params *params,
 	info->refill_unit = DIV_ROUND_UP(info->wqe_bulk, split_factor);
 }
 
-#define DEFAULT_FRAG_SIZE (2048)
+#define DEFAULT_FRAG_SIZE (PAGE_SIZE / 2)
 
 static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
 				     struct mlx5e_params *params,
@@ -756,18 +756,13 @@ static int mlx5e_build_rq_frags_info(struct mlx5_core_dev *mdev,
 		/* No WQE can start in the middle of a page. */
 		info->wqe_index_mask = 0;
 	} else {
-		/* PAGE_SIZEs starting from 8192 don't use 2K-sized fragments,
-		 * because there would be more than MLX5E_MAX_RX_FRAGS of them.
-		 */
-		WARN_ON(PAGE_SIZE != 2 * DEFAULT_FRAG_SIZE);
-
 		/* Odd number of fragments allows to pack the last fragment of
 		 * the previous WQE and the first fragment of the next WQE into
 		 * the same page.
-		 * As long as DEFAULT_FRAG_SIZE is 2048, and MLX5E_MAX_RX_FRAGS
-		 * is 4, the last fragment can be bigger than the rest only if
-		 * it's the fourth one, so WQEs consisting of 3 fragments will
-		 * always share a page.
+		 * As long as DEFAULT_FRAG_SIZE is (PAGE_SIZE / 2), and
+		 * MLX5E_MAX_RX_FRAGS is 4, the last fragment can be bigger than
+		 * the rest only if it's the fourth one, so WQEs consisting of 3
+		 * fragments will always share a page.
 		 * When a page is shared, WQE bulk size is 2, otherwise just 1.
 		 */
 		info->wqe_index_mask = info->num_frags % 2;
-- 
2.43.0