netdev - Re: [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the skb's linear part

Message-ID: <CAMB2axNT0rF_ToMcj9yagZE3VqHhQpB7MX=zSem5J1gyDqPJcw@mail.gmail.com>
Date: Wed, 3 Sep 2025 17:11:55 -0700
From: Amery Hung <ameryhung@...il.com>
To: Christoph Paasch <cpaasch@...nai.com>
Cc: Gal Pressman <gal@...dia.com>, Dragos Tatulea <dtatulea@...dia.com>, 
	Saeed Mahameed <saeedm@...dia.com>, Tariq Toukan <tariqt@...dia.com>, Mark Bloch <mbloch@...dia.com>, 
	Leon Romanovsky <leon@...nel.org>, Andrew Lunn <andrew+netdev@...n.ch>, 
	"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>, 
	Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>, Alexei Starovoitov <ast@...nel.org>, 
	Daniel Borkmann <daniel@...earbox.net>, Jesper Dangaard Brouer <hawk@...nel.org>, 
	John Fastabend <john.fastabend@...il.com>, Stanislav Fomichev <sdf@...ichev.me>, netdev@...r.kernel.org, 
	linux-rdma@...r.kernel.org, bpf@...r.kernel.org
Subject: Re: [PATCH net-next v4 2/2] net/mlx5: Avoid copying payload to the
 skb's linear part

On Wed, Sep 3, 2025 at 4:57 PM Christoph Paasch <cpaasch@...nai.com> wrote:
>
> On Wed, Sep 3, 2025 at 4:39 PM Amery Hung <ameryhung@...il.com> wrote:
> >
> >
> >
> > On 8/28/25 8:36 PM, Christoph Paasch via B4 Relay wrote:
> > > From: Christoph Paasch <cpaasch@...nai.com>
> > >
> > > mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> > > bytes from the page-pool to the skb's linear part. Those 256 bytes
> > > include part of the payload.
> > >
> > > When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> > > (and skb->head_frag is not set), we end up aggregating packets in the
> > > frag_list.
> > >
> > > > This is of course not good when we are CPU-limited. It also causes a
> > > > worse skb->len/truesize ratio,...
> > >
> > > So, let's avoid copying parts of the payload to the linear part. We use
> > > eth_get_headlen() to parse the headers and compute the length of the
> > > > protocol headers, which will be used to copy the relevant bits to the
> > > skb's linear part.
> > >
> > > We still allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking
> > > stack needs to call pskb_may_pull() later on, we don't need to reallocate
> > > memory.
> > >
> > > This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> > > LRO enabled):
> > >
> > > BEFORE:
> > > =======
> > > (netserver pinned to core receiving interrupts)
> > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> > >   87380  16384 262144    60.01    32547.82
> > >
> > > (netserver pinned to adjacent core receiving interrupts)
> > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> > >   87380  16384 262144    60.00    52531.67
> > >
> > > AFTER:
> > > ======
> > > (netserver pinned to core receiving interrupts)
> > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> > >   87380  16384 262144    60.00    52896.06
> > >
> > > (netserver pinned to adjacent core receiving interrupts)
> > > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> > >   87380  16384 262144    60.00    85094.90
> > >
> > > Additional tests across a larger range of parameters w/ and w/o LRO, w/
> > > and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
> > > TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
> > > better performance with this patch.
> > >
> > > Signed-off-by: Christoph Paasch <cpaasch@...nai.com>
> > > ---
> > >   drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 5 +++++
> > >   1 file changed, 5 insertions(+)
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > index 8bedbda522808cbabc8e62ae91a8c25d66725ebb..792bb647ba28668ad7789c328456e3609440455d 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > @@ -2047,6 +2047,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> > >               dma_sync_single_for_cpu(rq->pdev, addr + head_offset, headlen,
> > >                                       rq->buff.map_dir);
> > >
> > > +             headlen = eth_get_headlen(skb->dev, head_addr, headlen);
> > > +
> >
> > Hi,
> >
> > I am building on top of this patchset and got a kernel crash. It was
> > triggered by attaching an xdp program.
> >
> > I think the problem is skb->dev is still NULL here. It will be set later by:
> > mlx5e_complete_rx_cqe() -> mlx5e_build_rx_skb() -> eth_type_trans()
>
> Hmmm... Not sure what happened here...
> I'm almost certain I tested with xdp as well...
>
> I will try again later/tomorrow.
>

Here is the command that triggers the panic:

ip link set dev eth0 mtu 8000 xdp obj /root/ksft-net-drv/net/lib/xdp_native.bpf.o sec xdp.frags

and I should have attached the log:

[ 2851.287387] BUG: kernel NULL pointer dereference, address: 0000000000000100
[ 2851.301329] #PF: supervisor read access in kernel mode
[ 2851.311602] #PF: error_code(0x0000) - not-present page
[ 2851.321879] PGD 0 P4D 0
[ 2851.326944] Oops: Oops: 0000 [#1] SMP
[ 2851.334272] CPU: 11 UID: 0 PID: 0 Comm: swapper/11 Kdump: loaded Tainted: G S          E       6.17.0-rc1-gcf50ef415525 #305 NONE
[ 2851.357759] Tainted: [S]=CPU_OUT_OF_SPEC, [E]=UNSIGNED_MODULE
[ 2851.369252] Hardware name: Wiwynn Delta Lake MP/Delta Lake-Class1, BIOS Y3DL401 09/04/2024
[ 2851.385787] RIP: 0010:eth_get_headlen+0x16/0x90
[ 2851.394850] Code: 5e 41 5f 5d c3 b8 f2 ff ff ff eb f0 cc cc cc cc cc cc cc cc 0f 1f 44 00 00 41 56 53 48 83 ec 10 89 d3 83 fa 0e 72 68 49 89 f6 <48> 8b bf 00 01 00 00 44 0f b7 4e 0c c7 44 24 08 00 00 00 00 48 c7
[ 2851.432413] RSP: 0018:ffffc90000720cc8 EFLAGS: 00010212
[ 2851.442864] RAX: 0000000000000000 RBX: 000000000000008a RCX: 00000000000000a0
[ 2851.457141] RDX: 000000000000008a RSI: ffff8885a5aee100 RDI: 0000000000000000
[ 2851.471417] RBP: ffff8883d01f3900 R08: ffff888204c7c000 R09: 0000000000000000
[ 2851.485696] R10: ffff8883d01f3900 R11: ffff8885a5aee340 R12: ffff8885add00030
[ 2851.499969] R13: ffff8885add00030 R14: ffff8885a5aee100 R15: 0000000000000000
[ 2851.514245] FS:  0000000000000000(0000) GS:ffff8890b4427000(0000) knlGS:0000000000000000
[ 2851.530433] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2851.541931] CR2: 0000000000000100 CR3: 000000107d412003 CR4: 00000000007726f0
[ 2851.556208] PKRU: 55555554
[ 2851.561623] Call Trace:
[ 2851.566514]  <IRQ>
[ 2851.570540]  mlx5e_skb_from_cqe_mpwrq_nonlinear+0x7af/0x8d0
[ 2851.581689]  mlx5e_handle_rx_cqe_mpwrq+0xbc/0x180
[ 2851.591096]  mlx5e_poll_rx_cq+0x2ef/0x780
[ 2851.599114]  mlx5e_napi_poll+0x10c/0x710
[ 2851.606959]  __napi_poll+0x28/0x160
[ 2851.613934]  net_rx_action+0x1c0/0x350
[ 2851.621434]  ? mlx5_eq_comp_int+0xdf/0x190
[ 2851.629628]  ? sched_clock+0x5/0x10
[ 2851.636603]  ? sched_clock_cpu+0xc/0x170
[ 2851.644450]  handle_softirqs+0xd8/0x280
[ 2851.652121]  __irq_exit_rcu.llvm.7416059615185659459+0x44/0xd0
[ 2851.663788]  common_interrupt+0x85/0x90
[ 2851.671457]  </IRQ>
[ 2851.675653]  <TASK>
[ 2851.679850]  asm_common_interrupt+0x22/0x40

Thanks for taking a look!
Amery

> Thanks!
> Christoph
>
> >
> >
> > >               frag_offset += headlen;
> > >               byte_cnt -= headlen;
> > >               linear_hr = skb_headroom(skb);
> > > @@ -2123,6 +2125,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> > >                               pagep->frags++;
> > >                       while (++pagep < frag_page);
> > >               }
> > > +
> > > +             headlen = eth_get_headlen(skb->dev, mxbuf->xdp.data, headlen);
> > > +
> > >               __pskb_pull_tail(skb, headlen);
> > >       } else {
> > >               if (xdp_buff_has_frags(&mxbuf->xdp)) {
> > >
> >