[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <adb592ee-2f2f-4208-bd4d-0c41bb319ff1@nvidia.com>
Date: Wed, 11 Feb 2026 18:12:05 +0100
From: Dragos Tatulea <dtatulea@...dia.com>
To: Amery Hung <ameryhung@...il.com>, Christoph Paasch <cpaasch@...nai.com>
Cc: Gal Pressman <gal@...dia.com>, Saeed Mahameed <saeedm@...dia.com>,
Tariq Toukan <tariqt@...dia.com>, Mark Bloch <mbloch@...dia.com>,
Leon Romanovsky <leon@...nel.org>, Andrew Lunn <andrew+netdev@...n.ch>,
"David S. Miller" <davem@...emloft.net>, Eric Dumazet <edumazet@...gle.com>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Alexei Starovoitov <ast@...nel.org>, Daniel Borkmann <daniel@...earbox.net>,
Jesper Dangaard Brouer <hawk@...nel.org>,
John Fastabend <john.fastabend@...il.com>,
Stanislav Fomichev <sdf@...ichev.me>, netdev@...r.kernel.org,
linux-rdma@...r.kernel.org, bpf@...r.kernel.org
Subject: Re: [PATCH net-next v5 2/2] net/mlx5: Avoid copying payload to the
skb's linear part
Hi Christoph,
On 10.09.25 21:09, Amery Hung wrote:
> On Wed, Sep 10, 2025 at 1:36 PM Christoph Paasch <cpaasch@...nai.com> wrote:
[...]
>>>>>>> This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
>>>>>>> LRO enabled):
>>>>>>>
>>>>>>> BEFORE:
>>>>>>> =======
>>>>>>> (netserver pinned to core receiving interrupts)
>>>>>>> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
>>>>>>> 87380 16384 262144 60.01 32547.82
>>>>>>>
>>>>>>> (netserver pinned to adjacent core receiving interrupts)
>>>>>>> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
>>>>>>> 87380 16384 262144 60.00 52531.67
>>>>>>>
>>>>>>> AFTER:
>>>>>>> ======
>>>>>>> (netserver pinned to core receiving interrupts)
>>>>>>> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
>>>>>>> 87380 16384 262144 60.00 52896.06
>>>>>>>
>>>>>>> (netserver pinned to adjacent core receiving interrupts)
>>>>>>> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
>>>>>>> 87380 16384 262144 60.00 85094.90
>>>>>>>
We still want this nice optimization. Seems like last discussion was
stuck on reading the header length for XDP. See below for sugession.
[...]
>>>>>>> @@ -2123,6 +2125,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>>>>>>> pagep->frags++;
>>>>>>> while (++pagep < frag_page);
>>>>>>> }
>>>>>>> +
>>>>>>> + headlen = eth_get_headlen(rq->netdev, mxbuf->xdp.data, headlen);
>>>>>>> +
>>>>>>
>>>>>> The size of mxbuf->xdp.data is most likely not headlen here.
>>>>>>
>>>>>> The driver currently generates a xdp_buff with empty linear data, pass
>>>>>> it to the xdp program and assumes the layout If the xdp program does
>>>>>> not change the layout of the xdp_buff through bpf_xdp_adjust_head() or
>>>>>> bpf_xdp_adjust_tail(). The assumption is not correct and I am working
>>>>>> on a fix. But, if we keep that assumption for now, mxbuf->xdp.data
>>>>>> will not contain any headers or payload. The thing that you try to do
>>>>>> probably should be:
>>>>>>
>>>>>> skb_frag_t *frag = &sinfo->frags[0];
>>>>>>
>>>>>> headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag),
>>>>>> skb_frag_size(frag));
>>>>
>>>> So, when I look at the headlen I get, it is correct (even with my old
>>>> code using mxbuf->xdp.data).
>>>>
>>>> To make sure I test the right thing, which scenario would
>>>> mxbuf->xdp.data not contain any headers or payload ? What do I need to
>>>> do to reproduce that ?
>>>
>>> A quick look at the code, could it be that
>>> skb_flow_dissect_flow_keys_basic() returns false so that
>>> eth_get_headlen() always returns sizeof(*eth)?
>>
>> No, the headlen values were correct (meaning, it was the actual length
>> of the headers):
>>
>
> Another possibility is that mxbuf->xdp is reused and not zeroed
> between calls to mlx5e_skb_from_cqe_mpwrq_nonlinear(). The stale
> headers might have been written to mxbuf->xdp.data before the XDP is
> attached.
>
> I am not sure what exactly happens, but my main question remains. when
> the XDP program is attached and does nothing, the linear data will be
> empty, so what is eth_get_headlen() parsing here...?
>
AFAIU there could be 2 cases here:
1) XDP program doesn't modify the XDP buffer so the header is in the
first frag,
2) XDP program modifies the XDP buffer and pulls some len bytes into the
linear part.
So how about always reading headlen from first frag *before* the XDP
program execution. And afterwards calculate the new headlen for
__pskb_pull_tail() based on previous headlen - len? Like this (untested):
```
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -2159,8 +2159,12 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
if (prog) {
u8 nr_frags_free, old_nr_frags = sinfo->nr_frags;
+ skb_frag_t *frag = &sinfo->frags[0];
u32 len;
+ headlen = eth_get_headlen(rq->netdev, skb_frag_address(frag),
+ skb_frag_size(frag));
+
if (mlx5e_xdp_handle(rq, prog, mxbuf)) {
if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT,
rq->flags)) {
@@ -2208,8 +2212,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
pagep->frags++;
while (++pagep < frag_page);
- headlen = min_t(u16, MLX5E_RX_MAX_HEAD - len,
- skb->data_len);
+ headlen = min_t(u16, headlen - len, skb->data_len);
__pskb_pull_tail(skb, headlen);
}
} else {
```
Amery, do you agree?
Would you like to re-send the patch after the merge window closes?
If you are busy we can also pick it up ourselves, do the required
modifications and send it keeping your authorship and the existing tags.
Thanks,
Dragos
Powered by blists - more mailing lists