Message-ID: <CANn89iJ5brG-tSdyEPYH67BL1rkU5CKfvUO4Jc03twfVFKFPqQ@mail.gmail.com>
Date: Mon, 25 Aug 2025 23:38:05 -0700
From: Eric Dumazet <edumazet@...gle.com>
To: cpaasch@...nai.com
Cc: Saeed Mahameed <saeedm@...dia.com>, Leon Romanovsky <leon@...nel.org>, Tariq Toukan <tariqt@...dia.com>,
Mark Bloch <mbloch@...dia.com>, Andrew Lunn <andrew+netdev@...n.ch>,
"David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Alexander Lobakin <aleksander.lobakin@...el.com>, Gal Pressman <gal@...dia.com>,
Dragos Tatulea <dtatulea@...dia.com>, linux-rdma@...r.kernel.org, netdev@...r.kernel.org
Subject: Re: [PATCH net-next v3 2/2] net/mlx5: Avoid copying payload to the
skb's linear part
On Mon, Aug 25, 2025 at 8:47 PM Christoph Paasch via B4 Relay
<devnull+cpaasch.openai.com@...nel.org> wrote:
>
> From: Christoph Paasch <cpaasch@...nai.com>
>
> mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> bytes from the page-pool to the skb's linear part. Those 256 bytes
> include part of the payload.
>
> When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> (and skb->head_frag is not set), we end up aggregating packets in the
> frag_list.
>
> This is of course not good when we are CPU-limited. It also results in
> a worse skb->len/truesize ratio.
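
[ For illustration, a rough sketch of the choice skb_gro_receive() ends
  up making, per the description above; simplified pseudo-code, not the
  actual net/core/gro.c logic: ]

	if (skb_headlen(skb) <= skb_gro_offset(skb)) {
		/* Linear part holds only headers: payload frags can be
		 * merged straight into the GRO skb's frag array.
		 */
	} else if (skb->head_frag) {
		/* Linear part is page-backed: it can still be attached
		 * to the GRO skb as a fragment.
		 */
	} else {
		/* Payload was copied into a kmalloc'ed head: fall back
		 * to chaining the whole skb on the frag_list, the slow
		 * path described above.
		 */
	}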
>
> So, let's avoid copying parts of the payload to the linear part. The
> goal here is to err on the side of caution and prefer to copy too little
> instead of copying too much (because once it has been copied over, we
> trigger the above described behavior in skb_gro_receive).
>
> So, we can do a rough estimate of the header-space by looking at
> cqe_l3/l4_hdr_type. This is now done in mlx5e_cqe_estimate_hdr_len().
> We always assume that TCP timestamps are present, as that's the most common
> use-case.
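
[ For a plain IPv4/TCP flow this estimate works out to
  14 (ethhdr) + 20 (iphdr) + 20 (tcphdr) + 12 (TCPOLEN_TSTAMP_ALIGNED)
  = 66 bytes copied into the linear part, instead of the full 256. ]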
>
> That header-len is then used in mlx5e_skb_from_cqe_mpwrq_nonlinear for
> the headlen (which defines what is being copied over). We still
> allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking stack
> needs to call pskb_may_pull() later on, we don't need to reallocate
> memory.
>
> This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> LRO enabled):
>
> BEFORE:
> =======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.01 32547.82
>
> (netserver pinned to adjacent core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.00 52531.67
>
> AFTER:
> ======
> (netserver pinned to core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.00 52896.06
>
> (netserver pinned to adjacent core receiving interrupts)
> $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> 87380 16384 262144 60.00 85094.90
>
> Additional tests across a larger range of parameters w/ and w/o LRO, w/
> and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
> TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
> better performance with this patch.
>
> Signed-off-by: Christoph Paasch <cpaasch@...nai.com>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 49 ++++++++++++++++++++++++-
> 1 file changed, 48 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index b8c609d91d11bd315e8fb67f794a91bd37cd28c0..050f3efca34f3b8984c30f335ee43f487fef33ac 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -1991,13 +1991,54 @@ mlx5e_shampo_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
> } while (data_bcnt);
> }
>
> +static u16
> +mlx5e_cqe_estimate_hdr_len(const struct mlx5_cqe64 *cqe, u16 cqe_bcnt)
> +{
> +	u8 l3_type, l4_type;
> +	u16 hdr_len;
> +
> +	hdr_len = sizeof(struct ethhdr);
> +
> +	if (cqe_has_vlan(cqe))
> +		hdr_len += VLAN_HLEN;
> +
> +	l3_type = get_cqe_l3_hdr_type(cqe);
> +	if (l3_type == CQE_L3_HDR_TYPE_IPV4) {
> +		hdr_len += sizeof(struct iphdr);
> +	} else if (l3_type == CQE_L3_HDR_TYPE_IPV6) {
> +		hdr_len += sizeof(struct ipv6hdr);
> +	} else {
> +		hdr_len = MLX5E_RX_MAX_HEAD;
> +		goto out;
> +	}
> +
> +	l4_type = get_cqe_l4_hdr_type(cqe);
> +	if (l4_type == CQE_L4_HDR_TYPE_UDP) {
> +		hdr_len += sizeof(struct udphdr);
> +	} else if (l4_type & (CQE_L4_HDR_TYPE_TCP_NO_ACK |
> +			      CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA |
> +			      CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA)) {
> +		/* TCP_NO_ACK | TCP_ACK_NO_DATA | TCP_ACK_AND_DATA == 0x7,
> +		 * but the previous condition checks for _UDP which is 0x2.
> +		 *
> +		 * As we know that l4_type != 0x2, we can simply check
> +		 * if any of the bits of 0x7 is set.
> +		 */
> +		hdr_len += sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
> +	} else {
> +		hdr_len = MLX5E_RX_MAX_HEAD;
> +	}
> +
> +out:
> +	return min3(hdr_len, cqe_bcnt, MLX5E_RX_MAX_HEAD);
> +}
> +
Hi Christoph,

I wonder if you have tried using eth_get_headlen() instead of adding yet
another dissector? I doubt you would see a performance difference.
commit cfecec56ae7c7c40f23fbdac04acee027ca3bd66
Author: Eric Dumazet <edumazet@...gle.com>
Date:   Fri Sep 5 18:29:45 2014 -0700

    mlx4: only pull headers into skb head

    Use the new fancy eth_get_headlen() to pull exactly the headers
    into skb->head.

    This speeds up GRE traffic (or more generally tunneled traffic),
    as GRO can aggregate up to 17 MSS per GRO packet instead of 8.
    (Pulling too much data was forcing GRO to keep 2 frags per MSS)

    Signed-off-by: Eric Dumazet <edumazet@...gle.com>
    Cc: Amir Vadai <amirv@...lanox.com>
    Signed-off-by: David S. Miller <davem@...emloft.net>
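
Something along these lines could work (an untested sketch against
mlx5e_skb_from_cqe_mpwrq_nonlinear() from this patch; 'frame_start' is a
placeholder for wherever the frame begins in the head page, not a
variable from this series):

	/* Let the flow dissector derive the exact header length from the
	 * packet bytes already available in the head page, instead of
	 * estimating it from the CQE l3/l4 hints.
	 */
	headlen = eth_get_headlen(rq->netdev, frame_start,
				  min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt));
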
> static struct sk_buff *
> mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
> u32 page_idx)
> {
> struct mlx5e_frag_page *frag_page = &wi->alloc_units.frag_pages[page_idx];
> - u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
> struct mlx5e_frag_page *head_page = frag_page;
> struct mlx5e_xdp_buff *mxbuf = &rq->mxbuf;
> u32 frag_offset = head_offset;
> @@ -2009,6 +2050,7 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> u32 linear_frame_sz;
> u16 linear_data_len;
> u16 linear_hr;
> + u16 headlen;
> void *va;
>
> prog = rcu_dereference(rq->xdp_prog);
> @@ -2039,6 +2081,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> net_prefetchw(va); /* xdp_frame data area */
> net_prefetchw(skb->data);
>
> + headlen = mlx5e_cqe_estimate_hdr_len(cqe, cqe_bcnt);
> +
> frag_offset += headlen;
> byte_cnt -= headlen;
> linear_hr = skb_headroom(skb);
> @@ -2115,6 +2159,9 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> pagep->frags++;
> while (++pagep < frag_page);
> }
> +
> + headlen = mlx5e_cqe_estimate_hdr_len(cqe, cqe_bcnt);
> +
> __pskb_pull_tail(skb, headlen);
> } else {
> dma_addr_t addr;
>
> --
> 2.50.1
>
>