Message-ID: <b5k5gan4nwyrjm5pm76oeyll55u2om2chnvx4tekgsw2y3zqpm@pa3e3hhfbpjf>
Date: Wed, 27 Aug 2025 07:08:18 +0000
From: Dragos Tatulea <dtatulea@...dia.com>
To: Christoph Paasch <cpaasch@...nai.com>,
Eric Dumazet <edumazet@...gle.com>
Cc: Saeed Mahameed <saeedm@...dia.com>, Leon Romanovsky <leon@...nel.org>,
Tariq Toukan <tariqt@...dia.com>, Mark Bloch <mbloch@...dia.com>,
Andrew Lunn <andrew+netdev@...n.ch>, "David S. Miller" <davem@...emloft.net>,
Jakub Kicinski <kuba@...nel.org>, Paolo Abeni <pabeni@...hat.com>,
Alexander Lobakin <aleksander.lobakin@...el.com>, Gal Pressman <gal@...dia.com>, linux-rdma@...r.kernel.org,
netdev@...r.kernel.org
Subject: Re: [PATCH net-next v3 2/2] net/mlx5: Avoid copying payload to the
skb's linear part
On Tue, Aug 26, 2025 at 01:31:44PM -0700, Christoph Paasch wrote:
> On Mon, Aug 25, 2025 at 11:38 PM Eric Dumazet <edumazet@...gle.com> wrote:
> >
> > On Mon, Aug 25, 2025 at 8:47 PM Christoph Paasch via B4 Relay
> > <devnull+cpaasch.openai.com@...nel.org> wrote:
> > >
> > > From: Christoph Paasch <cpaasch@...nai.com>
> > >
> > > mlx5e_skb_from_cqe_mpwrq_nonlinear() copies MLX5E_RX_MAX_HEAD (256)
> > > bytes from the page-pool to the skb's linear part. Those 256 bytes
> > > include part of the payload.
> > >
> > > When attempting to do GRO in skb_gro_receive, if headlen > data_offset
> > > (and skb->head_frag is not set), we end up aggregating packets in the
> > > frag_list.
> > >
> > > This is of course not good when we are CPU-limited. It also causes a
> > > worse skb->len/truesize ratio.
> > >
> > > So, let's avoid copying parts of the payload to the linear part. The
> > > goal here is to err on the side of caution and prefer to copy too little
> > > rather than too much (because once the payload has been copied over, we
> > > trigger the above-described behavior in skb_gro_receive).
> > >
> > > So, we can do a rough estimate of the header-space by looking at
> > > cqe_l3/l4_hdr_type. This is now done in mlx5e_cqe_estimate_hdr_len().
> > > We always assume that TCP timestamps are present, as that's the most common
> > > use-case.
> > >
> > > That header-len is then used in mlx5e_skb_from_cqe_mpwrq_nonlinear for
> > > the headlen (which defines what is being copied over). We still
> > > allocate MLX5E_RX_MAX_HEAD for the skb so that if the networking stack
> > > needs to call pskb_may_pull() later on, we don't need to reallocate
> > > memory.
> > >
> > > This gives a nice throughput increase (ARM Neoverse-V2 with CX-7 NIC and
> > > LRO enabled):
> > >
> > > BEFORE:
> > > =======
> > > (netserver pinned to core receiving interrupts)
> > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> > > 87380 16384 262144 60.01 32547.82
> > >
> > > (netserver pinned to adjacent core receiving interrupts)
> > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> > > 87380 16384 262144 60.00 52531.67
> > >
> > > AFTER:
> > > ======
> > > (netserver pinned to core receiving interrupts)
> > > $ netperf -H 10.221.81.118 -T 80,9 -P 0 -l 60 -- -m 256K -M 256K
> > > 87380 16384 262144 60.00 52896.06
> > >
> > > (netserver pinned to adjacent core receiving interrupts)
> > > $ netperf -H 10.221.81.118 -T 80,10 -P 0 -l 60 -- -m 256K -M 256K
> > > 87380 16384 262144 60.00 85094.90
> > >
> > > Additional tests across a larger range of parameters w/ and w/o LRO, w/
> > > and w/o IPv6-encapsulation, different MTUs (1500, 4096, 9000), different
> > > TCP read/write-sizes as well as UDP benchmarks, all have shown equal or
> > > better performance with this patch.
> > >
> > > Signed-off-by: Christoph Paasch <cpaasch@...nai.com>
> > > ---
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 49 ++++++++++++++++++++++++-
 1 file changed, 48 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > index b8c609d91d11bd315e8fb67f794a91bd37cd28c0..050f3efca34f3b8984c30f335ee43f487fef33ac 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > > @@ -1991,13 +1991,54 @@ mlx5e_shampo_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
> > >  	} while (data_bcnt);
> > >  }
> > >
> > > +static u16
> > > +mlx5e_cqe_estimate_hdr_len(const struct mlx5_cqe64 *cqe, u16 cqe_bcnt)
> > > +{
> > > +	u8 l3_type, l4_type;
> > > +	u16 hdr_len;
> > > +
> > > +	hdr_len = sizeof(struct ethhdr);
> > > +
> > > +	if (cqe_has_vlan(cqe))
> > > +		hdr_len += VLAN_HLEN;
> > > +
> > > +	l3_type = get_cqe_l3_hdr_type(cqe);
> > > +	if (l3_type == CQE_L3_HDR_TYPE_IPV4) {
> > > +		hdr_len += sizeof(struct iphdr);
> > > +	} else if (l3_type == CQE_L3_HDR_TYPE_IPV6) {
> > > +		hdr_len += sizeof(struct ipv6hdr);
> > > +	} else {
> > > +		hdr_len = MLX5E_RX_MAX_HEAD;
> > > +		goto out;
> > > +	}
> > > +
> > > +	l4_type = get_cqe_l4_hdr_type(cqe);
> > > +	if (l4_type == CQE_L4_HDR_TYPE_UDP) {
> > > +		hdr_len += sizeof(struct udphdr);
> > > +	} else if (l4_type & (CQE_L4_HDR_TYPE_TCP_NO_ACK |
> > > +			      CQE_L4_HDR_TYPE_TCP_ACK_NO_DATA |
> > > +			      CQE_L4_HDR_TYPE_TCP_ACK_AND_DATA)) {
> > > +		/* NO_ACK | ACK_NO_DATA | ACK_AND_DATA == 0x7, but
> > > +		 * the previous condition checks for _UDP which is 0x2.
> > > +		 *
> > > +		 * As we know that l4_type != 0x2, we can simply check
> > > +		 * if any of the bits of 0x7 is set.
> > > +		 */
> > > +		hdr_len += sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
> > > +	} else {
> > > +		hdr_len = MLX5E_RX_MAX_HEAD;
> > > +	}
> > > +
> > > +out:
> > > +	return min3(hdr_len, cqe_bcnt, MLX5E_RX_MAX_HEAD);
> > > +}
> > > +
> >
> > Hi Christoph
> >
> > I wonder if you have tried to use eth_get_headlen() instead of yet
> > another dissector ?
>
> I just tried eth_get_headlen() out - and indeed, no measurable perf difference.
>
> I will submit a new version.
>
What are the advantages of using eth_get_headlen() besides the fact that
it is more exhaustive? It seems quite expensive compared to reading some
bits in the CQE and doing a few comparisons. Even if this cost is amortized
by the benefits in the good cases, in the non-aggregation cases it seems
more costly. What am I missing here?
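
To make the cost comparison concrete, here is a rough standalone model of the
CQE-based estimate from the patch above. The struct sizes are hard-coded
(ethhdr 14, iphdr 20, ipv6hdr 40, udphdr 8, tcphdr 20, TCPOLEN_TSTAMP_ALIGNED
12) and the enum names are illustrative stand-ins, not the actual mlx5
definitions; it only shows that the estimate boils down to a handful of
compares and adds, clamped the way min3() clamps it in the driver:

```c
#include <stdint.h>

/* Simplified userspace model of mlx5e_cqe_estimate_hdr_len(). The real
 * function reads l3/l4 type bits out of the CQE; here they are passed in
 * directly. RX_MAX_HEAD stands in for MLX5E_RX_MAX_HEAD (256).
 */
#define RX_MAX_HEAD 256u

enum l3_kind { L3_IPV4, L3_IPV6, L3_OTHER };
enum l4_kind { L4_TCP, L4_UDP, L4_OTHER };

static uint16_t estimate_hdr_len(enum l3_kind l3, enum l4_kind l4,
				 int has_vlan, uint16_t cqe_bcnt)
{
	uint32_t hdr_len = 14;			/* sizeof(struct ethhdr) */

	if (has_vlan)
		hdr_len += 4;			/* VLAN_HLEN */

	if (l3 == L3_IPV4) {
		hdr_len += 20;			/* sizeof(struct iphdr) */
	} else if (l3 == L3_IPV6) {
		hdr_len += 40;			/* sizeof(struct ipv6hdr) */
	} else {
		hdr_len = RX_MAX_HEAD;		/* unknown L3: copy the max */
		goto out;
	}

	if (l4 == L4_UDP)
		hdr_len += 8;			/* sizeof(struct udphdr) */
	else if (l4 == L4_TCP)
		hdr_len += 20 + 12;		/* tcphdr + TCPOLEN_TSTAMP_ALIGNED */
	else
		hdr_len = RX_MAX_HEAD;

out:
	/* min3(): never copy more than the packet or the linear budget. */
	if (hdr_len > cqe_bcnt)
		hdr_len = cqe_bcnt;
	if (hdr_len > RX_MAX_HEAD)
		hdr_len = RX_MAX_HEAD;
	return (uint16_t)hdr_len;
}
```

For an untagged IPv4/TCP packet this yields 14 + 20 + 32 = 66 bytes copied
to the linear part instead of the full 256, while unknown protocols fall
back to copying MLX5E_RX_MAX_HEAD as before.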
Thanks,
Dragos