Message-ID: <aJzUqg5m3sKPWDe0@boxer>
Date: Wed, 13 Aug 2025 20:08:42 +0200
From: Maciej Fijalkowski <maciej.fijalkowski@...el.com>
To: Jason Xing <kerneljasonxing@...il.com>
CC: <davem@...emloft.net>, <edumazet@...gle.com>, <kuba@...nel.org>,
<pabeni@...hat.com>, <horms@...nel.org>, <andrew+netdev@...n.ch>,
<anthony.l.nguyen@...el.com>, <przemyslaw.kitszel@...el.com>,
<sdf@...ichev.me>, <larysa.zaremba@...el.com>,
<intel-wired-lan@...ts.osuosl.org>, <netdev@...r.kernel.org>, Jason Xing
<kernelxing@...cent.com>
Subject: Re: [PATCH iwl-net v2 3/3] ixgbe: xsk: support batched xsk Tx
interfaces to increase performance
On Wed, Aug 13, 2025 at 08:34:52AM +0800, Jason Xing wrote:
> Hi Maciej,
>
> On Tue, Aug 12, 2025 at 11:42 PM Maciej Fijalkowski
> <maciej.fijalkowski@...el.com> wrote:
> >
> > On Tue, Aug 12, 2025 at 03:55:04PM +0800, Jason Xing wrote:
> > > From: Jason Xing <kernelxing@...cent.com>
> > >
> >
> > Hi Jason,
> >
> > patches should be targeted at iwl-next as these are improvements, not
> > fixes.
>
> Oh, right.
>
> >
> > > Like what the i40e driver initially did in commit 3106c580fb7cf
> > > ("i40e: Use batched xsk Tx interfaces to increase performance"), use
> > > the batched xsk feature to transmit packets.
> > >
> > > Signed-off-by: Jason Xing <kernelxing@...cent.com>
> > > ---
> > > In this version, I still chose to use the current implementation. Last
> > > time, at first glance, I agreed 'i' was useless, but it is not.
> > > https://lore.kernel.org/intel-wired-lan/CAL+tcoADu-ZZewsZzGDaL7NugxFTWO_Q+7WsLHs3Mx-XHjJnyg@mail.gmail.com/
> >
> > Care to share the performance improvement (if any, in the current form)?
>
> I tested the whole series; sorry, no actual improvement could be seen
> with xdpsock. Not even with the first series. :(
So if I were you I would hesitate to post it :P In the past, batching
approaches have always yielded a performance gain.
>
> >
> > Also, you have not mentioned in the v1->v2 changelog that you dropped the
> > setting of xdp_zc_max_segs, which is a step in the right direction.
>
> Oops, I blindly dropped the last patch without carefully checking it.
> Thanks for pointing it out.
>
> I set it to four for ixgbe. I'm not sure whether there is any theory
> behind choosing this value?
You're confusing two different things: xdp_zc_max_segs is related to
multi-buffer support in xsk ZC, whereas you're referring to the loop
unrolling counter.
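To spell out the difference (the values below are only placeholders, not a
recommendation):

	/* multi-buffer support: the max number of frags per zero-copy Tx
	 * packet, advertised by the driver on the net_device
	 */
	netdev->xdp_zc_max_segs = 4;

	/* loop unrolling: a driver-internal batching constant used when
	 * filling the Tx ring; it has nothing to do with multi-buffer
	 */
	#define PKTS_PER_BATCH 4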
>
> >
> > > ---
> > > drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 106 +++++++++++++------
> > > 1 file changed, 72 insertions(+), 34 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
> > > index f3d3f5c1cdc7..9fe2c4bf8bc5 100644
> > > --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
> > > +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
> > > @@ -2,12 +2,15 @@
> > > /* Copyright(c) 2018 Intel Corporation. */
> > >
> > > #include <linux/bpf_trace.h>
> > > +#include <linux/unroll.h>
> > > #include <net/xdp_sock_drv.h>
> > > #include <net/xdp.h>
> > >
> > > #include "ixgbe.h"
> > > #include "ixgbe_txrx_common.h"
> > >
> > > +#define PKTS_PER_BATCH 4
> > > +
> > > struct xsk_buff_pool *ixgbe_xsk_pool(struct ixgbe_adapter *adapter,
> > > struct ixgbe_ring *ring)
> > > {
> > > @@ -388,58 +391,93 @@ void ixgbe_xsk_clean_rx_ring(struct ixgbe_ring *rx_ring)
> > > }
> > > }
> > >
> > > -static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
> > > +static void ixgbe_set_rs_bit(struct ixgbe_ring *xdp_ring)
> > > +{
> > > + u16 ntu = xdp_ring->next_to_use ? xdp_ring->next_to_use - 1 : xdp_ring->count - 1;
> > > + union ixgbe_adv_tx_desc *tx_desc;
> > > +
> > > + tx_desc = IXGBE_TX_DESC(xdp_ring, ntu);
> > > + tx_desc->read.cmd_type_len |= cpu_to_le32(IXGBE_TXD_CMD_RS);
> >
> > You have not addressed the descriptor cleaning path, which makes this
> > change rather pointless, or may even leave the driver behavior broken.
>
> Are you referring to 'while (ntc != ntu) {}' in
> ixgbe_clean_xdp_tx_irq()? But I see no difference between that part
> and the similar part 'for (i = 0; i < completed_frames; i++) {}' in
> i40e_clean_xdp_tx_irq()
	if (likely(!tx_ring->xdp_tx_active)) {
		xsk_frames = completed_frames;
		goto skip;
	}
>
> >
> > The point of such a change is to limit the interrupts raised by HW once it
> > is done with sending the descriptors. You still walk the descs one-by-one
> > in ixgbe_clean_xdp_tx_irq().
>
> Sorry, I must be missing something important. In my view, ixgbe only kicks
> the hardware through ixgbe_xdp_ring_update_tail() at the end of
> ixgbe_xmit_zc(), both before and after this series.
>
> As to 'one-by-one', I see i40e also handles it like that in 'for (i = 0;
> i < completed_frames; i++)' in i40e_clean_xdp_tx_irq(). Ice does this
> in ice_clean_xdp_irq_zc()?
i40e does not look up the DD bit from the descriptor. Plus, the loop you
refer to is taken only when (see above) xdp_tx_active is not 0 (meaning that
there has been some XDP_TX action on the queue and we have to clean the
buffers in a different way).
In general I would advise looking at ice, as i40e writes back the Tx ring
head, which is used in its cleaning logic. ice does not have this feature,
and neither does ixgbe.
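Roughly, the ice-style cleaning checks the DD bit only on the descriptor
where RS was set and then treats everything up to it as completed. An
untested sketch adapted to ixgbe naming (the ixgbe_xsk_completed_frames()
helper is made up here; double-check the field/macro usage against the
driver):

	static u32 ixgbe_xsk_completed_frames(struct ixgbe_ring *tx_ring)
	{
		union ixgbe_adv_tx_desc *tx_desc;
		u16 ntc = tx_ring->next_to_clean;
		u16 cnt = tx_ring->count;
		u16 last_rs;

		/* RS was set only on the last descriptor written by xmit */
		last_rs = tx_ring->next_to_use ? tx_ring->next_to_use - 1 :
						 cnt - 1;
		tx_desc = IXGBE_TX_DESC(tx_ring, last_rs);

		/* a single DD check covers the whole batch up to last_rs */
		if (!(tx_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
			return 0;

		return last_rs >= ntc ? last_rs - ntc + 1 :
					last_rs + cnt - ntc + 1;
	}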
>
> Could you shed some light on this? Thanks in advance!
>
> Thanks,
> Jason
>
> >
> > > +}
> > > +
> > > +static void ixgbe_xmit_pkt(struct ixgbe_ring *xdp_ring, struct xdp_desc *desc,
> > > + int i)
> > > +
> > > {
> > > struct xsk_buff_pool *pool = xdp_ring->xsk_pool;
> > > union ixgbe_adv_tx_desc *tx_desc = NULL;
> > > struct ixgbe_tx_buffer *tx_bi;
> > > - struct xdp_desc desc;
> > > dma_addr_t dma;
> > > u32 cmd_type;
> > >
> > > - if (!budget)
> > > - return true;
> > > + dma = xsk_buff_raw_get_dma(pool, desc[i].addr);
> > > + xsk_buff_raw_dma_sync_for_device(pool, dma, desc[i].len);
> > >
> > > - while (likely(budget)) {
> > > - if (!netif_carrier_ok(xdp_ring->netdev))
> > > - break;
> > > + tx_bi = &xdp_ring->tx_buffer_info[xdp_ring->next_to_use];
> > > + tx_bi->bytecount = desc[i].len;
> > > + tx_bi->xdpf = NULL;
> > > + tx_bi->gso_segs = 1;
> > >
> > > - if (!xsk_tx_peek_desc(pool, &desc))
> > > - break;
> > > + tx_desc = IXGBE_TX_DESC(xdp_ring, xdp_ring->next_to_use);
> > > + tx_desc->read.buffer_addr = cpu_to_le64(dma);
> > >
> > > - dma = xsk_buff_raw_get_dma(pool, desc.addr);
> > > - xsk_buff_raw_dma_sync_for_device(pool, dma, desc.len);
> > > + cmd_type = IXGBE_ADVTXD_DTYP_DATA |
> > > + IXGBE_ADVTXD_DCMD_DEXT |
> > > + IXGBE_ADVTXD_DCMD_IFCS;
> > > + cmd_type |= desc[i].len | IXGBE_TXD_CMD_EOP;
> > > + tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type);
> > > + tx_desc->read.olinfo_status =
> > > + cpu_to_le32(desc[i].len << IXGBE_ADVTXD_PAYLEN_SHIFT);
> > >
> > > - tx_bi = &xdp_ring->tx_buffer_info[xdp_ring->next_to_use];
> > > - tx_bi->bytecount = desc.len;
> > > - tx_bi->xdpf = NULL;
> > > - tx_bi->gso_segs = 1;
> > > + xdp_ring->next_to_use++;
> > > +}
> > >
> > > - tx_desc = IXGBE_TX_DESC(xdp_ring, xdp_ring->next_to_use);
> > > - tx_desc->read.buffer_addr = cpu_to_le64(dma);
> > > +static void ixgbe_xmit_pkt_batch(struct ixgbe_ring *xdp_ring, struct xdp_desc *desc)
> > > +{
> > > + u32 i;
> > >
> > > - /* put descriptor type bits */
> > > - cmd_type = IXGBE_ADVTXD_DTYP_DATA |
> > > - IXGBE_ADVTXD_DCMD_DEXT |
> > > - IXGBE_ADVTXD_DCMD_IFCS;
> > > - cmd_type |= desc.len | IXGBE_TXD_CMD;
> > > - tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type);
> > > - tx_desc->read.olinfo_status =
> > > - cpu_to_le32(desc.len << IXGBE_ADVTXD_PAYLEN_SHIFT);
> > > + unrolled_count(PKTS_PER_BATCH)
> > > + for (i = 0; i < PKTS_PER_BATCH; i++)
> > > + ixgbe_xmit_pkt(xdp_ring, desc, i);
> > > +}
> > >
> > > - xdp_ring->next_to_use++;
> > > - if (xdp_ring->next_to_use == xdp_ring->count)
> > > - xdp_ring->next_to_use = 0;
> > > +static void ixgbe_fill_tx_hw_ring(struct ixgbe_ring *xdp_ring,
> > > + struct xdp_desc *descs, u32 nb_pkts)
> > > +{
> > > + u32 batched, leftover, i;
> > > +
> > > + batched = nb_pkts & ~(PKTS_PER_BATCH - 1);
> > > + leftover = nb_pkts & (PKTS_PER_BATCH - 1);
> > > + for (i = 0; i < batched; i += PKTS_PER_BATCH)
> > > + ixgbe_xmit_pkt_batch(xdp_ring, &descs[i]);
> > > + for (i = batched; i < batched + leftover; i++)
> > > + ixgbe_xmit_pkt(xdp_ring, &descs[i], 0);
> > > +}
> > >
> > > - budget--;
> > > - }
> > > +static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
> > > +{
> > > + struct xdp_desc *descs = xdp_ring->xsk_pool->tx_descs;
> > > + u32 nb_pkts, nb_processed = 0;
> > >
> > > - if (tx_desc) {
> > > - ixgbe_xdp_ring_update_tail(xdp_ring);
> > > - xsk_tx_release(pool);
> > > + if (!netif_carrier_ok(xdp_ring->netdev))
> > > + return true;
> > > +
> > > + nb_pkts = xsk_tx_peek_release_desc_batch(xdp_ring->xsk_pool, budget);
> > > + if (!nb_pkts)
> > > + return true;
> > > +
> > > + if (xdp_ring->next_to_use + nb_pkts >= xdp_ring->count) {
> > > + nb_processed = xdp_ring->count - xdp_ring->next_to_use;
> > > + ixgbe_fill_tx_hw_ring(xdp_ring, descs, nb_processed);
> > > + xdp_ring->next_to_use = 0;
> > > }
> > >
> > > - return !!budget;
> > > + ixgbe_fill_tx_hw_ring(xdp_ring, &descs[nb_processed], nb_pkts - nb_processed);
> > > +
> > > + ixgbe_set_rs_bit(xdp_ring);
> > > + ixgbe_xdp_ring_update_tail(xdp_ring);
> > > +
> > > + return nb_pkts < budget;
> > > }
> > >
> > > static void ixgbe_clean_xdp_tx_buffer(struct ixgbe_ring *tx_ring,
> > > --
> > > 2.41.3
> > >