Message-ID: <YiJdem4OCBKwLpYv@bismarck.dyn.berto.se>
Date:   Fri, 4 Mar 2022 19:42:02 +0100
From:   Niklas Söderlund 
        <niklas.soderlund@...igine.com>
To:     Maciej Fijalkowski <maciej.fijalkowski@...el.com>
Cc:     Simon Horman <simon.horman@...igine.com>,
        David Miller <davem@...emloft.net>,
        Jakub Kicinski <kuba@...nel.org>, netdev@...r.kernel.org,
        oss-drivers@...igine.com
Subject: Re: [PATCH net-next 5/5] nfp: xsk: add AF_XDP zero-copy Rx and Tx
 support

Hi Maciej,

Thanks for your feedback.

On 2022-03-04 14:53:30 +0100, Maciej Fijalkowski wrote:
> On Fri, Mar 04, 2022 at 11:22:14AM +0100, Simon Horman wrote:
> > From: Niklas Söderlund <niklas.soderlund@...igine.com>
> > 
> > This patch adds zero-copy Rx and Tx support for AF_XDP sockets. It
> > does so by adding a separate NAPI poll function that is attached to
> > each channel when an XSK socket is attached with XDP_SETUP_XSK_POOL,
> > and the original poll function is restored when the XSK socket is
> > terminated; this is done per channel.
> > 
> > Support for XDP_TX is implemented and the XDP buffer can safely be moved
> > from the Rx to the Tx queue and correctly freed and returned to the XSK
> > pool once it's transmitted.
> > 
> > Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
> > will allocate a new buffer and copy the zero-copy frame prior to
> > passing it to the kernel stack.
> > 
> > This patch is based on previous work by Jakub Kicinski.
> > 
> > Signed-off-by: Niklas Söderlund <niklas.soderlund@...igine.com>
> > Signed-off-by: Simon Horman <simon.horman@...igine.com>
> > ---
> >  drivers/net/ethernet/netronome/nfp/Makefile   |   1 +
> >  drivers/net/ethernet/netronome/nfp/nfp_net.h  |  31 +-
> >  .../ethernet/netronome/nfp/nfp_net_common.c   |  98 ++-
> >  .../ethernet/netronome/nfp/nfp_net_debugfs.c  |  33 +-
> >  .../net/ethernet/netronome/nfp/nfp_net_xsk.c  | 592 ++++++++++++++++++
> >  .../net/ethernet/netronome/nfp/nfp_net_xsk.h  |  29 +
> >  6 files changed, 756 insertions(+), 28 deletions(-)
> >  create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_net_xsk.c
> >  create mode 100644 drivers/net/ethernet/netronome/nfp/nfp_net_xsk.h
> 
> (...)
> 
> > +
> > +static void nfp_net_xsk_tx(struct nfp_net_tx_ring *tx_ring)
> > +{
> > +	struct nfp_net_r_vector *r_vec = tx_ring->r_vec;
> > +	struct xdp_desc desc[NFP_NET_XSK_TX_BATCH];
> > +	struct xsk_buff_pool *xsk_pool;
> > +	struct nfp_net_tx_desc *txd;
> > +	u32 pkts = 0, wr_idx;
> > +	u32 i, got;
> > +
> > +	xsk_pool = r_vec->xsk_pool;
> > +
> > +	while (nfp_net_tx_space(tx_ring) >= NFP_NET_XSK_TX_BATCH) {
> > +		for (i = 0; i < NFP_NET_XSK_TX_BATCH; i++)
> > +			if (!xsk_tx_peek_desc(xsk_pool, &desc[i]))
> > +				break;
> 
> Please use xsk_tx_peek_release_desc_batch() and avoid your own internal
> batching mechanism. We introduced the array of xdp_descs on the
> xsk_buff_pool side just for that purpose, so there's no need to have
> this on the stack here.
> 
> I suppose this will also simplify the ZC support.

This is a great suggestion. I will create a patch to use 
xsk_tx_peek_release_desc_batch() here.
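
Roughly what I have in mind (untested sketch; it assumes the pool-side
tx_descs array and the two-argument xsk_tx_peek_release_desc_batch()
signature you mention):

static void nfp_net_xsk_tx(struct nfp_net_tx_ring *tx_ring)
{
	struct nfp_net_r_vector *r_vec = tx_ring->r_vec;
	struct xsk_buff_pool *xsk_pool = r_vec->xsk_pool;
	struct nfp_net_tx_desc *txd;
	struct xdp_desc *descs;
	u32 pkts, wr_idx, i;

	/* One call replaces the local batching: descriptors are peeked
	 * and released in bulk and land in the pool-side tx_descs array.
	 */
	pkts = xsk_tx_peek_release_desc_batch(xsk_pool,
					      nfp_net_tx_space(tx_ring));
	if (!pkts)
		return;

	descs = xsk_pool->tx_descs;

	prefetchw(&tx_ring->txds[D_IDX(tx_ring, tx_ring->wr_p)]);

	for (i = 0; i < pkts; i++) {
		wr_idx = D_IDX(tx_ring, tx_ring->wr_p + i);

		xsk_buff_raw_dma_sync_for_device(xsk_pool, descs[i].addr,
						 descs[i].len);

		tx_ring->txbufs[wr_idx].real_len = descs[i].len;
		tx_ring->txbufs[wr_idx].is_xsk_tx = false;

		/* Build TX descriptor. */
		txd = &tx_ring->txds[wr_idx];
		nfp_desc_set_dma_addr(txd,
				      xsk_buff_raw_get_dma(xsk_pool,
							   descs[i].addr));
		txd->offset_eop = PCIE_DESC_TX_EOP;
		txd->dma_len = cpu_to_le16(descs[i].len);
		txd->data_len = cpu_to_le16(descs[i].len);
	}

	tx_ring->wr_p += pkts;

	/* Ensure all records are visible before incrementing write counter. */
	wmb();
	nfp_qcp_wr_ptr_add(tx_ring->qcp_q, pkts);
}

If I read the API right, the explicit xsk_tx_release() goes away as well,
since the batch variant releases the descriptors internally.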

> 
> Dave, could you give us some time for review? It was on the list for
> only a few hours or so and I see it's already applied :<
> 
> > +		got = i;
> > +		if (!got)
> > +			break;
> > +
> > +		wr_idx = D_IDX(tx_ring, tx_ring->wr_p + i);
> > +		prefetchw(&tx_ring->txds[wr_idx]);
> > +
> > +		for (i = 0; i < got; i++)
> > +			xsk_buff_raw_dma_sync_for_device(xsk_pool, desc[i].addr,
> > +							 desc[i].len);
> > +
> > +		for (i = 0; i < got; i++) {
> > +			wr_idx = D_IDX(tx_ring, tx_ring->wr_p + i);
> > +
> > +			tx_ring->txbufs[wr_idx].real_len = desc[i].len;
> > +			tx_ring->txbufs[wr_idx].is_xsk_tx = false;
> > +
> > +			/* Build TX descriptor. */
> > +			txd = &tx_ring->txds[wr_idx];
> > +			nfp_desc_set_dma_addr(txd,
> > +					      xsk_buff_raw_get_dma(xsk_pool,
> > +								   desc[i].addr
> > +								   ));
> > +			txd->offset_eop = PCIE_DESC_TX_EOP;
> > +			txd->dma_len = cpu_to_le16(desc[i].len);
> > +			txd->data_len = cpu_to_le16(desc[i].len);
> > +		}
> > +
> > +		tx_ring->wr_p += got;
> > +		pkts += got;
> > +	}
> > +
> > +	if (!pkts)
> > +		return;
> > +
> > +	xsk_tx_release(xsk_pool);
> > +	/* Ensure all records are visible before incrementing write counter. */
> > +	wmb();
> > +	nfp_qcp_wr_ptr_add(tx_ring->qcp_q, pkts);
> > +}
> > +
> > +static bool
> > +nfp_net_xsk_tx_xdp(const struct nfp_net_dp *dp, struct nfp_net_r_vector *r_vec,
> > +		   struct nfp_net_rx_ring *rx_ring,
> > +		   struct nfp_net_tx_ring *tx_ring,
> > +		   struct nfp_net_xsk_rx_buf *xrxbuf, unsigned int pkt_len,
> > +		   int pkt_off)
> > +{
> > +	struct xsk_buff_pool *pool = r_vec->xsk_pool;
> > +	struct nfp_net_tx_buf *txbuf;
> > +	struct nfp_net_tx_desc *txd;
> > +	unsigned int wr_idx;
> > +
> > +	if (nfp_net_tx_space(tx_ring) < 1)
> > +		return false;
> > +
> > +	xsk_buff_raw_dma_sync_for_device(pool, xrxbuf->dma_addr + pkt_off, pkt_len);
> > +
> > +	wr_idx = D_IDX(tx_ring, tx_ring->wr_p);
> > +
> > +	txbuf = &tx_ring->txbufs[wr_idx];
> > +	txbuf->xdp = xrxbuf->xdp;
> > +	txbuf->real_len = pkt_len;
> > +	txbuf->is_xsk_tx = true;
> > +
> > +	/* Build TX descriptor */
> > +	txd = &tx_ring->txds[wr_idx];
> > +	txd->offset_eop = PCIE_DESC_TX_EOP;
> > +	txd->dma_len = cpu_to_le16(pkt_len);
> > +	nfp_desc_set_dma_addr(txd, xrxbuf->dma_addr + pkt_off);
> > +	txd->data_len = cpu_to_le16(pkt_len);
> > +
> > +	txd->flags = 0;
> > +	txd->mss = 0;
> > +	txd->lso_hdrlen = 0;
> > +
> > +	tx_ring->wr_ptr_add++;
> > +	tx_ring->wr_p++;
> > +
> > +	return true;
> > +}
> > +
> > +static int nfp_net_rx_space(struct nfp_net_rx_ring *rx_ring)
> > +{
> > +	return rx_ring->cnt - rx_ring->wr_p + rx_ring->rd_p - 1;
> > +}
> > +
> > +static void
> > +nfp_net_xsk_rx_bufs_stash(struct nfp_net_rx_ring *rx_ring, unsigned int idx,
> > +			  struct xdp_buff *xdp)
> > +{
> > +	unsigned int headroom;
> > +
> > +	headroom = xsk_pool_get_headroom(rx_ring->r_vec->xsk_pool);
> > +
> > +	rx_ring->rxds[idx].fld.reserved = 0;
> > +	rx_ring->rxds[idx].fld.meta_len_dd = 0;
> > +
> > +	rx_ring->xsk_rxbufs[idx].xdp = xdp;
> > +	rx_ring->xsk_rxbufs[idx].dma_addr =
> > +		xsk_buff_xdp_get_frame_dma(xdp) + headroom;
> > +}
> > +
> > +static void nfp_net_xsk_rx_unstash(struct nfp_net_xsk_rx_buf *rxbuf)
> > +{
> > +	rxbuf->dma_addr = 0;
> > +	rxbuf->xdp = NULL;
> > +}
> > +
> > +static void nfp_net_xsk_rx_free(struct nfp_net_xsk_rx_buf *rxbuf)
> > +{
> > +	if (rxbuf->xdp)
> > +		xsk_buff_free(rxbuf->xdp);
> > +
> > +	nfp_net_xsk_rx_unstash(rxbuf);
> > +}
> > +
> > +void nfp_net_xsk_rx_bufs_free(struct nfp_net_rx_ring *rx_ring)
> > +{
> > +	unsigned int i;
> > +
> > +	if (!rx_ring->cnt)
> > +		return;
> > +
> > +	for (i = 0; i < rx_ring->cnt - 1; i++)
> > +		nfp_net_xsk_rx_free(&rx_ring->xsk_rxbufs[i]);
> > +}
> > +
> > +void nfp_net_xsk_rx_ring_fill_freelist(struct nfp_net_rx_ring *rx_ring)
> > +{
> > +	struct nfp_net_r_vector *r_vec = rx_ring->r_vec;
> > +	struct xsk_buff_pool *pool = r_vec->xsk_pool;
> > +	unsigned int wr_idx, wr_ptr_add = 0;
> > +	struct xdp_buff *xdp;
> > +
> > +	while (nfp_net_rx_space(rx_ring)) {
> > +		wr_idx = D_IDX(rx_ring, rx_ring->wr_p);
> > +
> > +		xdp = xsk_buff_alloc(pool);
> 
> Could you use xsk_buff_alloc_batch() with nfp_net_rx_space(rx_ring)
> passed as the requested count of xdp_bufs, instead of calling
> xsk_buff_alloc() in a loop?

I think I can; I will give it a try and post a patch. Thanks for the
pointer.
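
I expect the refill path to end up looking something like this (untested
sketch; NFP_NET_XSK_RX_BATCH is a made-up chunk size, used here only
because xsk_rxbufs is not a plain array of xdp_buff pointers that could
be handed to the batch API directly):

#define NFP_NET_XSK_RX_BATCH	16	/* hypothetical, for this sketch only */

void nfp_net_xsk_rx_ring_fill_freelist(struct nfp_net_rx_ring *rx_ring)
{
	struct nfp_net_r_vector *r_vec = rx_ring->r_vec;
	struct xsk_buff_pool *pool = r_vec->xsk_pool;
	struct xdp_buff *xdp_bufs[NFP_NET_XSK_RX_BATCH];
	unsigned int wr_idx, wr_ptr_add = 0;
	u32 i, n, want;

	for (;;) {
		want = min_t(u32, nfp_net_rx_space(rx_ring),
			     NFP_NET_XSK_RX_BATCH);
		if (!want)
			break;

		/* Grab up to 'want' buffers from the pool in one call. */
		n = xsk_buff_alloc_batch(pool, xdp_bufs, want);
		if (!n)
			break;

		for (i = 0; i < n; i++) {
			wr_idx = D_IDX(rx_ring, rx_ring->wr_p);

			nfp_net_xsk_rx_bufs_stash(rx_ring, wr_idx, xdp_bufs[i]);
			nfp_desc_set_dma_addr(&rx_ring->rxds[wr_idx].fld,
					      rx_ring->xsk_rxbufs[wr_idx].dma_addr);

			rx_ring->wr_p++;
			wr_ptr_add++;
		}

		if (n < want)
			break;
	}

	if (!wr_ptr_add)
		return;

	/* Ensure all records are visible before incrementing write counter. */
	wmb();
	nfp_qcp_wr_ptr_add(rx_ring->qcp_fl, wr_ptr_add);
}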

> 
> > +		if (!xdp)
> > +			break;
> > +
> > +		nfp_net_xsk_rx_bufs_stash(rx_ring, wr_idx, xdp);
> > +
> > +		nfp_desc_set_dma_addr(&rx_ring->rxds[wr_idx].fld,
> > +				      rx_ring->xsk_rxbufs[wr_idx].dma_addr);
> > +
> > +		rx_ring->wr_p++;
> > +		wr_ptr_add++;
> > +	}
> > +
> > +	/* Ensure all records are visible before incrementing write counter. */
> > +	wmb();
> > +	nfp_qcp_wr_ptr_add(rx_ring->qcp_fl, wr_ptr_add);
> > +}
> > +
> 
> (...)

-- 
Regards,
Niklas Söderlund
