netdev - Re: [PATCH net-next 4/5] tsnep: Add XDP socket zero-copy RX support

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZDAddB+bF7W2OhY1@boxer>
Date:   Fri, 7 Apr 2023 15:41:08 +0200
From:   Maciej Fijalkowski <maciej.fijalkowski@...el.com>
To:     Gerhard Engleder <gerhard@...leder-embedded.com>
CC:     <netdev@...r.kernel.org>, <bpf@...r.kernel.org>,
        <davem@...emloft.net>, <kuba@...nel.org>, <edumazet@...gle.com>,
        <pabeni@...hat.com>, <bjorn@...nel.org>,
        <magnus.karlsson@...el.com>, <jonathan.lemon@...il.com>
Subject: Re: [PATCH net-next 4/5] tsnep: Add XDP socket zero-copy RX support

On Wed, Apr 05, 2023 at 09:13:58PM +0200, Gerhard Engleder wrote:
> On 03.04.23 19:26, Maciej Fijalkowski wrote:
> > On Mon, Apr 03, 2023 at 07:19:15PM +0200, Maciej Fijalkowski wrote:
> > > On Sun, Apr 02, 2023 at 09:38:37PM +0200, Gerhard Engleder wrote:
> > > 
> > > Hey Gerhard,
> > > 
> > > > Add support for XSK zero-copy to RX path. The setup of the XSK pool can
> > > > be done at runtime. If the netdev is running, then the queue must be
> > > > disabled and enabled during reconfiguration. This can be done easily
> > > > with functions introduced in previous commits.
> > > > 
> > > > A more important property is that, if the netdev is running, then the
> > > > setup of the XSK pool shall not stop the netdev in case of errors. A
> > > > broken netdev after a failed XSK pool setup is bad behavior. Therefore,
> > > > the allocation and setup of resources during XSK pool setup is done only
> > > > before any queue is disabled. Additionally, freeing and later allocation
> > > > of resources is eliminated in some cases. Page pool entries are kept for
> > > > later use. Two memory models are registered in parallel. As a result,
> > > > the XSK pool setup cannot fail during queue reconfiguration.
> > > > 
> > > > In contrast to other drivers, XSK pool setup and XDP BPF program setup
> > > > are separate actions. XSK pool setup can be done without any XDP BPF
> > > > program. The XDP BPF program can be added, removed or changed without
> > > > any reconfiguration of the XSK pool.
> > > 
> > > I won't argue about your design, but I'd be glad if you would present any
> > > perf numbers (ZC vs copy mode) just to give us some overview how your
> > > implementation works out. Also, please consider using batching APIs and
> > > see if this gives you any boost (my assumption is that it would).
> > > 
> > > > 
> > > > Signed-off-by: Gerhard Engleder <gerhard@...leder-embedded.com>
> > > > ---
> > > >   drivers/net/ethernet/engleder/tsnep.h      |   7 +
> > > >   drivers/net/ethernet/engleder/tsnep_main.c | 432 ++++++++++++++++++++-
> > > >   drivers/net/ethernet/engleder/tsnep_xdp.c  |  67 ++++
> > > >   3 files changed, 488 insertions(+), 18 deletions(-)
> > 
> > (...)
> > 
> > > > +
> > > >   static bool tsnep_xdp_run_prog(struct tsnep_rx *rx, struct bpf_prog *prog,
> > > >   			       struct xdp_buff *xdp, int *status,
> > > > -			       struct netdev_queue *tx_nq, struct tsnep_tx *tx)
> > > > +			       struct netdev_queue *tx_nq, struct tsnep_tx *tx,
> > > > +			       bool zc)
> > > >   {
> > > >   	unsigned int length;
> > > > -	unsigned int sync;
> > > >   	u32 act;
> > > >   	length = xdp->data_end - xdp->data_hard_start - XDP_PACKET_HEADROOM;
> > > >   	act = bpf_prog_run_xdp(prog, xdp);
> > > > -
> > > > -	/* Due xdp_adjust_tail: DMA sync for_device cover max len CPU touch */
> > > > -	sync = xdp->data_end - xdp->data_hard_start - XDP_PACKET_HEADROOM;
> > > > -	sync = max(sync, length);
> > > > -
> > > >   	switch (act) {
> > > >   	case XDP_PASS:
> > > >   		return false;
> > > > @@ -1027,8 +1149,21 @@ static bool tsnep_xdp_run_prog(struct tsnep_rx *rx, struct bpf_prog *prog,
> > > >   		trace_xdp_exception(rx->adapter->netdev, prog, act);
> > > >   		fallthrough;
> > > >   	case XDP_DROP:
> > > > -		page_pool_put_page(rx->page_pool, virt_to_head_page(xdp->data),
> > > > -				   sync, true);
> > > > +		if (zc) {
> > > > +			xsk_buff_free(xdp);
> > > > +		} else {
> > > > +			unsigned int sync;
> > > > +
> > > > +			/* Due xdp_adjust_tail: DMA sync for_device cover max
> > > > +			 * len CPU touch
> > > > +			 */
> > > > +			sync = xdp->data_end - xdp->data_hard_start -
> > > > +			       XDP_PACKET_HEADROOM;
> > > > +			sync = max(sync, length);
> > > > +			page_pool_put_page(rx->page_pool,
> > > > +					   virt_to_head_page(xdp->data), sync,
> > > > +					   true);
> > > > +		}
> > > >   		return true;
> > > >   	}
> > > >   }
> > > > @@ -1181,7 +1316,8 @@ static int tsnep_rx_poll(struct tsnep_rx *rx, struct napi_struct *napi,
> > > >   					 length, false);
> > > >   			consume = tsnep_xdp_run_prog(rx, prog, &xdp,
> > > > -						     &xdp_status, tx_nq, tx);
> > > > +						     &xdp_status, tx_nq, tx,
> > > > +						     false);
> > > >   			if (consume) {
> > > >   				rx->packets++;
> > > >   				rx->bytes += length;
> > > > @@ -1205,6 +1341,125 @@ static int tsnep_rx_poll(struct tsnep_rx *rx, struct napi_struct *napi,
> > > >   	return done;
> > > >   }
> > > > +static int tsnep_rx_poll_zc(struct tsnep_rx *rx, struct napi_struct *napi,
> > > > +			    int budget)
> > > > +{
> > > > +	struct tsnep_rx_entry *entry;
> > > > +	struct netdev_queue *tx_nq;
> > > > +	struct bpf_prog *prog;
> > > > +	struct tsnep_tx *tx;
> > > > +	int desc_available;
> > > > +	int xdp_status = 0;
> > > > +	struct page *page;
> > > > +	int done = 0;
> > > > +	int length;
> > > > +
> > > > +	desc_available = tsnep_rx_desc_available(rx);
> > > > +	prog = READ_ONCE(rx->adapter->xdp_prog);
> > > > +	if (prog) {
> > > > +		tx_nq = netdev_get_tx_queue(rx->adapter->netdev,
> > > > +					    rx->tx_queue_index);
> > > > +		tx = &rx->adapter->tx[rx->tx_queue_index];
> > > > +	}
> > > > +
> > > > +	while (likely(done < budget) && (rx->read != rx->write)) {
> > > > +		entry = &rx->entry[rx->read];
> > > > +		if ((__le32_to_cpu(entry->desc_wb->properties) &
> > > > +		     TSNEP_DESC_OWNER_COUNTER_MASK) !=
> > > > +		    (entry->properties & TSNEP_DESC_OWNER_COUNTER_MASK))
> > > > +			break;
> > > > +		done++;
> > > > +
> > > > +		if (desc_available >= TSNEP_RING_RX_REFILL) {
> > > > +			bool reuse = desc_available >= TSNEP_RING_RX_REUSE;
> > > > +
> > > > +			desc_available -= tsnep_rx_refill_zc(rx, desc_available,
> > > > +							     reuse);
> > > > +			if (!entry->xdp) {
> > > > +				/* buffer has been reused for refill to prevent
> > > > +				 * empty RX ring, thus buffer cannot be used for
> > > > +				 * RX processing
> > > > +				 */
> > > > +				rx->read = (rx->read + 1) % TSNEP_RING_SIZE;
> > > > +				desc_available++;
> > > > +
> > > > +				rx->dropped++;
> > > > +
> > > > +				continue;
> > > > +			}
> > > > +		}
> > > > +
> > > > +		/* descriptor properties shall be read first, because valid data
> > > > +		 * is signaled there
> > > > +		 */
> > > > +		dma_rmb();
> > > > +
> > > > +		prefetch(entry->xdp->data);
> > > > +		length = __le32_to_cpu(entry->desc_wb->properties) &
> > > > +			 TSNEP_DESC_LENGTH_MASK;
> > > > +		entry->xdp->data_end = entry->xdp->data + length;
> > > > +		xsk_buff_dma_sync_for_cpu(entry->xdp, rx->xsk_pool);
> > > > +
> > > > +		/* RX metadata with timestamps is in front of actual data,
> > > > +		 * subtract metadata size to get length of actual data and
> > > > +		 * consider metadata size as offset of actual data during RX
> > > > +		 * processing
> > > > +		 */
> > > > +		length -= TSNEP_RX_INLINE_METADATA_SIZE;
> > > > +
> > > > +		rx->read = (rx->read + 1) % TSNEP_RING_SIZE;
> > > > +		desc_available++;
> > > > +
> > > > +		if (prog) {
> > > > +			bool consume;
> > > > +
> > > > +			entry->xdp->data += TSNEP_RX_INLINE_METADATA_SIZE;
> > > > +			entry->xdp->data_meta += TSNEP_RX_INLINE_METADATA_SIZE;
> > > > +
> > > > +			consume = tsnep_xdp_run_prog(rx, prog, entry->xdp,
> > > > +						     &xdp_status, tx_nq, tx,
> > > > +						     true);
> > > 
> > > reason for separate xdp run prog routine for ZC was usually "likely-fying"
> > > XDP_REDIRECT action as this is the main action for AF_XDP which was giving
> > > us perf improvement. Please try this out on your side to see if this
> > > yields any positive value.
> > 
> > One more thing - you have to handle XDP_TX action in a ZC specific way.
> > Your current code will break if you enable xsk_pool and return XDP_TX from
> > XDP prog.
> 
> I took again a look to igc, but I didn't found any specifics for XDP_TX
> ZC. Only some buffer flipping, which I assume is needed for shared
> pages.
> For ice I see a call to xdp_convert_buff_to_frame() in ZC path, which
> has some XSK logic within. Is this the ZC specific way? igc calls
> xdp_convert_buff_to_frame() in both cases, so I'm not sure. But I will
> try the XDP_TX action. I did test only with xdpsock.

I think I will back off a bit with a statement that your XDP_TX is clearly
broken, here's why.

igc when converting xdp_buff to xdp_frame and xdp_buff's memory being
backed by xsk_buff_pool will grab the new page from kernel, copy the
contents of xdp_buff to it, recycle xdp_buff back to xsk_buff_pool and
return new page back to driver (i have just described what
xdp_convert_zc_to_xdp_frame() is doing). Thing is that it is expensive and
hurts perf and we stepped away from this on ice, this is a matter of
storing the xdp_buff onto adequate Tx buffer struct that you can access
while cleaning Tx descriptors so that you'll be able to xsk_buff_free()
it.

So saying 'your current code will break' might have been too much from my
side. Just make sure that your XDP_TX action on ZC works. In order to do
that, i was loading xdpsock with XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD flag
on a queue receiving frames and then running xdp_rxq_info with XDP_TX
action.