Message-ID: <935e24d6f6b51b5aaee4cf086ad08474e75410b8.camel@gmail.com>
Date: Wed, 14 Dec 2022 08:14:58 -0800
From: Alexander H Duyck <alexander.duyck@...il.com>
To: William Tu <u9012063@...il.com>, netdev@...r.kernel.org
Cc: tuc@...are.com, gyang@...are.com, doshir@...are.com
Subject: Re: [RFC PATCH v5] vmxnet3: Add XDP support.
On Tue, 2022-12-13 at 16:05 -0800, William Tu wrote:
> The patch adds native-mode XDP support: XDP DROP, PASS, TX, and REDIRECT.
>
> Background:
> The vmxnet3 rx path consists of three rings: ring0, ring1, and dataring.
> For r0 and r1, buffers in r0 are allocated using the alloc_skb APIs and
> dma mapped to the ring's descriptors. If LRO is enabled and the packet
> size is larger than 3K, VMXNET3_MAX_SKB_BUF_SIZE, then r1 is used to map
> the rest of the buffer beyond VMXNET3_MAX_SKB_BUF_SIZE. Each buffer in
> r1 is allocated using alloc_page. So for LRO packets, the payload will
> be in one buffer from r0 and multiple buffers from r1; for non-LRO
> packets with size less than 3K, only one descriptor in r0 is used.
>
> When receiving a packet, the first descriptor will have the sop (start of
> packet) bit set, and the last descriptor will have the eop (end of packet)
> bit set. Non-LRO packets will have only one descriptor with both sop and
> eop set.
>
> Other than r0 and r1, the vmxnet3 dataring is specifically designed for
> handling small packets, usually up to 128 bytes, defined in
> VMXNET3_DEF_RXDATA_DESC_SIZE: the backend driver in ESXi simply copies
> the packet into the ring's memory region in the front-end vmxnet3
> driver, avoiding memory mapping/unmapping overhead. In summary, by
> packet size:
> A. < 128B: use dataring
> B. 128B - 3K: use ring0
> C. > 3K: use ring0 and ring1
> As a result, the patch adds XDP support for packets using dataring
> and r0 (case A and B), not the large packet size when LRO is enabled.
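
For illustration, the size-based selection above could be summarized as
follows; the enum and helper are made up for this note and are not part
of the driver, only the two constants come from vmxnet3_int.h:

	/* Illustrative only: mirrors cases A/B/C described above. */
	enum rx_path { RX_DATARING, RX_RING0, RX_RING0_AND_RING1 };

	static enum rx_path vmxnet3_rx_path_for_len(unsigned int len)
	{
		if (len <= VMXNET3_DEF_RXDATA_DESC_SIZE)  /* case A: <= 128B  */
			return RX_DATARING;
		if (len <= VMXNET3_MAX_SKB_BUF_SIZE)      /* case B: <= 3K    */
			return RX_RING0;
		return RX_RING0_AND_RING1;                /* case C: LRO, >3K */
	}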
>
> XDP Implementation:
> When a user loads an XDP prog, the vmxnet3 driver checks the
> configuration, such as MTU and LRO, and re-allocates the rx buffers to
> reserve the extra headroom, XDP_PACKET_HEADROOM, for the XDP frame. The
> XDP prog will then be associated with every rx queue of the device.
> Note that when the dataring is used for small packets, vmxnet3 (the
> front-end driver) doesn't control the buffer allocation; as a result,
> the XDP frame's headroom is zero.
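
As a rough sketch of the kind of configuration check described above
(the function name and the -EOPNOTSUPP policy are assumptions for this
note, not the patch's code):

	static int vmxnet3_xdp_check_config(struct net_device *netdev,
					    struct netlink_ext_ack *extack)
	{
		/* XDP frames must fit in a single ring0 buffer, so the MTU
		 * is bounded by the buffer size minus the reserved headroom
		 * and the trailing skb_shared_info.
		 */
		if (netdev->mtu > VMXNET3_MAX_SKB_BUF_SIZE -
				  XDP_PACKET_HEADROOM -
				  SKB_DATA_ALIGN(sizeof(struct skb_shared_info))) {
			NL_SET_ERR_MSG_MOD(extack, "MTU too large for XDP");
			return -EOPNOTSUPP;
		}
		if (netdev->features & NETIF_F_LRO) {
			NL_SET_ERR_MSG_MOD(extack, "LRO is not supported with XDP");
			return -EOPNOTSUPP;
		}
		return 0;
	}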
>
> The receive side of XDP is implemented for cases A and B by invoking
> the bpf program in vmxnet3_rq_rx_complete and handling its returned
> action. The new vmxnet3_run_xdp function handles the difference between
> the dataring and ring0 cases, and decides what happens to the packet
> afterward.
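
A minimal sketch of what running the prog on a received buffer looks
like, using the generic xdp_buff helpers; this is not the patch's
vmxnet3_run_xdp, and the frame-size/headroom handling here is an
assumption:

	static u32 run_xdp_on_buf(struct bpf_prog *prog,
				  struct xdp_rxq_info *rxq_info,
				  void *hard_start, unsigned int len,
				  bool dataring)
	{
		struct xdp_buff xdp;

		xdp_init_buff(&xdp, PAGE_SIZE, rxq_info);
		/* dataring buffers are copied by the backend and have no
		 * headroom; ring0 buffers reserve XDP_PACKET_HEADROOM in
		 * front of the packet data.
		 */
		xdp_prepare_buff(&xdp, hard_start,
				 dataring ? 0 : XDP_PACKET_HEADROOM,
				 len, false);
		return bpf_prog_run_xdp(prog, &xdp);
	}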
>
> For TX, vmxnet3 has a split header design. Outgoing packets are parsed
> first and the protocol headers (L2/L3/L4) are copied to the backend.
> The rest of the payload is dma mapped. Since XDP_TX does not parse the
> packet protocol, the entire XDP frame is dma mapped for transmission.
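
A bare-bones sketch of that idea, assuming a helper of this shape (the
name and error handling are made up; the tx descriptor setup is
omitted):

	static int vmxnet3_map_xdpf_for_tx(struct vmxnet3_adapter *adapter,
					   struct xdp_frame *xdpf,
					   dma_addr_t *dma)
	{
		/* The whole frame (headers + payload) is mapped as one
		 * buffer; no L2/L3/L4 header split is done for XDP_TX.
		 */
		*dma = dma_map_single(&adapter->pdev->dev, xdpf->data,
				      xdpf->len, DMA_TO_DEVICE);
		if (dma_mapping_error(&adapter->pdev->dev, *dma))
			return -ENOMEM;
		return 0;
	}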
>
> Performance:
> Tested using two VMs inside one ESXi machine, using a single core on
> each vmxnet3 device. The sender runs DPDK testpmd in tx-only mode
> attached to the vmxnet3 device, sending 64B or 512B packets.
>
> VM1 txgen:
> $ dpdk-testpmd -l 0-3 -n 1 -- -i --nb-cores=3 \
> --forward-mode=txonly --eth-peer=0,<mac addr of vm2>
> option: add "--txonly-multi-flow"
> option: use --txpkts=512 or 64 byte
>
> VM2 running XDP:
> $ ./samples/bpf/xdp_rxq_info -d ens160 -a <options> --skb-mode
> $ ./samples/bpf/xdp_rxq_info -d ens160 -a <options>
> options: XDP_DROP, XDP_PASS, XDP_TX
>
> To test REDIRECT to cpu 0, use
> $ ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e drop
>
> Single core performance comparison with skb-mode.
> 64B: skb-mode -> native-mode (with this patch)
> XDP_DROP: 960Kpps -> 2.4Mpps
> XDP_PASS: 240Kpps -> 499Kpps
> XDP_TX: 683Kpps -> 2.3Mpps
> REDIRECT: 389Kpps -> 449Kpps
> Same performance as in v2.
>
> 512B: skb-mode -> native-mode v2 -> native-mode v3 (this patch)
> XDP_DROP: 640Kpps -> 914Kpps -> 1.3Mpps
> XDP_PASS: 220Kpps -> 240Kpps -> 280Kpps
> XDP_TX: 483Kpps -> 886Kpps -> 1.3Mpps
> REDIRECT: 365Kpps -> 1.2Mpps(?) -> 261Kpps
>
> Good performance improvement over v2, due to skipping
> the skb allocation.
>
> Limitations:
> a. LRO will be disabled when users load an XDP program
> b. MTU will be checked and limited to
> VMXNET3_MAX_SKB_BUF_SIZE(3K) - XDP_PACKET_HEADROOM(256) -
> SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
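
For a sense of scale (the exact value depends on the kernel config, so
treat these numbers as an assumption): with a 3072-byte
VMXNET3_MAX_SKB_BUF_SIZE, 256 bytes of XDP_PACKET_HEADROOM, and
SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) of roughly 320 bytes on
a typical x86_64 build, the resulting MTU limit would be about
3072 - 256 - 320 = 2496 bytes.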
>
> Signed-off-by: William Tu <u9012063@...il.com>
> ---
> v4 -> v5:
> - move XDP code to separate file: vmxnet3_xdp.{c, h},
> suggested by Guolin
> - expose vmxnet3_rq_create_all and vmxnet3_adjust_rx_ring_size
> - more tests using samples/bpf/{xdp1, xdp2, xdp_adjust_tail}
> - add debug print
> - rebase on commit 65e6af6cebe
>
> v3 -> v4:
> - code refactoring and improved commit message
> - make the dataring and non-dataring cases clear
> - in XDP_PASS, handle xdp.data and xdp.data_end change after
> bpf program executed
> - now still working on internal testing
> - v4 applied on net-next commit 65e6af6cebef
>
> v2 -> v3:
> - code refactoring: move the XDP processing to the front
> of the vmxnet3_rq_rx_complete, and minimize the places
> of changes to existing code.
> - Performance improvement over BUF_SKB (512B) due to skipping
> skb allocation when DROP and TX.
>
> v1 -> v2:
> - Avoid skb allocation for small packet size (when dataring is used)
> - Use rcu_read_lock unlock instead of READ_ONCE
> - Performance improvement over v1
> - Merge xdp drop, tx, pass, and redirect into 1 patch
>
> I tested the patch using the script below:
> while [ true ]; do
> timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_DROP --skb-mode
> timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_DROP
> timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_PASS --skb-mode
> timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_PASS
> timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_TX --skb-mode
> timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_TX
> timeout 20 ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e drop
> timeout 20 ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e pass
> done
> ---
> drivers/net/vmxnet3/Makefile | 2 +-
> drivers/net/vmxnet3/vmxnet3_drv.c | 48 ++-
> drivers/net/vmxnet3/vmxnet3_ethtool.c | 14 +
> drivers/net/vmxnet3/vmxnet3_int.h | 20 ++
> drivers/net/vmxnet3/vmxnet3_xdp.c | 454 ++++++++++++++++++++++++++
> drivers/net/vmxnet3/vmxnet3_xdp.h | 39 +++
> 6 files changed, 569 insertions(+), 8 deletions(-)
> create mode 100644 drivers/net/vmxnet3/vmxnet3_xdp.c
> create mode 100644 drivers/net/vmxnet3/vmxnet3_xdp.h
>
>
<...>
> +int
> +vmxnet3_process_xdp(struct vmxnet3_adapter *adapter,
> +                    struct vmxnet3_rx_queue *rq,
> +                    struct Vmxnet3_RxCompDesc *rcd,
> +                    struct vmxnet3_rx_buf_info *rbi,
> +                    struct Vmxnet3_RxDesc *rxd,
> +                    bool *need_flush)
> +{
> +        struct bpf_prog *xdp_prog;
> +        dma_addr_t new_dma_addr;
> +        struct sk_buff *new_skb;
> +        bool rxDataRingUsed;
> +        int ret, act;
> +
> +        ret = VMXNET3_XDP_CONTINUE;
> +        if (unlikely(rcd->len == 0))
> +                return VMXNET3_XDP_TAKEN;
> +
> +        rxDataRingUsed = VMXNET3_RX_DATA_RING(adapter, rcd->rqID);
> +        rcu_read_lock();
> +        xdp_prog = rcu_dereference(rq->xdp_bpf_prog);
> +        if (!xdp_prog) {
> +                rcu_read_unlock();
> +                return VMXNET3_XDP_CONTINUE;
> +        }
> +        act = vmxnet3_run_xdp(rq, rbi, rcd, need_flush, rxDataRingUsed);
> +        rcu_read_unlock();
> +
> +        switch (act) {
> +        case XDP_PASS:
> +                ret = VMXNET3_XDP_CONTINUE;
> +                break;
> +        case XDP_DROP:
> +        case XDP_TX:
> +        case XDP_REDIRECT:
> +        case XDP_ABORTED:
> +        default:
> +                /* Reuse and remap the existing buffer. */
> +                ret = VMXNET3_XDP_TAKEN;
> +                if (rxDataRingUsed)
> +                        return ret;
> +
> +                new_skb = rbi->skb;
> +                new_dma_addr =
> +                        dma_map_single(&adapter->pdev->dev,
> +                                       new_skb->data, rbi->len,
> +                                       DMA_FROM_DEVICE);
> +                if (dma_mapping_error(&adapter->pdev->dev,
> +                                      new_dma_addr)) {
> +                        dev_kfree_skb(new_skb);
> +                        rq->stats.drop_total++;
> +                        return ret;
> +                }
> +                rbi->dma_addr = new_dma_addr;
> +                rxd->addr = cpu_to_le64(rbi->dma_addr);
> +                rxd->len = rbi->len;
> +        }
> +        return ret;
> +}

For XDP_DROP and XDP_ABORTED this makes sense. You might want to add a
tracepoint in the aborted case just so you can catch such cases for
debugging.

However, for XDP_TX and XDP_REDIRECT, shouldn't both of those be calling
out to separate functions to either place the frame on a Tx ring or to
hand it off to xdp_do_redirect so that the frame gets routed where it
needs to go? Also, don't you run a risk of overwriting frames that
might be waiting on transmit?
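
For reference, a minimal sketch of such a dispatch, loosely modeled on
how other drivers handle these actions; vmxnet3_xdp_xmit_back is a
hypothetical helper, and this assumes the xdp_buff and xdp_prog are
still in scope at this point in vmxnet3_process_xdp:

	switch (act) {
	case XDP_PASS:
		return VMXNET3_XDP_CONTINUE;
	case XDP_TX:
		/* Hypothetical helper: place the frame on a Tx ring. */
		if (vmxnet3_xdp_xmit_back(rq, &xdp) < 0)
			goto drop;
		return VMXNET3_XDP_TAKEN;
	case XDP_REDIRECT:
		if (xdp_do_redirect(rq->adapter->netdev, &xdp, xdp_prog) < 0)
			goto drop;
		*need_flush = true;	/* xdp_do_flush() at end of NAPI poll */
		return VMXNET3_XDP_TAKEN;
	default:
		bpf_warn_invalid_xdp_action(rq->adapter->netdev, xdp_prog, act);
		fallthrough;
	case XDP_ABORTED:
		trace_xdp_exception(rq->adapter->netdev, xdp_prog, act);
		fallthrough;
	case XDP_DROP:
drop:
		/* Recycle and remap the Rx buffer as in the patch above. */
		return VMXNET3_XDP_TAKEN;
	}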