[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CALDO+Sa3uFqd6SCd5sKhpcDcoBaniC-Wm0wci+sk+e2eyRhP4A@mail.gmail.com>
Date: Fri, 16 Dec 2022 15:55:59 -0800
From: William Tu <u9012063@...il.com>
To: Alexander H Duyck <alexander.duyck@...il.com>
Cc: netdev@...r.kernel.org, tuc@...are.com, gyang@...are.com,
doshir@...are.com
Subject: Re: [PATCH v6] vmxnet3: Add XDP support.
On Fri, Dec 16, 2022 at 2:53 PM William Tu <u9012063@...il.com> wrote:
>
> Hi Alexander,
> Thanks for taking a look at this patch!
>
> On Fri, Dec 16, 2022 at 9:14 AM Alexander H Duyck
> <alexander.duyck@...il.com> wrote:
> >
> > On Fri, 2022-12-16 at 06:41 -0800, William Tu wrote:
> > > The patch adds native-mode XDP support: XDP DROP, PASS, TX, and REDIRECT.
> > >
> > > Background:
> > > The vmxnet3 rx consists of three rings: ring0, ring1, and dataring.
> > > For r0 and r1, buffers at r0 are allocated using alloc_skb APIs and dma
> > > mapped to the ring's descriptor. If LRO is enabled and packet size larger
> > > than 3K, VMXNET3_MAX_SKB_BUF_SIZE, then r1 is used to mapped the rest of
> > > the buffer larger than VMXNET3_MAX_SKB_BUF_SIZE. Each buffer in r1 is
> > > allocated using alloc_page. So for LRO packets, the payload will be in one
> > > buffer from r0 and multiple from r1, for non-LRO packets, only one
> > > descriptor in r0 is used for packet size less than 3k.
> > >
> > > When receiving a packet, the first descriptor will have the sop (start of
> > > packet) bit set, and the last descriptor will have the eop (end of packet)
> > > bit set. Non-LRO packets will have only one descriptor with both sop and
> > > eop set.
> > >
> > > Other than r0 and r1, vmxnet3 dataring is specifically designed for
> > > handling packets with small size, usually 128 bytes, defined in
> > > VMXNET3_DEF_RXDATA_DESC_SIZE, by simply copying the packet from the backend
> > > driver in ESXi to the ring's memory region at front-end vmxnet3 driver, in
> > > order to avoid memory mapping/unmapping overhead. In summary, packet size:
> > > A. < 128B: use dataring
> > > B. 128B - 3K: use ring0 (VMXNET3_RX_BUF_SKB)
> > > C. > 3K: use ring0 and ring1 (VMXNET3_RX_BUF_SKB + VMXNET3_RX_BUF_PAGE)
> > > As a result, the patch adds XDP support for packets using dataring
> > > and r0 (case A and B), not the large packet size when LRO is enabled.
> > >
> > > XDP Implementation:
> > > When user loads and XDP prog, vmxnet3 driver checks configurations, such
> > > as mtu, lro, and re-allocate the rx buffer size for reserving the extra
> > > headroom, XDP_PACKET_HEADROOM, for XDP frame. The XDP prog will then be
> > > associated with every rx queue of the device. Note that when using dataring
> > > for small packet size, vmxnet3 (front-end driver) doesn't control the
> > > buffer allocation, as a result the XDP frame's headroom is zero.
> > >
> > > The receive side of XDP is implemented for case A and B, by invoking the
> > > bpf program at vmxnet3_rq_rx_complete and handle its returned action.
> > > The new vmxnet3_run_xdp function handles the difference of using dataring
> > > or ring0, and decides the next journey of the packet afterward.
> > >
> > > For TX, vmxnet3 has split header design. Outgoing packets are parsed
> > > first and protocol headers (L2/L3/L4) are copied to the backend. The
> > > rest of the payload are dma mapped. Since XDP_TX does not parse the
> > > packet protocol, the entire XDP frame is using dma mapped for the
> > > transmission. Because the actual TX is deferred until txThreshold, it's
> > > possible that an rx buffer is being overwritten by incoming burst of rx
> > > packets, before tx buffer being transmitted. As a result, we allocate new
> > > skb and introduce a copy.
> > >
> > > Performance:
> > > Tested using two VMs inside one ESXi machine, using single core on each
> > > vmxnet3 device, sender using DPDK testpmd tx-mode attached to vmxnet3
> > > device, sending 64B or 512B packet.
> > >
> > > VM1 txgen:
> > > $ dpdk-testpmd -l 0-3 -n 1 -- -i --nb-cores=3 \
> > > --forward-mode=txonly --eth-peer=0,<mac addr of vm2>
> > > option: add "--txonly-multi-flow"
> > > option: use --txpkts=512 or 64 byte
> > >
> > > VM2 running XDP:
> > > $ ./samples/bpf/xdp_rxq_info -d ens160 -a <options> --skb-mode
> > > $ ./samples/bpf/xdp_rxq_info -d ens160 -a <options>
> > > options: XDP_DROP, XDP_PASS, XDP_TX
> > >
> > > To test REDIRECT to cpu 0, use
> > > $ ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e drop
> > >
> > > Single core performance comparison with skb-mode.
> > > 64B: skb-mode -> native-mode (with this patch)
> > > XDP_DROP: 932Kpps -> 2.0Mpps
> > > XDP_PASS: 284Kpps -> 314Kpps
> > > XDP_TX: 591Kpps -> 1.8Mpps
> > > REDIRECT: 453Kpps -> 501Kpps
> > >
> > > 512B: skb-mode -> native-mode (with this patch)
> > > XDP_DROP: 890Kpps -> 1.3Mpps
> > > XDP_PASS: 284Kpps -> 314Kpps
> > > XDP_TX: 555Kpps -> 1.2Mpps
> > > REDIRECT: 670Kpps -> 430Kpps
> > >
> >
> > I hadn't noticed it before. Based on this it looks like native mode is
> > performing worse then skb-mode for redirect w/ 512B packets? Have you
> > looked into why that might be?
>
> yes, I noticed it but don't know why, maybe it's due to extra copy and page
> allocation like you said below. I will dig deeper.
>
> >
> > My main concern would be that you are optimizing for recyling in the Tx
> > and Redirect paths, when you might be better off just releasing the
> > buffers and batch allocating new pages in your Rx path.
>
> right, are you talking about using the page pool allocator, ex: slide 8 below
> https://legacy.netdevconf.info/0x14/pub/slides/10/add-xdp-on-driver.pdf
> I tried it before but then I found I have to replace lots of existing vmxnet3
> code, basically replacing all the rx/tx buffer allocation code with new
> page pool api, even without XDP. I'd love to give it a try, do you think it's
> worth doing it?
>
> >
> > > Limitations:
> > > a. LRO will be disabled when users load XDP program
> > > b. MTU will be checked and limit to
> > > VMXNET3_MAX_SKB_BUF_SIZE(3K) - XDP_PACKET_HEADROOM(256) -
> > > SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
> > >
> > > Signed-off-by: William Tu <u9012063@...il.com>
> > > ---
> > > v5 -> v6:
> > > work on feedbacks from Alexander Duyck
> > > - remove unused skb parameter at vmxnet3_xdp_xmit
> > > - add trace point for XDP_ABORTED
> > > - avoid TX packet buffer being overwritten by allocatin
> > > new skb and memcpy (TX performance drop from 2.3 to 1.8Mpps)
> > > - update some of the performance number and a demo video
> > > https://youtu.be/I3nx5wQDTXw
> > > - pass VMware internal regression test using non-ENS: vmxnet3v2,
> > > vmxnet3v3, vmxnet3v4, so remove RFC tag.
> > >
> > > v4 -> v5:
> > > - move XDP code to separate file: vmxnet3_xdp.{c, h},
> > > suggested by Guolin
> > > - expose vmxnet3_rq_create_all and vmxnet3_adjust_rx_ring_size
> > > - more test using samples/bpf/{xdp1, xdp2, xdp_adjust_tail}
> > > - add debug print
> > > - rebase on commit 65e6af6cebe
> > >
> > > v3 -> v4:
> > > - code refactoring and improved commit message
> > > - make dataring and non-dataring case clear
> > > - in XDP_PASS, handle xdp.data and xdp.data_end change after
> > > bpf program executed
> > > - now still working on internal testing
> > > - v4 applied on net-next commit 65e6af6cebef
> > >
> > > v2 -> v3:
> > > - code refactoring: move the XDP processing to the front
> > > of the vmxnet3_rq_rx_complete, and minimize the places
> > > of changes to existing code.
> > > - Performance improvement over BUF_SKB (512B) due to skipping
> > > skb allocation when DROP and TX.
> > >
> > > v1 -> v2:
> > > - Avoid skb allocation for small packet size (when dataring is used)
> > > - Use rcu_read_lock unlock instead of READ_ONCE
> > > - Peroformance improvement over v1
> > > - Merge xdp drop, tx, pass, and redirect into 1 patch
> > >
> > > I tested the patch using below script:
> > > while [ true ]; do
> > > timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_DROP --skb-mode
> > > timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_DROP
> > > timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_PASS --skb-mode
> > > timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_PASS
> > > timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_TX --skb-mode
> > > timeout 20 ./samples/bpf/xdp_rxq_info -d ens160 -a XDP_TX
> > > timeout 20 ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e drop
> > > timeout 20 ./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e pass
> > > done
> > > ---
> > > drivers/net/vmxnet3/Makefile | 2 +-
> > > drivers/net/vmxnet3/vmxnet3_drv.c | 48 ++-
> > > drivers/net/vmxnet3/vmxnet3_ethtool.c | 14 +
> > > drivers/net/vmxnet3/vmxnet3_int.h | 20 ++
> > > drivers/net/vmxnet3/vmxnet3_xdp.c | 458 ++++++++++++++++++++++++++
> > > drivers/net/vmxnet3/vmxnet3_xdp.h | 39 +++
> > > 6 files changed, 573 insertions(+), 8 deletions(-)
> > > create mode 100644 drivers/net/vmxnet3/vmxnet3_xdp.c
> > > create mode 100644 drivers/net/vmxnet3/vmxnet3_xdp.h
> > >
> > > diff --git a/drivers/net/vmxnet3/Makefile b/drivers/net/vmxnet3/Makefile
> > > index a666a88ac1ff..f82870c10205 100644
> > > --- a/drivers/net/vmxnet3/Makefile
> > > +++ b/drivers/net/vmxnet3/Makefile
> > > @@ -32,4 +32,4 @@
> > >
> > > obj-$(CONFIG_VMXNET3) += vmxnet3.o
> > >
> > > -vmxnet3-objs := vmxnet3_drv.o vmxnet3_ethtool.o
> > > +vmxnet3-objs := vmxnet3_drv.o vmxnet3_ethtool.o vmxnet3_xdp.o
> > > diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
> > > index d3e7b27eb933..b55fec2ac2bf 100644
> > > --- a/drivers/net/vmxnet3/vmxnet3_drv.c
> > > +++ b/drivers/net/vmxnet3/vmxnet3_drv.c
> > > @@ -28,6 +28,7 @@
> > > #include <net/ip6_checksum.h>
> > >
> > > #include "vmxnet3_int.h"
> > > +#include "vmxnet3_xdp.h"
> > >
> > > char vmxnet3_driver_name[] = "vmxnet3";
> > > #define VMXNET3_DRIVER_DESC "VMware vmxnet3 virtual NIC driver"
> > > @@ -351,7 +352,6 @@ vmxnet3_unmap_pkt(u32 eop_idx, struct vmxnet3_tx_queue *tq,
> > > BUG_ON(VMXNET3_TXDESC_GET_EOP(&(tq->tx_ring.base[eop_idx].txd)) != 1);
> > >
> > > skb = tq->buf_info[eop_idx].skb;
> > > - BUG_ON(skb == NULL);
> > > tq->buf_info[eop_idx].skb = NULL;
> > >
> > > VMXNET3_INC_RING_IDX_ONLY(eop_idx, tq->tx_ring.size);
> > > @@ -592,6 +592,9 @@ vmxnet3_rq_alloc_rx_buf(struct vmxnet3_rx_queue *rq, u32 ring_idx,
> > > rbi->skb = __netdev_alloc_skb_ip_align(adapter->netdev,
> > > rbi->len,
> > > GFP_KERNEL);
> > > + if (adapter->xdp_enabled)
> > > + skb_reserve(rbi->skb, XDP_PACKET_HEADROOM);
> > > +
> > > if (unlikely(rbi->skb == NULL)) {
> > > rq->stats.rx_buf_alloc_failure++;
> > > break;
> > > @@ -1404,6 +1407,8 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
> > > struct Vmxnet3_RxDesc rxCmdDesc;
> > > struct Vmxnet3_RxCompDesc rxComp;
> > > #endif
> > > + bool need_flush = 0;
> > > +
> > > vmxnet3_getRxComp(rcd, &rq->comp_ring.base[rq->comp_ring.next2proc].rcd,
> > > &rxComp);
> > > while (rcd->gen == rq->comp_ring.gen) {
> > > @@ -1444,6 +1449,19 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
> > > goto rcd_done;
> > > }
> > >
> > > + if (unlikely(rcd->sop && rcd->eop && adapter->xdp_enabled)) {
> > > + int act = vmxnet3_process_xdp(adapter, rq, rcd, rbi,
> > > + rxd, &need_flush);
> > > + ctx->skb = NULL;
> > > + switch (act) {
> > > + case VMXNET3_XDP_TAKEN:
> > > + goto rcd_done;
> > > + case VMXNET3_XDP_CONTINUE:
> > > + default:
> > > + break;
> > > + }
> > > + }
> > > +
> > > if (rcd->sop) { /* first buf of the pkt */
> > > bool rxDataRingUsed;
> > > u16 len;
> > > @@ -1483,6 +1501,10 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
> > > goto rcd_done;
> > > }
> > >
> > > + if (adapter->xdp_enabled && !rxDataRingUsed)
> > > + skb_reserve(new_skb,
> > > + XDP_PACKET_HEADROOM);
> > > +
> > > if (rxDataRingUsed) {
> > > size_t sz;
> > >
> > > @@ -1730,6 +1752,8 @@ vmxnet3_rq_rx_complete(struct vmxnet3_rx_queue *rq,
> > > vmxnet3_getRxComp(rcd,
> > > &rq->comp_ring.base[rq->comp_ring.next2proc].rcd, &rxComp);
> > > }
> > > + if (need_flush)
> > > + xdp_do_flush_map();
> > >
> > > return num_pkts;
> > > }
> > > @@ -1776,6 +1800,7 @@ vmxnet3_rq_cleanup(struct vmxnet3_rx_queue *rq,
> > >
> > > rq->comp_ring.gen = VMXNET3_INIT_GEN;
> > > rq->comp_ring.next2proc = 0;
> > > + rq->xdp_bpf_prog = NULL;
> > > }
> > >
> > >
> > > @@ -1788,7 +1813,6 @@ vmxnet3_rq_cleanup_all(struct vmxnet3_adapter *adapter)
> > > vmxnet3_rq_cleanup(&adapter->rx_queue[i], adapter);
> > > }
> > >
> > > -
> > > static void vmxnet3_rq_destroy(struct vmxnet3_rx_queue *rq,
> > > struct vmxnet3_adapter *adapter)
> > > {
> > > @@ -1832,6 +1856,8 @@ static void vmxnet3_rq_destroy(struct vmxnet3_rx_queue *rq,
> > > kfree(rq->buf_info[0]);
> > > rq->buf_info[0] = NULL;
> > > rq->buf_info[1] = NULL;
> > > +
> > > + vmxnet3_unregister_xdp_rxq(rq);
> > > }
> > >
> > > static void
> > > @@ -1893,6 +1919,10 @@ vmxnet3_rq_init(struct vmxnet3_rx_queue *rq,
> > > }
> > > vmxnet3_rq_alloc_rx_buf(rq, 1, rq->rx_ring[1].size - 1, adapter);
> > >
> > > + /* always register, even if no XDP prog used */
> > > + if (vmxnet3_register_xdp_rxq(rq, adapter))
> > > + return -EINVAL;
> > > +
> > > /* reset the comp ring */
> > > rq->comp_ring.next2proc = 0;
> > > memset(rq->comp_ring.base, 0, rq->comp_ring.size *
> > > @@ -1989,7 +2019,7 @@ vmxnet3_rq_create(struct vmxnet3_rx_queue *rq, struct vmxnet3_adapter *adapter)
> > > }
> > >
> > >
> > > -static int
> > > +int
> > > vmxnet3_rq_create_all(struct vmxnet3_adapter *adapter)
> > > {
> > > int i, err = 0;
> > > @@ -2585,7 +2615,8 @@ vmxnet3_setup_driver_shared(struct vmxnet3_adapter *adapter)
> > > if (adapter->netdev->features & NETIF_F_RXCSUM)
> > > devRead->misc.uptFeatures |= UPT1_F_RXCSUM;
> > >
> > > - if (adapter->netdev->features & NETIF_F_LRO) {
> > > + if (adapter->netdev->features & NETIF_F_LRO &&
> > > + !adapter->xdp_enabled) {
> > > devRead->misc.uptFeatures |= UPT1_F_LRO;
> > > devRead->misc.maxNumRxSG = cpu_to_le16(1 + MAX_SKB_FRAGS);
> > > }
> > > @@ -3026,7 +3057,7 @@ vmxnet3_free_pci_resources(struct vmxnet3_adapter *adapter)
> > > }
> > >
> > >
> > > -static void
> > > +void
> > > vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
> > > {
> > > size_t sz, i, ring0_size, ring1_size, comp_size;
> > > @@ -3035,7 +3066,8 @@ vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter)
> > > if (adapter->netdev->mtu <= VMXNET3_MAX_SKB_BUF_SIZE -
> > > VMXNET3_MAX_ETH_HDR_SIZE) {
> > > adapter->skb_buf_size = adapter->netdev->mtu +
> > > - VMXNET3_MAX_ETH_HDR_SIZE;
> > > + VMXNET3_MAX_ETH_HDR_SIZE +
> > > + vmxnet3_xdp_headroom(adapter);
> > > if (adapter->skb_buf_size < VMXNET3_MIN_T0_BUF_SIZE)
> > > adapter->skb_buf_size = VMXNET3_MIN_T0_BUF_SIZE;
> > >
> > > @@ -3563,7 +3595,6 @@ vmxnet3_reset_work(struct work_struct *data)
> > > clear_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state);
> > > }
> > >
> > > -
> > > static int
> > > vmxnet3_probe_device(struct pci_dev *pdev,
> > > const struct pci_device_id *id)
> > > @@ -3585,6 +3616,8 @@ vmxnet3_probe_device(struct pci_dev *pdev,
> > > #ifdef CONFIG_NET_POLL_CONTROLLER
> > > .ndo_poll_controller = vmxnet3_netpoll,
> > > #endif
> > > + .ndo_bpf = vmxnet3_xdp,
> > > + .ndo_xdp_xmit = vmxnet3_xdp_xmit,
> > > };
> > > int err;
> > > u32 ver;
> > > @@ -3900,6 +3933,7 @@ vmxnet3_probe_device(struct pci_dev *pdev,
> > > goto err_register;
> > > }
> > >
> > > + adapter->xdp_enabled = false;
> > > vmxnet3_check_link(adapter, false);
> > > return 0;
> > >
> > > diff --git a/drivers/net/vmxnet3/vmxnet3_ethtool.c b/drivers/net/vmxnet3/vmxnet3_ethtool.c
> > > index 18cf7c723201..6f542236b26e 100644
> > > --- a/drivers/net/vmxnet3/vmxnet3_ethtool.c
> > > +++ b/drivers/net/vmxnet3/vmxnet3_ethtool.c
> > > @@ -76,6 +76,10 @@ vmxnet3_tq_driver_stats[] = {
> > > copy_skb_header) },
> > > { " giant hdr", offsetof(struct vmxnet3_tq_driver_stats,
> > > oversized_hdr) },
> > > + { " xdp xmit", offsetof(struct vmxnet3_tq_driver_stats,
> > > + xdp_xmit) },
> > > + { " xdp xmit err", offsetof(struct vmxnet3_tq_driver_stats,
> > > + xdp_xmit_err) },
> > > };
> > >
> > > /* per rq stats maintained by the device */
> > > @@ -106,6 +110,16 @@ vmxnet3_rq_driver_stats[] = {
> > > drop_fcs) },
> > > { " rx buf alloc fail", offsetof(struct vmxnet3_rq_driver_stats,
> > > rx_buf_alloc_failure) },
> > > + { " xdp packets", offsetof(struct vmxnet3_rq_driver_stats,
> > > + xdp_packets) },
> > > + { " xdp tx", offsetof(struct vmxnet3_rq_driver_stats,
> > > + xdp_tx) },
> > > + { " xdp redirects", offsetof(struct vmxnet3_rq_driver_stats,
> > > + xdp_redirects) },
> > > + { " xdp drops", offsetof(struct vmxnet3_rq_driver_stats,
> > > + xdp_drops) },
> > > + { " xdp aborted", offsetof(struct vmxnet3_rq_driver_stats,
> > > + xdp_aborted) },
> > > };
> > >
> > > /* global stats maintained by the driver */
> > > diff --git a/drivers/net/vmxnet3/vmxnet3_int.h b/drivers/net/vmxnet3/vmxnet3_int.h
> > > index 3367db23aa13..5cf4033930d8 100644
> > > --- a/drivers/net/vmxnet3/vmxnet3_int.h
> > > +++ b/drivers/net/vmxnet3/vmxnet3_int.h
> > > @@ -56,6 +56,8 @@
> > > #include <linux/if_arp.h>
> > > #include <linux/inetdevice.h>
> > > #include <linux/log2.h>
> > > +#include <linux/bpf.h>
> > > +#include <linux/skbuff.h>
> > >
> > > #include "vmxnet3_defs.h"
> > >
> > > @@ -217,6 +219,9 @@ struct vmxnet3_tq_driver_stats {
> > > u64 linearized; /* # of pkts linearized */
> > > u64 copy_skb_header; /* # of times we have to copy skb header */
> > > u64 oversized_hdr;
> > > +
> > > + u64 xdp_xmit;
> > > + u64 xdp_xmit_err;
> > > };
> > >
> > > struct vmxnet3_tx_ctx {
> > > @@ -285,6 +290,12 @@ struct vmxnet3_rq_driver_stats {
> > > u64 drop_err;
> > > u64 drop_fcs;
> > > u64 rx_buf_alloc_failure;
> > > +
> > > + u64 xdp_packets; /* Total packets processed by XDP. */
> > > + u64 xdp_tx;
> > > + u64 xdp_redirects;
> > > + u64 xdp_drops;
> > > + u64 xdp_aborted;
> > > };
> > >
> > > struct vmxnet3_rx_data_ring {
> > > @@ -307,6 +318,8 @@ struct vmxnet3_rx_queue {
> > > struct vmxnet3_rx_buf_info *buf_info[2];
> > > struct Vmxnet3_RxQueueCtrl *shared;
> > > struct vmxnet3_rq_driver_stats stats;
> > > + struct bpf_prog __rcu *xdp_bpf_prog;
> > > + struct xdp_rxq_info xdp_rxq;
> > > } __attribute__((__aligned__(SMP_CACHE_BYTES)));
> > >
> > > #define VMXNET3_DEVICE_MAX_TX_QUEUES 32
> > > @@ -415,6 +428,7 @@ struct vmxnet3_adapter {
> > > u16 tx_prod_offset;
> > > u16 rx_prod_offset;
> > > u16 rx_prod2_offset;
> > > + bool xdp_enabled;
> > > };
> > >
> > > #define VMXNET3_WRITE_BAR0_REG(adapter, reg, val) \
> > > @@ -490,6 +504,12 @@ vmxnet3_tq_destroy_all(struct vmxnet3_adapter *adapter);
> > > void
> > > vmxnet3_rq_destroy_all(struct vmxnet3_adapter *adapter);
> > >
> > > +int
> > > +vmxnet3_rq_create_all(struct vmxnet3_adapter *adapter);
> > > +
> > > +void
> > > +vmxnet3_adjust_rx_ring_size(struct vmxnet3_adapter *adapter);
> > > +
> > > netdev_features_t
> > > vmxnet3_fix_features(struct net_device *netdev, netdev_features_t features);
> > >
> > > diff --git a/drivers/net/vmxnet3/vmxnet3_xdp.c b/drivers/net/vmxnet3/vmxnet3_xdp.c
> > > new file mode 100644
> > > index 000000000000..afb2d43b5464
> > > --- /dev/null
> > > +++ b/drivers/net/vmxnet3/vmxnet3_xdp.c
> > > @@ -0,0 +1,458 @@
> > > +// SPDX-License-Identifier: GPL-2.0-or-later
> > > +/*
> > > + * Linux driver for VMware's vmxnet3 ethernet NIC.
> > > + * Copyright (C) 2008-2023, VMware, Inc. All Rights Reserved.
> > > + * Maintained by: pv-drivers@...are.com
> > > + *
> > > + */
> > > +
> > > +#include "vmxnet3_int.h"
> > > +#include "vmxnet3_xdp.h"
> > > +
> > > +static void
> > > +vmxnet3_xdp_exchange_program(struct vmxnet3_adapter *adapter,
> > > + struct bpf_prog *prog)
> > > +{
> > > + struct vmxnet3_rx_queue *rq;
> > > + int i;
> > > +
> > > + for (i = 0; i < adapter->num_rx_queues; i++) {
> > > + rq = &adapter->rx_queue[i];
> > > + rcu_assign_pointer(rq->xdp_bpf_prog, prog);
> > > + }
> > > + if (prog)
> > > + adapter->xdp_enabled = true;
> > > + else
> > > + adapter->xdp_enabled = false;
> > > +}
> > > +
> > > +static int
> > > +vmxnet3_xdp_set(struct net_device *netdev, struct netdev_bpf *bpf,
> > > + struct netlink_ext_ack *extack)
> > > +{
> > > + struct vmxnet3_adapter *adapter = netdev_priv(netdev);
> > > + struct bpf_prog *new_bpf_prog = bpf->prog;
> > > + struct bpf_prog *old_bpf_prog;
> > > + bool need_update;
> > > + bool running;
> > > + int err = 0;
> > > +
> > > + if (new_bpf_prog && netdev->mtu > VMXNET3_XDP_MAX_MTU) {
> > > + NL_SET_ERR_MSG_MOD(extack, "MTU too large for XDP");
> > > + return -EOPNOTSUPP;
> > > + }
> > > +
> > > + old_bpf_prog = READ_ONCE(adapter->rx_queue[0].xdp_bpf_prog);
> > > + if (!new_bpf_prog && !old_bpf_prog) {
> > > + adapter->xdp_enabled = false;
> > > + return 0;
> > > + }
> > > + running = netif_running(netdev);
> > > + need_update = !!old_bpf_prog != !!new_bpf_prog;
> > > +
> > > + if (running && need_update)
> > > + vmxnet3_quiesce_dev(adapter);
> > > +
> > > + vmxnet3_xdp_exchange_program(adapter, new_bpf_prog);
> > > + if (old_bpf_prog)
> > > + bpf_prog_put(old_bpf_prog);
> > > +
> > > + if (running && need_update) {
> > > + vmxnet3_reset_dev(adapter);
> > > + vmxnet3_rq_destroy_all(adapter);
> > > + vmxnet3_adjust_rx_ring_size(adapter);
> > > + err = vmxnet3_rq_create_all(adapter);
> > > + if (err) {
> > > + NL_SET_ERR_MSG_MOD(extack,
> > > + "failed to re-create rx queues for XDP.");
> > > + err = -EOPNOTSUPP;
> > > + goto out;
> > > + }
> > > + err = vmxnet3_activate_dev(adapter);
> > > + if (err) {
> > > + NL_SET_ERR_MSG_MOD(extack,
> > > + "failed to activate device for XDP.");
> > > + err = -EOPNOTSUPP;
> > > + goto out;
> > > + }
> > > + clear_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state);
> > > + }
> > > +out:
> > > + return err;
> > > +}
> > > +
> > > +/* This is the main xdp call used by kernel to set/unset eBPF program. */
> > > +int
> > > +vmxnet3_xdp(struct net_device *netdev, struct netdev_bpf *bpf)
> > > +{
> > > + switch (bpf->command) {
> > > + case XDP_SETUP_PROG:
> > > + netdev_dbg(netdev, "XDP: set program to ");
> > > + return vmxnet3_xdp_set(netdev, bpf, bpf->extack);
> > > + default:
> > > + return -EINVAL;
> > > + }
> > > + return 0;
> > > +}
> > > +
> > > +int
> > > +vmxnet3_xdp_headroom(struct vmxnet3_adapter *adapter)
> > > +{
> > > + if (adapter->xdp_enabled)
> > > + return VMXNET3_XDP_ROOM;
> > > + else
> > > + return 0;
> > > +}
> > > +
> > > +void
> > > +vmxnet3_unregister_xdp_rxq(struct vmxnet3_rx_queue *rq)
> > > +{
> > > + xdp_rxq_info_unreg_mem_model(&rq->xdp_rxq);
> > > + xdp_rxq_info_unreg(&rq->xdp_rxq);
> > > +}
> > > +
> > > +int
> > > +vmxnet3_register_xdp_rxq(struct vmxnet3_rx_queue *rq,
> > > + struct vmxnet3_adapter *adapter)
> > > +{
> > > + int err;
> > > +
> > > + err = xdp_rxq_info_reg(&rq->xdp_rxq, adapter->netdev, rq->qid, 0);
> > > + if (err < 0)
> > > + return err;
> > > +
> > > + err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq, MEM_TYPE_PAGE_SHARED,
> > > + NULL);
> > > + if (err < 0) {
> > > + xdp_rxq_info_unreg(&rq->xdp_rxq);
> > > + return err;
> > > + }
> > > + return 0;
> > > +}
> > > +
> > > +static int
> > > +vmxnet3_xdp_xmit_frame(struct vmxnet3_adapter *adapter,
> > > + struct xdp_frame *xdpf,
> > > + struct vmxnet3_tx_queue *tq)
> > > +{
> > > + struct vmxnet3_tx_buf_info *tbi = NULL;
> > > + union Vmxnet3_GenericDesc *gdesc;
> > > + struct vmxnet3_tx_ctx ctx;
> > > + int tx_num_deferred;
> > > + struct sk_buff *skb;
> > > + u32 buf_size;
> > > + int ret = 0;
> > > + u32 dw2;
> > > +
> > > + if (vmxnet3_cmd_ring_desc_avail(&tq->tx_ring) == 0) {
> > > + tq->stats.tx_ring_full++;
> > > + ret = -ENOMEM;
> > > + goto exit;
> > > + }
> > > +
> > > + skb = __netdev_alloc_skb(adapter->netdev, xdpf->len, GFP_KERNEL);
> > > + if (unlikely(!skb))
> > > + goto exit;
> > > +
> >
> > Any reason for this not to return an error. Seems like this might be
> > better off being the ENOMEM w/ the no descriptor case being something
> > like ENOSPC.
>
> my mistake, will fix it, thanks!
>
> >
> > Also at some point you may want to look at supporting handling page
> > fragments or XDP frames as allocating an skb for an XDP transmit can be
> > expensive.
>
> OK! I haven't thought about that yet, will keep it in mind.
>
> >
> > > + memcpy(skb->data, xdpf->data, xdpf->len);
> > > + dw2 = (tq->tx_ring.gen ^ 0x1) << VMXNET3_TXD_GEN_SHIFT;
> > > + dw2 |= xdpf->len;
> > > + ctx.sop_txd = tq->tx_ring.base + tq->tx_ring.next2fill;
> > > + gdesc = ctx.sop_txd;
> > > +
> > > + buf_size = xdpf->len;
> > > + tbi = tq->buf_info + tq->tx_ring.next2fill;
> > > + tbi->map_type = VMXNET3_MAP_SINGLE;
> > > + tbi->dma_addr = dma_map_single(&adapter->pdev->dev,
> > > + skb->data, buf_size,
> > > + DMA_TO_DEVICE);
> > > + if (dma_mapping_error(&adapter->pdev->dev, tbi->dma_addr)) {
> > > + ret = -EFAULT;
> > > + goto exit;
> > > + }
> >
> > Don't you need to free the skb here?
>
> Yes, will fix it.
> >
> > > + tbi->len = buf_size;
> > > +
> > > + gdesc = tq->tx_ring.base + tq->tx_ring.next2fill;
> > > + WARN_ON_ONCE(gdesc->txd.gen == tq->tx_ring.gen);
> > > +
> > > + gdesc->txd.addr = cpu_to_le64(tbi->dma_addr);
> > > + gdesc->dword[2] = cpu_to_le32(dw2);
> > > +
> > > + /* Setup the EOP desc */
> > > + gdesc->dword[3] = cpu_to_le32(VMXNET3_TXD_CQ | VMXNET3_TXD_EOP);
> > > +
> > > + gdesc->txd.om = 0;
> > > + gdesc->txd.msscof = 0;
> > > + gdesc->txd.hlen = 0;
> > > + gdesc->txd.ti = 0;
> > > +
> > > + tx_num_deferred = le32_to_cpu(tq->shared->txNumDeferred);
> > > + tq->shared->txNumDeferred += 1;
> > > + tx_num_deferred++;
> > > +
> > > + vmxnet3_cmd_ring_adv_next2fill(&tq->tx_ring);
> > > +
> > > + /* set the last buf_info for the pkt */
> > > + tbi->skb = skb;
> > > + tbi->sop_idx = ctx.sop_txd - tq->tx_ring.base;
> > > +
> > > + dma_wmb();
> > > + gdesc->dword[2] = cpu_to_le32(le32_to_cpu(gdesc->dword[2]) ^
> > > + VMXNET3_TXD_GEN);
> > > + if (tx_num_deferred >= le32_to_cpu(tq->shared->txThreshold)) {
> > > + tq->shared->txNumDeferred = 0;
> > > + VMXNET3_WRITE_BAR0_REG(adapter,
> > > + VMXNET3_REG_TXPROD + tq->qid * 8,
> > > + tq->tx_ring.next2fill);
> > > + }
> > > +exit:
> > > + return ret;
> > > +}
> > > +
> > > +static int
> > > +vmxnet3_xdp_xmit_back(struct vmxnet3_adapter *adapter,
> > > + struct xdp_frame *xdpf)
> > > +{
> > > + struct vmxnet3_tx_queue *tq;
> > > + struct netdev_queue *nq;
> > > + int err = 0, cpu;
> > > + int tq_number;
> > > +
> > > + tq_number = adapter->num_tx_queues;
> > > + cpu = smp_processor_id();
> > > + tq = &adapter->tx_queue[cpu % tq_number];
> > > + if (tq->stopped)
> > > + return -ENETDOWN;
> > > +
> > > + nq = netdev_get_tx_queue(adapter->netdev, tq->qid);
> > > +
> > > + __netif_tx_lock(nq, cpu);
> > > + err = vmxnet3_xdp_xmit_frame(adapter, xdpf, tq);
> > > + if (err)
> > > + goto exit;
> > > +
> > > +exit:
> >
> > What is the point of the label/goto? I don't see another spot that
> > references it.
> Thanks, will fix it!
>
> >
> > > + __netif_tx_unlock(nq);
> > > + return err;
> > > +}
> > > +
> > > +int
> > > +vmxnet3_xdp_xmit(struct net_device *dev,
> > > + int n, struct xdp_frame **frames, u32 flags)
> > > +{
> > > + struct vmxnet3_adapter *adapter;
> > > + struct vmxnet3_tx_queue *tq;
> > > + struct netdev_queue *nq;
> > > + int i, err, cpu;
> > > + int nxmit = 0;
> > > + int tq_number;
> > > +
> > > + adapter = netdev_priv(dev);
> > > +
> > > + if (unlikely(test_bit(VMXNET3_STATE_BIT_QUIESCED, &adapter->state)))
> > > + return -ENETDOWN;
> > > + if (unlikely(test_bit(VMXNET3_STATE_BIT_RESETTING, &adapter->state)))
> > > + return -EINVAL;
> > > +
> > > + tq_number = adapter->num_tx_queues;
> > > + cpu = smp_processor_id();
> > > + tq = &adapter->tx_queue[cpu % tq_number];
> > > + if (tq->stopped)
> > > + return -ENETDOWN;
> > > +
> > > + nq = netdev_get_tx_queue(adapter->netdev, tq->qid);
> > > +
> > > + __netif_tx_lock(nq, cpu);
> > > + for (i = 0; i < n; i++) {
> > > + err = vmxnet3_xdp_xmit_frame(adapter, frames[i], tq);
> > > + if (err) {
> > > + tq->stats.xdp_xmit_err++;
> > > + break;
> > > + }
> > > + nxmit++;
> > > + }
> > > +
> > > + tq->stats.xdp_xmit += nxmit;
> > > + __netif_tx_unlock(nq);
> > > +
> >
> > Are you doing anything to free the frames after you transmit them? If I
> > am not mistaken you are just copying them over into skbs aren't you, so
> > what is freeing the frames after that?
>
> The frames will be free at vmxnet3_tq_cleanup() at dev_kfree_skb_any(tbi->skb);
> Because at the vmxnet3_xdp_xmit_frame the allocated skb is saved at tbi->skb,
> so it can be freed at tq cleanup.
>
> >
> > > + return nxmit;
> > > +}
> > > +
> > > +static int
> > > +__vmxnet3_run_xdp(struct vmxnet3_rx_queue *rq, void *data, int data_len,
> > > + int headroom, int frame_sz, bool *need_xdp_flush,
> > > + struct sk_buff *skb)
> > > +{
> > > + struct xdp_frame *xdpf;
> > > + void *buf_hard_start;
> > > + struct xdp_buff xdp;
> > > + struct page *page;
> > > + void *orig_data;
> > > + int err, delta;
> > > + int delta_len;
> > > + u32 act;
> > > +
> > > + buf_hard_start = data;
> > > + xdp_init_buff(&xdp, frame_sz, &rq->xdp_rxq);
> > > + xdp_prepare_buff(&xdp, buf_hard_start, headroom, data_len, true);
> > > + orig_data = xdp.data;
> > > +
> > > + act = bpf_prog_run_xdp(rq->xdp_bpf_prog, &xdp);
> > > + rq->stats.xdp_packets++;
> > > +
> > > + switch (act) {
> > > + case XDP_DROP:
> > > + rq->stats.xdp_drops++;
> > > + break;
> > > + case XDP_PASS:
> > > + /* bpf prog might change len and data position.
> > > + * dataring does not use skb so not support this.
> > > + */
> > > + delta = xdp.data - orig_data;
> > > + delta_len = (xdp.data_end - xdp.data) - data_len;
> > > + if (skb) {
> > > + skb_reserve(skb, delta);
> > > + skb_put(skb, delta_len);
> > > + }
> > > + break;
> > > + case XDP_TX:
> > > + xdpf = xdp_convert_buff_to_frame(&xdp);
> > > + if (!xdpf ||
> > > + vmxnet3_xdp_xmit_back(rq->adapter, xdpf)) {
> > > + rq->stats.xdp_drops++;
> > > + } else {
> > > + rq->stats.xdp_tx++;
> > > + }
> > > + break;
> > > + case XDP_ABORTED:
> > > + trace_xdp_exception(rq->adapter->netdev, rq->xdp_bpf_prog,
> > > + act);
> > > + rq->stats.xdp_aborted++;
> > > + break;
> > > + case XDP_REDIRECT:
> > > + page = alloc_page(GFP_ATOMIC);
> > > + if (!page) {
> > > + rq->stats.rx_buf_alloc_failure++;
> > > + return XDP_DROP;
> > > + }
> >
> > So I think I see the problem I had questions about here. If I am not
> > mistaken you are copying the buffer to this page, and then copying this
> > page to an skb right? I think you might be better off just writing off
> > the Tx/Redirect pages and letting them go through their respective
> > paths and just allocating new pages instead assuming these pages were
> > consumed.
>
> I'm not sure I understand, can you elaborate?
>
> For XDP_TX, I'm doing 1 extra copy, copying to the newly allocated skb
> in vmxnet3_xdp_xmit_back.
> For XDP_REDIREC, I allocate a page and copy to the page and call
> xdp_do_redirect, there is no copying to skb again. If I don't allocate a
> new page, it always crashes at
> [62020.425932] BUG: Bad page state in process cpumap/0/map:29 pfn:107548
> [62020.440905] kernel BUG at include/linux/mm.h:757!
> VM_BUG_ON_PAGE(page_ref_count(page) == 0, page);
just to provide more detail about the crash.
It happens when I run
./samples/bpf/xdp_redirect_cpu -d ens160 -c 0 -e drop
with the following stack
[62020.473606] RIP: 0010:page_frag_free+0xb6/0x140
[62020.572992] __xdp_return+0xad/0x260
[62020.577546] xdp_return_frame+0x2c/0xa0
[62020.582190] cpu_map_bpf_prog_run_xdp.isra.0+0xa3/0x320
Thanks
William
Powered by blists - more mailing lists