Message-ID: <CALDO+SZ9q9kw=rydJos-6P9Ny1P6TeUVX=tWQGS0OXj7o+FNaw@mail.gmail.com>
Date: Sat, 5 Jan 2019 07:55:25 -0800
From: William Tu <u9012063@...il.com>
To: Toshiaki Makita <toshiaki.makita1@...il.com>
Cc: Björn Töpel <bjorn.topel@...il.com>,
Magnus Karlsson <magnus.karlsson@...il.com>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Linux Kernel Network Developers <netdev@...r.kernel.org>,
makita.toshiaki@....ntt.co.jp, Yi-Hung Wei <yihung.wei@...il.com>,
"Karlsson, Magnus" <magnus.karlsson@...el.com>
Subject: Re: [PATCH bpf-next RFCv3 2/6] veth: support AF_XDP TX copy-mode.
Hi Toshiaki,
Thanks a lot for the feedback.
On Tue, Jan 1, 2019 at 5:44 AM Toshiaki Makita
<toshiaki.makita1@...il.com> wrote:
>
> Hi, William. Nice work.
> I have some feedback and questions.
>
> > + while (peer_rq->xsk_umem && budget--) {
> > + unsigned int inner_xdp_xmit = 0;
> > + unsigned int metasize = 0;
> > + struct xdp_frame *xdpf;
> > + bool dropped = false;
> > + struct sk_buff *skb;
> > + struct page *page;
> > + void *vaddr;
> > + void *addr;
> > + u32 len;
> > +
> > + if (!xsk_umem_consume_tx_virtual(peer_rq->xsk_umem, &vaddr, &len))
> > + break;
>
> How do you prevent races around xsk_umem?
> It seems you are checking xsk_umem in the above while() condition, but
> there is no guarantee that xsk_umem is still non-NULL here, since the
> umem can be disabled under us?
You're right. Looking at similar code (i40e_xsk_umem_enable/disable),
when setting xsk_umem to NULL (in veth_xsk_umem_disable) I should first
disable NAPI, then set it to NULL.
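Something like the below (untested sketch on top of this RFC; the
napi_disable()/napi_enable() pairing is my assumption, mirroring how
i40e quiesces the queue pair before tearing the umem down):

    static int veth_xsk_umem_disable(struct net_device *dev, u16 qid)
    {
            struct veth_priv *priv = netdev_priv(dev);
            struct xdp_umem *umem;

            umem = xdp_get_umem_from_qid(dev, qid);
            if (!umem)
                    return -EINVAL;

            /* Quiesce the NAPI handler before clearing the pointer so
             * veth_xsk_poll() can no longer dereference a stale umem.
             */
            napi_disable(&priv->rq[qid].xdp_napi);
            priv->rq[qid].xsk_umem = NULL;
            napi_enable(&priv->rq[qid].xdp_napi);

            return 0;
    }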
>
> > +
> > + page = dev_alloc_page();
> > + if (!page) {
> > + xsk_umem_complete_tx(peer_rq->xsk_umem, 1);
> > + xsk_umem_consume_tx_done(peer_rq->xsk_umem);
> > + return -ENOMEM;
> > + }
> > +
> > + addr = page_to_virt(page);
> > + xdpf = addr;
> > + memset(xdpf, 0, sizeof(*xdpf));
> > +
> > + addr += sizeof(*xdpf);
> > + memcpy(addr, vaddr, len);
> > +
> > + xdpf->data = addr + metasize;
> > + xdpf->len = len;
> > + xdpf->headroom = 0;
> > + xdpf->metasize = metasize;
> > + xdpf->mem.type = MEM_TYPE_PAGE_SHARED;
> > +
> > + /* put into rq */
> > + skb = veth_xdp_rcv_one(rq, xdpf, &inner_xdp_xmit);
> > + if (!skb) {
> > + /* Peer side has XDP program attached */
> > + if (inner_xdp_xmit & VETH_XDP_TX) {
> > + /* Not supported */
> > + pr_warn("veth: peer XDP_TX not supported\n");
>
> As this can be triggered by users we need rate limiting at least.
How do I rate limit here? Can I slow down the NAPI poll or reduce the budget?
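Or, if only the message itself needs throttling rather than the
datapath, I guess the kernel's ratelimited printk helper would be
enough, e.g.:

    /* assuming we only throttle the warning, not the datapath */
    pr_warn_ratelimited("veth: peer XDP_TX not supported\n");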
>
> But since this is envisioned to be used in OVS, XDP_TX would be a very
> important feature to me. I expect XDP programs in containers to process
> packets and send them back to OVS.
It's a little tricky here: the receiving veth pulls packets sent from
its peer side, but with XDP_TX it would have to put the packet back onto
the peer side to be received. But I can see the use case you mentioned.
Let me think about how to implement it.
>
> > + xdp_return_frame(xdpf);
> > + dropped = true;
> > + goto skip_tx;
> > + } else if (inner_xdp_xmit & VETH_XDP_REDIR) {
> > + xdp_do_flush_map();
> > + } else {
> > + dropped = true;
> > + }
> > + } else {
> > + napi_gro_receive(&rq->xdp_napi, skb);
> > + }
> > +skip_tx:
> > + xsk_umem_complete_tx(peer_rq->xsk_umem, 1);
> > + xsk_umem_consume_tx_done(peer_rq->xsk_umem);
> > +
> > + /* update rq stats */
> > + u64_stats_update_begin(&rq->stats.syncp);
> > + rq->stats.xdp_packets++;
> > + rq->stats.xdp_bytes += len;
> > + if (dropped)
> > + rq->stats.xdp_drops++;
> > + u64_stats_update_end(&rq->stats.syncp);
> > + done++;
> > + }
> > + return done;
> > +}
> > +
> > static int veth_poll(struct napi_struct *napi, int budget)
> > {
> > struct veth_rq *rq =
> > container_of(napi, struct veth_rq, xdp_napi);
> > unsigned int xdp_xmit = 0;
> > + int tx_done;
> > int done;
> >
> > xdp_set_return_frame_no_direct();
> > @@ -756,13 +846,17 @@ static int veth_poll(struct napi_struct *napi, int budget)
> > }
> > }
> >
> > + tx_done = veth_xsk_poll(napi, budget);
> > + if (tx_done > 0)
> > + done += tx_done;
> > +
>
> This can receive more packets than the budget allows.
>
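Right, one option might be to let veth_xsk_poll() consume only whatever
budget is left after the regular RX path, something like (untested
sketch):

    /* in veth_poll(), after veth_xdp_rcv() has consumed 'done' */
    tx_done = veth_xsk_poll(napi, budget - done);
    if (tx_done > 0)
            done += tx_done;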
> > if (xdp_xmit & VETH_XDP_TX)
> > veth_xdp_flush(rq->dev);
> > if (xdp_xmit & VETH_XDP_REDIR)
> > xdp_do_flush_map();
> > xdp_clear_return_frame_no_direct();
> >
> > - return done;
> > + return done > budget ? budget : done;
> > }
> >
> > static int veth_napi_add(struct net_device *dev)
> > @@ -776,6 +870,7 @@ static int veth_napi_add(struct net_device *dev)
> > err = ptr_ring_init(&rq->xdp_ring, VETH_RING_SIZE, GFP_KERNEL);
> > if (err)
> > goto err_xdp_ring;
> > + rq->qid = i;
> > }
> >
> > for (i = 0; i < dev->real_num_rx_queues; i++) {
> > @@ -812,6 +907,7 @@ static void veth_napi_del(struct net_device *dev)
> > netif_napi_del(&rq->xdp_napi);
> > rq->rx_notify_masked = false;
> > ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
> > + rq->qid = -1;
> > }
> > }
> >
> > @@ -836,6 +932,7 @@ static int veth_enable_xdp(struct net_device *dev)
> >
> > /* Save original mem info as it can be overwritten */
> > rq->xdp_mem = rq->xdp_rxq.mem;
> > + rq->qid = i;
> > }
> >
> > err = veth_napi_add(dev);
> > @@ -1115,6 +1212,84 @@ static u32 veth_xdp_query(struct net_device *dev)
> > return 0;
> > }
> >
> > +int veth_xsk_umem_query(struct net_device *dev, struct xdp_umem **umem,
> > + u16 qid)
> > +{
> > + struct xdp_umem *queried_umem;
> > +
> > + queried_umem = xdp_get_umem_from_qid(dev, qid);
> > +
> > + if (!queried_umem)
> > + return -EINVAL;
> > +
> > + *umem = queried_umem;
> > + return 0;
> > +}
> > +
> > +static int veth_xsk_umem_enable(struct net_device *dev,
> > + struct xdp_umem *umem,
> > + u16 qid)
> > +{
> > + struct veth_priv *priv = netdev_priv(dev);
> > + struct xdp_umem_fq_reuse *reuseq;
> > + int err = 0;
> > +
> > + if (qid >= dev->real_num_rx_queues)
> > + return -EINVAL;
> > +
> > + reuseq = xsk_reuseq_prepare(priv->rq[0].xdp_ring.size);
> > + if (!reuseq)
> > + return -ENOMEM;
> > +
> > + xsk_reuseq_free(xsk_reuseq_swap(umem, reuseq));
> > +
> > + priv->rq[qid].xsk_umem = umem;
> > +
> > + return err;
> > +}
> > +
> > +static int veth_xsk_umem_disable(struct net_device *dev,
> > + u16 qid)
> > +{
> > + struct veth_priv *priv = netdev_priv(dev);
> > + struct xdp_umem *umem;
> > +
> > + umem = xdp_get_umem_from_qid(dev, qid);
> > + if (!umem)
> > + return -EINVAL;
> > +
> > + priv->rq[qid].xsk_umem = NULL;
> > + return 0;
> > +}
> > +
> > +int veth_xsk_umem_setup(struct net_device *dev, struct xdp_umem *umem,
> > + u16 qid)
> > +{
> > + return umem ? veth_xsk_umem_enable(dev, umem, qid) :
> > + veth_xsk_umem_disable(dev, qid);
> > +}
> > +
> > +int veth_xsk_async_xmit(struct net_device *dev, u32 qid)
> > +{
> > + struct veth_priv *priv, *peer_priv;
> > + struct net_device *peer_dev;
> > + struct veth_rq *peer_rq;
> > +
> > + priv = netdev_priv(dev);
> > + peer_dev = priv->peer;
> > + peer_priv = netdev_priv(peer_dev);
> > + peer_rq = &peer_priv->rq[qid];
> > +
> > + if (qid >= dev->real_num_rx_queues)
> > + return -ENXIO;
> > +
> > + /* Schedule the peer side NAPI to receive */
> > + if (!napi_if_scheduled_mark_missed(&peer_rq->xdp_napi))
> > + napi_schedule(&peer_rq->xdp_napi);
> > +
> > + return 0;
> > +}
> > +
> > static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> > {
> > switch (xdp->command) {
> > @@ -1123,6 +1298,28 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> > case XDP_QUERY_PROG:
> > xdp->prog_id = veth_xdp_query(dev);
> > return 0;
> > + case XDP_QUERY_XSK_UMEM:
> > + return veth_xsk_umem_query(dev, &xdp->xsk.umem,
> > + xdp->xsk.queue_id);
> > + case XDP_SETUP_XSK_UMEM: {
> > + struct veth_priv *priv;
> > + int err;
> > +
> > + /* Enable NAPI on both sides, by enabling
> > + * their XDP.
> > + */
> > + err = veth_enable_xdp(dev);
>
> Looks like there is no need to enable XDP on this side. You only use
> peer's NAPI, right?
Yes, thanks. The peer's NAPI is doing the work.
>
> > + if (err)
> > + return err;
> > +
> > + priv = netdev_priv(dev);
> > + err = veth_enable_xdp(priv->peer);
>
> Enabling NAPI here and never disabling it?
XDP_SETUP_XSK_UMEM here can be either an enable or a disable, so I
should also disable NAPI when receiving the umem disable command.
Will fix it in the next version.
> Also, what happens if the peer disables XDP later by detaching an XDP
> program? Probably you need something like refcounting.
OK, so the refcounting will keep NAPI running when the XDP program is
detached.
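Maybe something like this (hypothetical sketch; the napi_users field
does not exist yet and locking is omitted):

    /* hypothetical field in struct veth_priv: int napi_users; */
    static int veth_napi_get(struct net_device *dev)
    {
            struct veth_priv *priv = netdev_priv(dev);

            /* First user (XDP program or xsk umem) brings NAPI up. */
            if (priv->napi_users++ == 0)
                    return veth_napi_add(dev);
            return 0;
    }

    static void veth_napi_put(struct net_device *dev)
    {
            struct veth_priv *priv = netdev_priv(dev);

            /* Last user tears NAPI down. */
            if (--priv->napi_users == 0)
                    veth_napi_del(dev);
    }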
>
>
> BTW I'm not 100% sure if the current way of accessing the peer rq in
> the NAPI handler is safe, although I did not find an obvious problem.
>
> Looking into physical NIC drivers, ndo_xsk_async_xmit() is expected to
> kick the TX softirq? This should be something like the below process in
> veth.
Right. In the physical NIC drivers (i40e and ixgbe), async_xmit simply
kicks the TX softirq. For veth we do something a little different: we
kick the peer side's RX softirq.
>
> Say you attach an xsk to veth A, and B is the peer of A.
>
> 1. async_xmit kicks A's NAPI
> 2. A's NAPI drains xsk tx_ring and pushes frames into B's xdp_ring, then
> kicks B's NAPI
> 3. B's NAPI drains xdp_ring and processes xdp_frames in the same way
> as normal xdp_frames.
>
> What you are currently doing seems to be skipping 2 and making B's
> NAPI directly drain the tx_ring of the xsk bound to A.
Right.
>
> I don't have a particular opinion about which is better. Probably your
> approach is more efficient performance-wise? If later you find some
> race with your approach, please consider the more physical-NIC-like
> approach I described above.
Yes, I thought that skipping the push/drain of frames into/out of the
xdp_ring might give better performance.
I will work on the next version. Thanks!
William
>
> > + if (err)
> > + return err;
> > +
> > + return veth_xsk_umem_setup(dev, xdp->xsk.umem,
> > + xdp->xsk.queue_id);
> > + }
> > default:
> > return -EINVAL;
> > }
> > @@ -1145,6 +1342,7 @@ static const struct net_device_ops veth_netdev_ops = {
> > .ndo_set_rx_headroom = veth_set_rx_headroom,
> > .ndo_bpf = veth_xdp,
> > .ndo_xdp_xmit = veth_xdp_xmit,
> > + .ndo_xsk_async_xmit = veth_xsk_async_xmit,
> > };
> >
> > #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
> >