Message-ID: <CALDO+SZ9q9kw=rydJos-6P9Ny1P6TeUVX=tWQGS0OXj7o+FNaw@mail.gmail.com>
Date: Sat, 5 Jan 2019 07:55:25 -0800
From: William Tu <u9012063@...il.com>
To: Toshiaki Makita <toshiaki.makita1@...il.com>
Cc: Björn Töpel <bjorn.topel@...il.com>,
Magnus Karlsson <magnus.karlsson@...il.com>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Linux Kernel Network Developers <netdev@...r.kernel.org>,
makita.toshiaki@....ntt.co.jp, Yi-Hung Wei <yihung.wei@...il.com>,
"Karlsson, Magnus" <magnus.karlsson@...el.com>
Subject: Re: [PATCH bpf-next RFCv3 2/6] veth: support AF_XDP TX copy-mode.
Hi Toshiaki,
Thanks a lot for the feedback.
On Tue, Jan 1, 2019 at 5:44 AM Toshiaki Makita
<toshiaki.makita1@...il.com> wrote:
>
> Hi, William. Nice work.
> I have some feedback and questions.
>
> > + while (peer_rq->xsk_umem && budget--) {
> > + unsigned int inner_xdp_xmit = 0;
> > + unsigned int metasize = 0;
> > + struct xdp_frame *xdpf;
> > + bool dropped = false;
> > + struct sk_buff *skb;
> > + struct page *page;
> > + void *vaddr;
> > + void *addr;
> > + u32 len;
> > +
> > + if (!xsk_umem_consume_tx_virtual(peer_rq->xsk_umem, &vaddr, &len))
> > + break;
>
> How do you prevent races around xsk_umem?
> It seems you are checking xsk_umem in the above while() condition, but
> there is no guarantee that xsk_umem is still non-NULL here, since the
> umem can be disabled under us?
You're right. Looking at similar code (i40e_xsk_umem_enable/disable),
when setting xsk_umem to NULL (in veth_xsk_umem_disable) I should first
disable NAPI, then set it to NULL.
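Something like the below (untested sketch on top of this RFC; the
napi_disable()/napi_enable() pairing is my assumption, mirroring how
i40e quiesces the queue pair before tearing the umem down):

    static int veth_xsk_umem_disable(struct net_device *dev, u16 qid)
    {
            struct veth_priv *priv = netdev_priv(dev);
            struct xdp_umem *umem;

            umem = xdp_get_umem_from_qid(dev, qid);
            if (!umem)
                    return -EINVAL;

            /* Quiesce the NAPI handler before clearing the pointer so
             * veth_xsk_poll() can no longer dereference a stale umem.
             */
            napi_disable(&priv->rq[qid].xdp_napi);
            priv->rq[qid].xsk_umem = NULL;
            napi_enable(&priv->rq[qid].xdp_napi);

            return 0;
    }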
>
> > +
> > + page = dev_alloc_page();
> > + if (!page) {
> > + xsk_umem_complete_tx(peer_rq->xsk_umem, 1);
> > + xsk_umem_consume_tx_done(peer_rq->xsk_umem);
> > + return -ENOMEM;
> > + }
> > +
> > + addr = page_to_virt(page);
> > + xdpf = addr;
> > + memset(xdpf, 0, sizeof(*xdpf));
> > +
> > + addr += sizeof(*xdpf);
> > + memcpy(addr, vaddr, len);
> > +
> > + xdpf->data = addr + metasize;
> > + xdpf->len = len;
> > + xdpf->headroom = 0;
> > + xdpf->metasize = metasize;
> > + xdpf->mem.type = MEM_TYPE_PAGE_SHARED;
> > +
> > + /* put into rq */
> > + skb = veth_xdp_rcv_one(rq, xdpf, &inner_xdp_xmit);
> > + if (!skb) {
> > + /* Peer side has XDP program attached */
> > + if (inner_xdp_xmit & VETH_XDP_TX) {
> > + /* Not supported */
> > + pr_warn("veth: peer XDP_TX not supported\n");
>
> As this can be triggered by users we need rate limiting at least.
How do I rate limit here? Can I slow down the NAPI poll or reduce the budget?
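Or, if only the message itself needs throttling rather than the
datapath, I guess the kernel's ratelimited printk helper would be
enough, e.g.:

    /* assuming we only throttle the warning, not the datapath */
    pr_warn_ratelimited("veth: peer XDP_TX not supported\n");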
>
> But since this is envisioned to be used in OVS, XDP_TX would be a very
> important feature to me. I expect XDP programs in containers to process
> packets and send them back to OVS.
It's a little tricky here: the receiving veth pulls packets sent from
its peer side, but with XDP_TX it would have to put the packet back onto
the peer side to be received. But I can see the use case you mentioned.
Let me think about how to implement it.
>
> > + xdp_return_frame(xdpf);
> > + dropped = true;
> > + goto skip_tx;
> > + } else if (inner_xdp_xmit & VETH_XDP_REDIR) {
> > + xdp_do_flush_map();
> > + } else {
> > + dropped = true;
> > + }
> > + } else {
> > + napi_gro_receive(&rq->xdp_napi, skb);
> > + }
> > +skip_tx:
> > + xsk_umem_complete_tx(peer_rq->xsk_umem, 1);
> > + xsk_umem_consume_tx_done(peer_rq->xsk_umem);
> > +
> > + /* update rq stats */
> > + u64_stats_update_begin(&rq->stats.syncp);
> > + rq->stats.xdp_packets++;
> > + rq->stats.xdp_bytes += len;
> > + if (dropped)
> > + rq->stats.xdp_drops++;
> > + u64_stats_update_end(&rq->stats.syncp);
> > + done++;
> > + }
> > + return done;
> > +}
> > +
> > static int veth_poll(struct napi_struct *napi, int budget)
> > {
> > struct veth_rq *rq =
> > container_of(napi, struct veth_rq, xdp_napi);
> > unsigned int xdp_xmit = 0;
> > + int tx_done;
> > int done;
> >
> > xdp_set_return_frame_no_direct();
> > @@ -756,13 +846,17 @@ static int veth_poll(struct napi_struct *napi, int budget)
> > }
> > }
> >
> > + tx_done = veth_xsk_poll(napi, budget);
> > + if (tx_done > 0)
> > + done += tx_done;
> > +
>
> This can receive more packets than the budget allows.
>
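Right, one option might be to let veth_xsk_poll() consume only whatever
budget is left after the regular RX path, something like (untested
sketch):

    /* in veth_poll(), after veth_xdp_rcv() has consumed 'done' */
    tx_done = veth_xsk_poll(napi, budget - done);
    if (tx_done > 0)
            done += tx_done;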
> > if (xdp_xmit & VETH_XDP_TX)
> > veth_xdp_flush(rq->dev);
> > if (xdp_xmit & VETH_XDP_REDIR)
> > xdp_do_flush_map();
> > xdp_clear_return_frame_no_direct();
> >
> > - return done;
> > + return done > budget ? budget : done;
> > }
> >
> > static int veth_napi_add(struct net_device *dev)
> > @@ -776,6 +870,7 @@ static int veth_napi_add(struct net_device *dev)
> > err = ptr_ring_init(&rq->xdp_ring, VETH_RING_SIZE, GFP_KERNEL);
> > if (err)
> > goto err_xdp_ring;
> > + rq->qid = i;
> > }
> >
> > for (i = 0; i < dev->real_num_rx_queues; i++) {
> > @@ -812,6 +907,7 @@ static void veth_napi_del(struct net_device *dev)
> > netif_napi_del(&rq->xdp_napi);
> > rq->rx_notify_masked = false;
> > ptr_ring_cleanup(&rq->xdp_ring, veth_ptr_free);
> > + rq->qid = -1;
> > }
> > }
> >
> > @@ -836,6 +932,7 @@ static int veth_enable_xdp(struct net_device *dev)
> >
> > /* Save original mem info as it can be overwritten */
> > rq->xdp_mem = rq->xdp_rxq.mem;
> > + rq->qid = i;
> > }
> >
> > err = veth_napi_add(dev);
> > @@ -1115,6 +1212,84 @@ static u32 veth_xdp_query(struct net_device *dev)
> > return 0;
> > }
> >
> > +int veth_xsk_umem_query(struct net_device *dev, struct xdp_umem **umem,
> > + u16 qid)
> > +{
> > + struct xdp_umem *queried_umem;
> > +
> > + queried_umem = xdp_get_umem_from_qid(dev, qid);
> > +
> > + if (!queried_umem)
> > + return -EINVAL;
> > +
> > + *umem = queried_umem;
> > + return 0;
> > +}
> > +
> > +static int veth_xsk_umem_enable(struct net_device *dev,
> > + struct xdp_umem *umem,
> > + u16 qid)
> > +{
> > + struct veth_priv *priv = netdev_priv(dev);
> > + struct xdp_umem_fq_reuse *reuseq;
> > + int err = 0;
> > +
> > + if (qid >= dev->real_num_rx_queues)
> > + return -EINVAL;
> > +
> > + reuseq = xsk_reuseq_prepare(priv->rq[0].xdp_ring.size);
> > + if (!reuseq)
> > + return -ENOMEM;
> > +
> > + xsk_reuseq_free(xsk_reuseq_swap(umem, reuseq));
> > +
> > + priv->rq[qid].xsk_umem = umem;
> > +
> > + return err;
> > +}
> > +
> > +static int veth_xsk_umem_disable(struct net_device *dev,
> > + u16 qid)
> > +{
> > + struct veth_priv *priv = netdev_priv(dev);
> > + struct xdp_umem *umem;
> > +
> > + umem = xdp_get_umem_from_qid(dev, qid);
> > + if (!umem)
> > + return -EINVAL;
> > +
> > + priv->rq[qid].xsk_umem = NULL;
> > + return 0;
> > +}
> > +
> > +int veth_xsk_umem_setup(struct net_device *dev, struct xdp_umem *umem,
> > + u16 qid)
> > +{
> > + return umem ? veth_xsk_umem_enable(dev, umem, qid) :
> > + veth_xsk_umem_disable(dev, qid);
> > +}
> > +
> > +int veth_xsk_async_xmit(struct net_device *dev, u32 qid)
> > +{
> > + struct veth_priv *priv, *peer_priv;
> > + struct net_device *peer_dev;
> > + struct veth_rq *peer_rq;
> > +
> > + priv = netdev_priv(dev);
> > + peer_dev = priv->peer;
> > + peer_priv = netdev_priv(peer_dev);
> > + peer_rq = &peer_priv->rq[qid];
> > +
> > + if (qid >= dev->real_num_rx_queues)
> > + return -ENXIO;
> > +
> > + /* Schedule the peer side NAPI to receive */
> > + if (!napi_if_scheduled_mark_missed(&peer_rq->xdp_napi))
> > + napi_schedule(&peer_rq->xdp_napi);
> > +
> > + return 0;
> > +}
> > +
> > static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> > {
> > switch (xdp->command) {
> > @@ -1123,6 +1298,28 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> > case XDP_QUERY_PROG:
> > xdp->prog_id = veth_xdp_query(dev);
> > return 0;
> > + case XDP_QUERY_XSK_UMEM:
> > + return veth_xsk_umem_query(dev, &xdp->xsk.umem,
> > + xdp->xsk.queue_id);
> > + case XDP_SETUP_XSK_UMEM: {
> > + struct veth_priv *priv;
> > + int err;
> > +
> > + /* Enable NAPI on both sides, by enabling
> > + * their XDP.
> > + */
> > + err = veth_enable_xdp(dev);
>
> Looks like there is no need to enable XDP on this side. You only use
> peer's NAPI, right?
Yes, thanks. The peer's NAPI is doing the work.
>
> > + if (err)
> > + return err;
> > +
> > + priv = netdev_priv(dev);
> > + err = veth_enable_xdp(priv->peer);
>
> Enabling NAPI here and never disabling it?
XDP_SETUP_XSK_UMEM here can be either an enable or a disable, so I
should also disable NAPI when receiving the umem disable command.
Will fix it in the next version.
> Also, what happens if the peer disables XDP later by detaching an XDP
> program? Probably you need something like refcounting.
OK, so the refcounting will keep NAPI running when the XDP program is
detached.
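Maybe something like this (hypothetical sketch; the napi_users field
does not exist yet and locking is omitted):

    /* hypothetical field in struct veth_priv: int napi_users; */
    static int veth_napi_get(struct net_device *dev)
    {
            struct veth_priv *priv = netdev_priv(dev);

            /* First user (XDP program or xsk umem) brings NAPI up. */
            if (priv->napi_users++ == 0)
                    return veth_napi_add(dev);
            return 0;
    }

    static void veth_napi_put(struct net_device *dev)
    {
            struct veth_priv *priv = netdev_priv(dev);

            /* Last user tears NAPI down. */
            if (--priv->napi_users == 0)
                    veth_napi_del(dev);
    }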
>
>
> BTW I'm not 100% sure if the current way of accessing the peer rq in
> the NAPI handler is safe, although I did not find an obvious problem.
>
> Looking into physical NIC drivers, ndo_xsk_async_xmit() is expected to
> kick the TX softirq? This should be something like the below process in
> veth.
Right. In the physical NIC drivers (i40e and ixgbe), async_xmit simply
kicks the TX softirq. For veth we do something a little different: we
kick the peer side's RX softirq.
>
> Say you attach an xsk to veth A, and B is the peer of A.
>
> 1. async_xmit kicks A's NAPI
> 2. A's NAPI drains xsk tx_ring and pushes frames into B's xdp_ring, then
> kicks B's NAPI
> 3. B's NAPI drains xdp_ring and processes xdp_frames in the same way
> as normal xdp_frames.
>
> What you are currently doing seems to be skipping 2 and making B's
> NAPI directly drain the tx_ring of the xsk bound to A.
Right.
>
> I don't have a particular opinion about which is better. Probably your
> approach is more efficient performance-wise? If later you find some
> race with your approach, please consider the more physical-NIC-like
> approach I described above.
Yes, I thought that skipping the push/drain of frames into/out of the
xdp_ring might give better performance.
I will work on the next version. Thanks!
William
>
> > + if (err)
> > + return err;
> > +
> > + return veth_xsk_umem_setup(dev, xdp->xsk.umem,
> > + xdp->xsk.queue_id);
> > + }
> > default:
> > return -EINVAL;
> > }
> > @@ -1145,6 +1342,7 @@ static const struct net_device_ops veth_netdev_ops = {
> > .ndo_set_rx_headroom = veth_set_rx_headroom,
> > .ndo_bpf = veth_xdp,
> > .ndo_xdp_xmit = veth_xdp_xmit,
> > + .ndo_xsk_async_xmit = veth_xsk_async_xmit,
> > };
> >
> > #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
> >