netdev - Re: [PATCH bpf-next 13/15] xsk: support for Tx

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAJ8uoz1crwU_no=dbQsZsjxjRX+iDVv0pxC6+pPhVzNkCxh-3A@mail.gmail.com>
Date:   Wed, 25 Apr 2018 11:11:42 +0200
From:   Magnus Karlsson <magnus.karlsson@...il.com>
To:     Willem de Bruijn <willemdebruijn.kernel@...il.com>
Cc:     Björn Töpel <bjorn.topel@...il.com>,
        "Karlsson, Magnus" <magnus.karlsson@...el.com>,
        Alexander Duyck <alexander.h.duyck@...el.com>,
        Alexander Duyck <alexander.duyck@...il.com>,
        John Fastabend <john.fastabend@...il.com>,
        Alexei Starovoitov <ast@...com>,
        Jesper Dangaard Brouer <brouer@...hat.com>,
        Daniel Borkmann <daniel@...earbox.net>,
        "Michael S. Tsirkin" <mst@...hat.com>,
        Network Development <netdev@...r.kernel.org>,
        michael.lundkvist@...csson.com,
        "Brandeburg, Jesse" <jesse.brandeburg@...el.com>,
        "Singhai, Anjali" <anjali.singhai@...el.com>,
        "Zhang, Qi Z" <qi.z.zhang@...el.com>
Subject: Re: [PATCH bpf-next 13/15] xsk: support for Tx

On Tue, Apr 24, 2018 at 6:57 PM, Willem de Bruijn
<willemdebruijn.kernel@...il.com> wrote:
> On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel <bjorn.topel@...il.com> wrote:
>> From: Magnus Karlsson <magnus.karlsson@...el.com>
>>
>> Here, Tx support is added. The user fills the Tx queue with frames to
>> be sent by the kernel, and let's the kernel know using the sendmsg
>> syscall.
>>
>> Signed-off-by: Magnus Karlsson <magnus.karlsson@...el.com>
>
>> +static int xsk_xmit_skb(struct sk_buff *skb)
>
> This is basically packet_direct_xmit. Might be better to just move that
> to net/core/dev.c and use in both AF_PACKET and AF_XDP.

It is packet_direct_xmit with some code removed that is not used :-),
so your suggestion makes a lot of sense. Will implement in this patch
set.

> Also, (eventually) AF_XDP may also want to support the regular path
> through dev_queue_xmit to go through traffic shaping.

Agreed. Will put this on the todo list for a later patch.

>> +{
>> +       struct net_device *dev = skb->dev;
>> +       struct sk_buff *orig_skb = skb;
>> +       struct netdev_queue *txq;
>> +       int ret = NETDEV_TX_BUSY;
>> +       bool again = false;
>> +
>> +       if (unlikely(!netif_running(dev) || !netif_carrier_ok(dev)))
>> +               goto drop;
>> +
>> +       skb = validate_xmit_skb_list(skb, dev, &again);
>> +       if (skb != orig_skb)
>> +               return NET_XMIT_DROP;
>
> Need to free generated segment list on error, see packet_direct_xmit.

I do not use segments in the TX code for reasons of simplicity and the
free is in the calling function. But as I will create a common
packet_direct_xmit according to your suggestion, it will have a
kfree_skb_list() there as in af_packet.c.

>> +
>> +       txq = skb_get_tx_queue(dev, skb);
>> +
>> +       local_bh_disable();
>> +
>> +       HARD_TX_LOCK(dev, txq, smp_processor_id());
>> +       if (!netif_xmit_frozen_or_drv_stopped(txq))
>> +               ret = netdev_start_xmit(skb, dev, txq, false);
>> +       HARD_TX_UNLOCK(dev, txq);
>> +
>> +       local_bh_enable();
>> +
>> +       if (!dev_xmit_complete(ret))
>> +               goto out_err;
>> +
>> +       return ret;
>> +drop:
>> +       atomic_long_inc(&dev->tx_dropped);
>> +out_err:
>> +       return NET_XMIT_DROP;
>> +}
>
>> +static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
>> +                           size_t total_len)
>> +{
>> +       bool need_wait = !(m->msg_flags & MSG_DONTWAIT);
>> +       u32 max_batch = TX_BATCH_SIZE;
>> +       struct xdp_sock *xs = xdp_sk(sk);
>> +       bool sent_frame = false;
>> +       struct xdp_desc desc;
>> +       struct sk_buff *skb;
>> +       int err = 0;
>> +
>> +       if (unlikely(!xs->tx))
>> +               return -ENOBUFS;
>> +       if (need_wait)
>> +               return -EOPNOTSUPP;
>> +
>> +       mutex_lock(&xs->mutex);
>> +
>> +       while (xskq_peek_desc(xs->tx, &desc)) {
>
> It is possible to pass a chain of skbs to validate_xmit_skb_list and
> eventually pass this chain to xsk_xmit_skb, amortizing the cost of
> taking the txq lock. Fine to ignore for this patch set.

Good suggestion. Will put it down on the todo list for a later patch set.

>> +               char *buffer;
>> +               u32 id, len;
>> +
>> +               if (max_batch-- == 0) {
>> +                       err = -EAGAIN;
>> +                       goto out;
>> +               }
>> +
>> +               if (xskq_reserve_id(xs->umem->cq)) {
>> +                       err = -EAGAIN;
>> +                       goto out;
>> +               }
>> +
>> +               len = desc.len;
>> +               if (unlikely(len > xs->dev->mtu)) {
>> +                       err = -EMSGSIZE;
>> +                       goto out;
>> +               }
>> +
>> +               skb = sock_alloc_send_skb(sk, len, !need_wait, &err);
>> +               if (unlikely(!skb)) {
>> +                       err = -EAGAIN;
>> +                       goto out;
>> +               }
>> +
>> +               skb_put(skb, len);
>> +               id = desc.idx;
>> +               buffer = xdp_umem_get_data(xs->umem, id) + desc.offset;
>> +               err = skb_store_bits(skb, 0, buffer, len);
>> +               if (unlikely(err))
>> +                       goto out_store;
>
> As xsk_destruct_skb delays notification until consume_skb is called, this
> copy can be avoided by linking the xdp buffer into the skb frags array,
> analogous to tpacket_snd.
>
> You probably don't care much about the copy slow path, and this can be
> implemented later, so also no need to do in this patchset.

Agreed. I will also put this in the todo list for a later patch set.

> static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
> +                                             struct xdp_desc *desc)
> +{
> +       struct xdp_rxtx_ring *ring;
> +
> +       if (q->cons_tail == q->cons_head) {
> +               WRITE_ONCE(q->ring->consumer, q->cons_tail);
> +               q->cons_head = q->cons_tail + xskq_nb_avail(q, RX_BATCH_SIZE);
> +
> +               /* Order consumer and data */
> +               smp_rmb();
> +
> +               return xskq_validate_desc(q, desc);
> +       }
> +
> +       ring = (struct xdp_rxtx_ring *)q->ring;
> +       *desc = ring->desc[q->cons_tail & q->ring_mask];
> +       return desc;
>
> This only validates descriptors if taking the branch.

Yes, that is because we only want to validate the descriptors once
even if we call this function multiple times for the same entry.

Thanks. Highly appreciated comments Will.

/Magnus