netdev - Re: [net-next PATCH v2 5/5] virtio_net: XDP support for adjust

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Date:   Mon, 6 Feb 2017 11:29:59 -0800
From:   John Fastabend <john.fastabend@...il.com>
To:     Jason Wang <jasowang@...hat.com>, kubakici@...pl, ast@...com,
        mst@...hat.com
Cc:     john.r.fastabend@...el.com, netdev@...r.kernel.org
Subject: Re: [net-next PATCH v2 5/5] virtio_net: XDP support for adjust_head

On 17-02-05 11:08 PM, Jason Wang wrote:
> 
> 
> On 2017年02月03日 11:16, John Fastabend wrote:
>> Add support for XDP adjust head by allocating a 256B header region
>> that XDP programs can grow into. This is only enabled when a XDP
>> program is loaded.
>>
>> In order to ensure that we do not have to unwind queue headroom push
>> queue setup below bpf_prog_add. It reads better to do a prog ref
>> unwind vs another queue setup call.
>>
>> At the moment this code must do a full reset to ensure old buffers
>> without headroom on program add or with headroom on program removal
>> are not used incorrectly in the datapath. Ideally we would only
>> have to disable/enable the RX queues being updated but there is no
>> API to do this at the moment in virtio so use the big hammer. In
>> practice it is likely not that big of a problem as this will only
>> happen when XDP is enabled/disabled changing programs does not
>> require the reset. There is some risk that the driver may either
>> have an allocation failure or for some reason fail to correctly
>> negotiate with the underlying backend in this case the driver will
>> be left uninitialized. I have not seen this ever happen on my test
>> systems and for what its worth this same failure case can occur
>> from probe and other contexts in virtio framework.
>>
>> Signed-off-by: John Fastabend <john.r.fastabend@...el.com>
>> ---


[...]

>> @@ -412,7 +418,6 @@ static struct sk_buff *receive_small(struct net_device *dev,
>>       struct bpf_prog *xdp_prog;
>>         len -= vi->hdr_len;
>> -    skb_trim(skb, len);
>>         rcu_read_lock();
>>       xdp_prog = rcu_dereference(rq->xdp_prog);
>> @@ -424,12 +429,16 @@ static struct sk_buff *receive_small(struct net_device
>> *dev,
>>           if (unlikely(hdr->hdr.gso_type || hdr->hdr.flags))
>>               goto err_xdp;
>>   -        xdp.data = skb->data;
>> +        xdp.data_hard_start = skb->data;
>> +        xdp.data = skb->data + VIRTIO_XDP_HEADROOM;
>>           xdp.data_end = xdp.data + len;
>>           act = bpf_prog_run_xdp(xdp_prog, &xdp);
>>             switch (act) {
>>           case XDP_PASS:
>> +            /* Recalculate length in case bpf program changed it */
>> +            __skb_pull(skb, xdp.data - xdp.data_hard_start);
> 
> But skb->len were trimmed to len below which seems wrong.

I believe this is correct and it passes my basic iperf/ping tests.

When we are using small buffers with XDP, skb->data is pointing to the front
of the buffer. This space includes the XDP headroom. When we pass the skb up
to the stack we need to pull this off and point to the start of the data. But
there still is likely a bunch of room at the end of the buffer assuming the
packet is smaller than the buffer side.

> 
>> +            len = xdp.data_end - xdp.data;
>>               break;
>>           case XDP_TX:
>>               if (unlikely(!virtnet_xdp_xmit(vi, rq, &xdp, skb)))
>> @@ -446,6 +455,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
>>       }
>>       rcu_read_unlock();
>>   +    skb_trim(skb, len);

So here we trim the packet to set the length to the actual payload size. The
'len' parameter passed into receive_small does not include the headroom so this
gives us the correct length of the payload.

Make sense?

>>       return skb;
>>     err_xdp:

[...]

>> @@ -569,7 +580,7 @@ static struct sk_buff *receive_mergeable(struct net_device
>> *dev,
>>                                 page, offset, &len);
>>               if (!xdp_page)
>>                   goto err_xdp;
>> -            offset = 0;
>> +            offset = VIRTIO_XDP_HEADROOM;
>>           } else {
>>               xdp_page = page;
>>           }
>> @@ -582,19 +593,30 @@ static struct sk_buff *receive_mergeable(struct
>> net_device *dev,
>>           if (unlikely(hdr->hdr.gso_type))
>>               goto err_xdp;
>>   +        /* Allow consuming headroom but reserve enough space to push
>> +         * the descriptor on if we get an XDP_TX return code.
>> +         */
>>           data = page_address(xdp_page) + offset;
>> +        xdp.data_hard_start = data - VIRTIO_XDP_HEADROOM + vi->hdr_len;
> 
> Should be data - VIRTIO_XDP_HEADROOM I think?
> 

If the XDP program does an adjust_head() and then a XDP_TX I want to ensure
we reserve enough headroom to push the header onto the buffer when the packet
is sent. So the additional hdr_len reserve here is intentional. Otherwise we
would need to detect this and do some type of linearize action.

>>           xdp.data = data + vi->hdr_len;
>>           xdp.data_end = xdp.data + (len - vi->hdr_len);
>>           act = bpf_prog_run_xdp(xdp_prog, &xdp);
>>             switch (act) {
>>           case XDP_PASS:
>> +            /* recalculate offset to account for any header
>> +             * adjustments. Note other cases do not build an
>> +             * skb and avoid using offset
>> +             */
>> +            offset = xdp.data -
>> +                    page_address(xdp_page) - vi->hdr_len;
>> +
>>               /* We can only create skb based on xdp_page. */
>>               if (unlikely(xdp_page != page)) {
>>                   rcu_read_unlock();
>>                   put_page(page);
>>                   head_skb = page_to_skb(vi, rq, xdp_page,
>> -                               0, len, PAGE_SIZE);
>> +                               offset, len, PAGE_SIZE);
>>                   ewma_pkt_len_add(&rq->mrg_avg_pkt_len, len);
>>                   return head_skb;

[...]

>>   -static int add_recvbuf_mergeable(struct receive_queue *rq, gfp_t gfp)
>> +static int add_recvbuf_mergeable(struct virtnet_info *vi,
>> +                 struct receive_queue *rq, gfp_t gfp)
>>   {
>>       struct page_frag *alloc_frag = &rq->alloc_frag;
>> +    unsigned int headroom = virtnet_get_headroom(vi);
>>       char *buf;
>>       unsigned long ctx;
>>       int err;
>>       unsigned int len, hole;
>>         len = get_mergeable_buf_len(&rq->mrg_avg_pkt_len);
>> -    if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
>> +    if (unlikely(!skb_page_frag_refill(len + headroom, alloc_frag, gfp)))
>>           return -ENOMEM;
>>         buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
>> +    buf += headroom; /* advance address leaving hole at front of pkt */
> 
> Note: the headroom will reduce the possibility of frag coalescing which may
> damage the performance more or less.
> 
> [...]

Right there are a few other performance optimizations I am looking at in
virtio as well but these should go in as follow on series.

Specifically, I'm looking at recycling buffers to see what sort of performance
increase we can get out of that. Many of the hardware drivers do this and see
a performance boost from it. However dynamic buffer sizes like this make it a
bit challenging.

.John