netdev - Re: [PATCH net-next RFC 5/5] vhost_net: basic tx virtqueue batched processing

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <16ea7512-d770-21ef-edb6-3ada51f08592@redhat.com>
Date:   Wed, 27 Sep 2017 10:04:18 +0800
From:   Jason Wang <jasowang@...hat.com>
To:     "Michael S. Tsirkin" <mst@...hat.com>
Cc:     virtualization@...ts.linux-foundation.org, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org, kvm@...r.kernel.org
Subject: Re: [PATCH net-next RFC 5/5] vhost_net: basic tx virtqueue batched
 processing



On 2017年09月27日 03:25, Michael S. Tsirkin wrote:
> On Fri, Sep 22, 2017 at 04:02:35PM +0800, Jason Wang wrote:
>> This patch implements basic batched processing of tx virtqueue by
>> prefetching desc indices and updating used ring in a batch. For
>> non-zerocopy case, vq->heads were used for storing the prefetched
>> indices and updating used ring. It is also a requirement for doing
>> more batching on top. For zerocopy case and for simplicity, batched
>> processing were simply disabled by only fetching and processing one
>> descriptor at a time, this could be optimized in the future.
>>
>> XDP_DROP (without touching skb) on tun (with Moongen in guest) with
>> zercopy disabled:
>>
>> Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz:
>> Before: 3.20Mpps
>> After:  3.90Mpps (+22%)
>>
>> No differences were seen with zerocopy enabled.
>>
>> Signed-off-by: Jason Wang <jasowang@...hat.com>
> So where is the speedup coming from? I'd guess the ring is
> hot in cache, it's faster to access it in one go, then
> pass many packets to net stack. Is that right?
>
> Another possibility is better code cache locality.

Yes, I think the speed up comes from:

- less cache misses
- less cache line bounce when virtqueue is about to be full (guest is 
faster than host which is the case of MoonGen)
- less memory barriers
- possible faster copy speed by using copy_to_user() on modern CPUs

>
> So how about this patchset is refactored:
>
> 1. use existing APIs just first get packets then
>     transmit them all then use them all

Looks like current API can not get packets first, it only support get 
packet one by one (if you mean vhost_get_vq_desc()). And used ring 
updating may get more misses in this case.

> 2. add new APIs and move the loop into vhost core
>     for more speedups

I don't see any advantages, looks like just need some e.g callbacks in 
this case.

Thanks