netdev - Re: [PATCH net-next RFC 5/5] vhost_net: basic tx virtqueue batched processing

Date:   Thu, 28 Sep 2017 15:02:48 +0800
From:   Jason Wang <jasowang@...hat.com>
To:     "Michael S. Tsirkin" <mst@...hat.com>
Cc:     virtualization@...ts.linux-foundation.org, netdev@...r.kernel.org,
        linux-kernel@...r.kernel.org, kvm@...r.kernel.org
Subject: Re: [PATCH net-next RFC 5/5] vhost_net: basic tx virtqueue batched
 processing



On 2017-09-28 06:19, Michael S. Tsirkin wrote:
> On Wed, Sep 27, 2017 at 10:04:18AM +0800, Jason Wang wrote:
>>
>> On 2017-09-27 03:25, Michael S. Tsirkin wrote:
>>> On Fri, Sep 22, 2017 at 04:02:35PM +0800, Jason Wang wrote:
>>>> This patch implements basic batched processing of the tx virtqueue by
>>>> prefetching desc indices and updating the used ring in a batch. For the
>>>> non-zerocopy case, vq->heads is used for storing the prefetched
>>>> indices and updating the used ring. It is also a requirement for doing
>>>> more batching on top. For the zerocopy case, and for simplicity, batched
>>>> processing is simply disabled by fetching and processing only one
>>>> descriptor at a time; this could be optimized in the future.
>>>>
>>>> XDP_DROP (without touching skb) on tun (with MoonGen in guest) with
>>>> zerocopy disabled:
>>>>
>>>> Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz:
>>>> Before: 3.20Mpps
>>>> After:  3.90Mpps (+22%)
>>>>
>>>> No differences were seen with zerocopy enabled.
>>>>
>>>> Signed-off-by: Jason Wang <jasowang@...hat.com>
>>> So where is the speedup coming from? I'd guess the ring is
>>> hot in cache, it's faster to access it in one go, then
>>> pass many packets to net stack. Is that right?
>>>
>>> Another possibility is better code cache locality.
>> Yes, I think the speed up comes from:
>>
>> - fewer cache misses
>> - less cache line bouncing when the virtqueue is about to be full (the guest
>> is faster than the host, which is the case with MoonGen)
>> - fewer memory barriers
>> - possibly faster copy speed by using copy_to_user() on modern CPUs
>>
>>> So how about this patchset is refactored:
>>>
>>> 1. use existing APIs: just first get packets, then
>>>      transmit them all, then use them all
>> Looks like the current API cannot get packets first; it only supports getting
>> packets one by one (if you mean vhost_get_vq_desc()). And used ring updating
>> may get more misses in this case.
> Right. So if you do
>
> for (...)
> 	vhost_get_vq_desc
>
>
> then later
>
> for (...)
> 	vhost_add_used
>
>
> then you get most of the benefits except maybe code cache misses
> and copy_to_user.
>

If you mean vhost_add_used_n(), then there are more barriers and more
userspace memory accesses as well. And you still need to cache
vring_used_elem in vq->heads or elsewhere. Since you do not batch reading
the avail ring indexes, more write misses will happen if the ring is about
to be full. So it looks slower than what is done in this patch?
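
To make the barrier and used-index argument above concrete, here is a
minimal userspace toy model, not the vhost code and not the patch. Every
name in it (toy_ring, toy_post, toy_process_one_by_one,
toy_process_batched) is hypothetical, and the "barriers" are only
counters. It also assumes one read barrier per fetch in the one-by-one
path, which is a simplification of what the real code does.

```c
/* Toy model only: NOT the vhost API. All names here are hypothetical. */
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 8

struct toy_ring {
	unsigned short avail[RING_SIZE]; /* descriptor heads posted by the guest */
	unsigned short used[RING_SIZE];  /* heads handed back by the host */
	unsigned short avail_idx;        /* guest-written producer index */
	unsigned short last_avail;       /* host-private consumer index */
	unsigned short used_idx;         /* host-written index, read by the guest */
	unsigned int barriers;           /* counts simulated memory barriers */
	unsigned int used_idx_writes;    /* counts writes to the shared used index */
};

/* Guest side: post one descriptor head; the ring must not be full. */
static void toy_post(struct toy_ring *r, unsigned short head)
{
	assert((unsigned short)(r->avail_idx - r->last_avail) < RING_SIZE);
	r->avail[r->avail_idx % RING_SIZE] = head;
	r->avail_idx++;
}

/* One descriptor at a time: every descriptor is charged a read barrier
 * (assumption of this model), a write barrier and a used-index update. */
static unsigned int toy_process_one_by_one(struct toy_ring *r)
{
	unsigned int done = 0;

	while (r->last_avail != r->avail_idx) {
		unsigned short head;

		r->barriers++;		/* simulated read barrier */
		head = r->avail[r->last_avail % RING_SIZE];
		r->last_avail++;

		r->used[r->used_idx % RING_SIZE] = head;
		r->barriers++;		/* simulated write barrier */
		r->used_idx++;
		r->used_idx_writes++;
		done++;
	}
	return done;
}

/* Batched: prefetch all available heads after one read barrier, then
 * fill the used ring and publish the used index once. */
static unsigned int toy_process_batched(struct toy_ring *r)
{
	unsigned short heads[RING_SIZE];
	unsigned int n = 0, i;

	if (r->last_avail == r->avail_idx)
		return 0;

	r->barriers++;			/* one simulated read barrier per batch */
	while (r->last_avail != r->avail_idx && n < RING_SIZE)
		heads[n++] = r->avail[r->last_avail++ % RING_SIZE];

	for (i = 0; i < n; i++)
		r->used[(r->used_idx + i) % RING_SIZE] = heads[i];

	r->barriers++;			/* one simulated write barrier per batch */
	r->used_idx += n;
	r->used_idx_writes++;
	return n;
}
```

Under these assumptions the number of simulated barriers and shared-index
writes grows with the number of descriptors in the one-by-one path, but
stays constant per batch in the batched path. It says nothing about
cache misses or copy speed, which only a real measurement can show.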

And if we want to do more batching on top (do we want this?), we still
need new APIs anyway.

Thanks

>
>
>
>
>
>>> 2. add new APIs and move the loop into vhost core
>>>      for more speedups
>> I don't see any advantages; it looks like it would just need e.g. some
>> callbacks in this case.
>>
>> Thanks
> IIUC callbacks pretty much destroy the code cache locality advantages;
> the IP is jumping around too much.
>
>