netdev - Re: thoughts stac/clac and get user for vhost


Message-ID: <042e0002-0dce-42e4-8694-4f3fa96c3975@redhat.com>
Date:   Thu, 27 Dec 2018 17:55:52 +0800
From:   Jason Wang <jasowang@...hat.com>
To:     "Michael S. Tsirkin" <mst@...hat.com>
Cc:     netdev@...r.kernel.org
Subject: Re: thoughts stac/clac and get user for vhost

On 2018/12/26 11:06 PM, Michael S. Tsirkin wrote:
> On Wed, Dec 26, 2018 at 12:03:50PM +0800, Jason Wang wrote:
>> On 2018/12/26 12:41 AM, Michael S. Tsirkin wrote:
>>> Hi!
>>> I was just wondering: packed ring batches things naturally.
>>> E.g.
>>>
>>> user_access_begin
>>> check descriptor valid
>>> smp_rmb
>>> copy descriptor
>>> user_access_end
>>
>> But that is without speculating on the descriptor (which may only work for
>> in-order, or may even violate the spec), so only the one or two accesses to
>> a single descriptor can be batched. For split ring, we can batch more since
>> we know how many descriptors are pending (avail_idx - last_avail_idx).
>>
>> Anything I miss?
>>
>> Thanks
>>
> just check more descriptors in a loop:
>
>   user_access_begin
>   for (i = 0; i < 16; ++i) {
> 	 if (!descriptor valid)
> 		break;
> 	 smp_rmb
> 	 copy descriptor
>   }
>   user_access_end
>
> you don't really need to know how many there are
> ahead of time, as you still copy them one by one.

So let's look at the split ring case:

  user_access_begin
  n = avail_idx - last_avail_idx          (1)
  n = MIN(n, 16)
  smp_rmb
  read n entries from avail_ring          (2)
  for (i = 0; i < n; i++)
          copy descriptor                 (3)
  user_access_end
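
As a rough userspace sketch of steps (1) to (3) above (not vhost code): the
struct layout, RING_SIZE, BATCH and fetch_batch() are all hypothetical names
made up for illustration, plain memory stands in for userspace memory, and a
C11 acquire fence stands in for smp_rmb. It also assumes that more than one
avail ring read is only needed when the pending range wraps past the end of
the ring array.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE 256u  /* hypothetical ring size; must be a power of two */
#define BATCH     16u   /* batch limit taken from the pseudocode */

struct desc { uint64_t addr; uint32_t len; uint16_t flags; uint16_t next; };

struct split_ring {
    uint16_t avail_idx;              /* free-running index written by the driver */
    uint16_t avail_ring[RING_SIZE];  /* descriptor heads, indexed modulo RING_SIZE */
    struct desc desc[RING_SIZE];
};

/*
 * Copy up to BATCH pending descriptors into out[] and return how many were
 * copied. *ring_copies reports how many avail ring reads were needed (0..2).
 */
static unsigned fetch_batch(const struct split_ring *r, uint16_t *last_avail_idx,
                            struct desc out[BATCH], unsigned *ring_copies)
{
    uint16_t heads[BATCH];
    unsigned n, start, first, i;

    assert((RING_SIZE & (RING_SIZE - 1)) == 0);

    /* (1) one read of avail_idx; the 16-bit cast handles index wrap-around */
    n = (uint16_t)(r->avail_idx - *last_avail_idx);
    if (n > BATCH)
        n = BATCH;
    *ring_copies = 0;
    if (n == 0)
        return 0;

    /* stand-in for smp_rmb: order the index read before the ring reads */
    atomic_thread_fence(memory_order_acquire);

    /* (2) read n heads: one copy, or two if the range wraps around the ring */
    start = *last_avail_idx % RING_SIZE;
    first = RING_SIZE - start;
    if (first > n)
        first = n;
    memcpy(heads, &r->avail_ring[start], first * sizeof(heads[0]));
    *ring_copies = 1;
    if (first < n) {
        memcpy(heads + first, &r->avail_ring[0], (n - first) * sizeof(heads[0]));
        *ring_copies = 2;
    }

    /* (3) one copy per descriptor */
    for (i = 0; i < n; i++)
        out[i] = r->desc[heads[i] % RING_SIZE];

    *last_avail_idx = (uint16_t)(*last_avail_idx + n);
    return n;
}
```

The point of the sketch is only that a single barrier covers the whole batch,
because the number of pending entries is known before any ring entry is read.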

Consider a heavy workload. For packed ring, a batch of 16 descriptors costs
32 userspace accesses (two per descriptor) and 16 calls to smp_rmb().

For split ring we have:

(1) 1 access
(2) at most 2 accesses
(3) 16 accesses

That is 19 userspace accesses and 1 smp_rmb(). In fact, (2) could be
eliminated with in-order, and (3) could be batched completely with in-order
and partially when out of order.
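
The arithmetic above can be written down as a small C model. This is only a
counting sketch: struct cost, packed_cost() and split_cost() are hypothetical
names, and the packed figure assumes the validity check and the descriptor
copy count as two separate userspace accesses per descriptor.

```c
#include <assert.h>

struct cost {
    unsigned user_accesses;  /* number of userspace memory accesses */
    unsigned rmb;            /* number of smp_rmb() calls */
};

/* Packed ring: per descriptor, check the flags, smp_rmb, copy the descriptor. */
static struct cost packed_cost(unsigned n)
{
    struct cost c = { 0, 0 };
    unsigned i;

    for (i = 0; i < n; i++) {
        c.user_accesses += 2;  /* validity check + descriptor copy */
        c.rmb += 1;            /* barrier between the two */
    }
    return c;
}

/* Split ring: steps (1) to (3), with a single barrier for the whole batch. */
static struct cost split_cost(unsigned n, int ring_wraps)
{
    struct cost c = { 0, 0 };

    assert(n > 0);
    c.user_accesses += 1;                   /* (1) read avail_idx */
    c.rmb += 1;                             /* one smp_rmb per batch */
    c.user_accesses += ring_wraps ? 2 : 1;  /* (2) read avail ring entries */
    c.user_accesses += n;                   /* (3) copy each descriptor */
    return c;
}
```

With n = 16, packed_cost() gives 32 accesses and 16 barriers, while
split_cost() gives 19 accesses (18 if the avail ring read does not wrap) and
1 barrier.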

I don't see how packed ring helps here, especially considering that lfence
on x86 is more than a memory fence: it in fact prevents speculation.

Thanks

>
>
>>> So packed layout should show the gain with this approach.
>>> That could be motivation enough to finally enable vhost packed ring
>>> support.
>>>
>>> Thoughts?
>>>