netdev - Re: thoughts stac/clac and get user for vhost

Message-ID: <3a58f172-f36f-044d-f8ac-8e24b2dc61a5@redhat.com>
Date:   Wed, 2 Jan 2019 11:25:14 +0800
From:   Jason Wang <jasowang@...hat.com>
To:     "Michael S. Tsirkin" <mst@...hat.com>
Cc:     netdev@...r.kernel.org
Subject: Re: thoughts stac/clac and get user for vhost


On 2018/12/31 2:40 AM, Michael S. Tsirkin wrote:
> On Thu, Dec 27, 2018 at 05:55:52PM +0800, Jason Wang wrote:
>> On 2018/12/26 11:06 PM, Michael S. Tsirkin wrote:
>>> On Wed, Dec 26, 2018 at 12:03:50PM +0800, Jason Wang wrote:
>>>> On 2018/12/26 12:41 AM, Michael S. Tsirkin wrote:
>>>>> Hi!
>>>>> I was just wondering: packed ring batches things naturally.
>>>>> E.g.
>>>>>
>>>>> user_access_begin
>>>>> check descriptor valid
>>>>> smp_rmb
>>>>> copy descriptor
>>>>> user_access_end
>>>> But without speculation on the descriptor (which may only work for in-order,
>>>> or may even be a violation of the spec), only one or two accesses of a single
>>>> descriptor can be batched. For split ring, we can batch more since we know how
>>>> many descriptors are pending (avail_idx - last_avail_idx).
>>>>
>>>> Did I miss anything?
>>>>
>>>> Thanks
>>>>
>>> just check more descriptors in a loop:
>>>
>>>    user_access_begin
>>>    for (i = 0; i < 16; ++i) {
>>> 	 if (!descriptor valid)
>>> 		break;
>>> 	 smp_rmb
>>> 	 copy descriptor
>>>    }
>>>    user_access_end
>>>
>>> you don't really need to know how many there are
>>> ahead of time, as you still copy them one by one.
>>
>> So let's see the case of split ring
>>
>>
>>     user_access_begin
>>     n = avail_idx - last_avail_idx    (1)
>>     n = MIN(n, 16)
>>     smp_rmb
>>     read n entries from avail_ring    (2)
>>     for (i = 0; i < n; i++)
>>         copy descriptor               (3)
>>     user_access_end
>>
>>
>> Consider the case of a heavy workload. For packed ring, we have 32
>> userspace accesses and 16 smp_rmb() calls.
>>
>> For split ring we have
>>
>> (1) 1 access
>> (2) 2 accesses at most
>> (3) 16 accesses
>>
>> That is 19 userspace accesses and 1 smp_rmb(). In fact (2) could be
>> eliminated with in-order. (3) could be batched completely with in-order and
>> partially when out of order.
>>
>> I don't see how packed ring helps here, especially considering that lfence
>> on x86 is more than a memory fence: it in fact prevents speculation.
>>
>> Thanks
> So on x86 at least RMB is free, this is why I never bothered optimizing
> it out. Is smp_rmb still worth optimizing out for ARM? Does it cost
> more than the extra indirection in the split ring?


I don't know, but RMB clearly can hurt performance to some degree. And
even on architectures where RMB is free, packed ring still does not show
an obvious advantage.
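
The access counts quoted above can be tallied with a small cost model. This
is only a sketch, not vhost code: BATCH, struct cost, packed_cost() and
split_cost() are hypothetical names. It also assumes that the "2 accesses at
most" for step (2) comes from the avail ring wrapping around, so that the n
entries need either one or two contiguous reads.

```c
#include <assert.h>

#define BATCH 16 /* batch size used in the pseudocode above */

struct cost {
    unsigned user_accesses; /* reads of userspace (guest) memory */
    unsigned rmb;           /* smp_rmb() calls */
};

/* Packed ring loop: each descriptor costs one read to check validity,
 * one smp_rmb(), and one read to copy the descriptor. An invalid
 * descriptor still costs the validity read before the loop breaks. */
struct cost packed_cost(unsigned pending)
{
    struct cost c = {0, 0};

    for (unsigned i = 0; i < BATCH; i++) {
        c.user_accesses++;      /* check descriptor valid */
        if (i >= pending)
            break;
        c.rmb++;                /* smp_rmb */
        c.user_accesses++;      /* copy descriptor */
    }
    return c;
}

/* Split ring: one read of avail_idx, one smp_rmb(), one or two reads of
 * the avail ring (two only if the batch wraps, an assumption of this
 * model), then one read per descriptor. */
struct cost split_cost(unsigned pending, unsigned last_avail_idx,
                       unsigned ring_size)
{
    struct cost c = {0, 0};
    unsigned n = pending < BATCH ? pending : BATCH;
    unsigned start;

    assert(ring_size >= BATCH);

    c.user_accesses++;          /* (1) read avail_idx */
    if (n == 0)
        return c;               /* nothing pending, model skips the barrier */
    c.rmb++;                    /* smp_rmb */

    start = last_avail_idx % ring_size;
    c.user_accesses += (start + n > ring_size) ? 2 : 1; /* (2) */
    c.user_accesses += n;       /* (3) copy n descriptors */
    return c;
}
```

With 16 pending descriptors this model gives 32 accesses and 16 barriers for
packed ring, and at most 19 accesses and 1 barrier for split ring, matching
the numbers in the quoted mail.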


>
> But my point was really fundamental - if ring accesses are expensive
> then we should batch them.


I don't object to batching. The reasons ring accesses are expensive could be:

1) unnecessary overhead caused by speculation barriers and checks like SMAP

2) cache contention

So batching does not conflict with my effort to remove 1). My plan
is: for metadata, try to eliminate 1) completely. For data, we
can do batch copying to amortize the cost. For avail/descriptor
batching, we can try to do it on top.
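
To illustrate the amortization idea only: the sketch below counts how often
a user access window is opened when copying one item at a time versus once
per batch. fake_user_access_begin()/fake_user_access_end() are hypothetical
userspace stand-ins for the kernel's user_access_begin()/user_access_end(),
and struct demo_desc is a made-up layout. Nothing here is the actual vhost
implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: in the kernel, opening and closing the window
 * maps to STAC/CLAC on x86 with SMAP. Here we only count the opens. */
static unsigned window_opens;
static void fake_user_access_begin(void) { window_opens++; }
static void fake_user_access_end(void) { }

/* Made-up descriptor layout, for illustration only. */
struct demo_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* One window per item: n begin/end pairs for n items. */
unsigned copy_one_by_one(struct demo_desc *dst,
                         const struct demo_desc *src, size_t n)
{
    assert(dst && src);
    window_opens = 0;
    for (size_t i = 0; i < n; i++) {
        fake_user_access_begin();
        dst[i] = src[i];
        fake_user_access_end();
    }
    return window_opens;
}

/* One window for the whole batch: the begin/end cost is paid once. */
unsigned copy_batched(struct demo_desc *dst,
                      const struct demo_desc *src, size_t n)
{
    assert(dst && src);
    window_opens = 0;
    fake_user_access_begin();
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
    fake_user_access_end();
    return window_opens;
}
```

For a batch of 16, the first variant opens the window 16 times and the
second opens it once.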


>   Right now we have an API that gets
> an iovec directly. That limits the optimizations you can do.
>
> The translation works like this:
>
> ring -> valid descriptors -> iovecs
>
> We should have APIs for each step that work in batches.
>

Yes.
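
As a rough sketch of what per-step batched APIs could look like (ring ->
valid descriptors -> iovecs), assuming a simplified descriptor with a plain
"valid" field instead of the real avail/used flag bits, and an identity
address translation. All names (sketch_desc, sketch_iov,
sketch_fetch_descs, sketch_descs_to_iov) are hypothetical and not existing
vhost functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified descriptor: "valid" stands in for the real flag check. */
struct sketch_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t valid;
};

/* Minimal iovec-like pair, defined locally to keep the sketch portable. */
struct sketch_iov {
    void *base;
    size_t len;
};

/* Step 1: ring -> valid descriptors. Copies up to max consecutive valid
 * descriptors starting at head and returns how many were copied. */
size_t sketch_fetch_descs(const struct sketch_desc *ring, size_t ring_size,
                          size_t head, struct sketch_desc *out, size_t max)
{
    size_t n = 0;

    assert(ring_size > 0);
    while (n < max && n < ring_size) {
        const struct sketch_desc *d = &ring[(head + n) % ring_size];

        if (!d->valid)
            break;
        out[n++] = *d;
    }
    return n;
}

/* Step 2: valid descriptors -> iovecs. A real implementation would
 * translate guest addresses here; this sketch assumes identity mapping. */
size_t sketch_descs_to_iov(const struct sketch_desc *descs, size_t n,
                           struct sketch_iov *iov)
{
    for (size_t i = 0; i < n; i++) {
        iov[i].base = (void *)(uintptr_t)descs[i].addr;
        iov[i].len = descs[i].len;
    }
    return n;
}
```

Keeping the two steps separate means each one can be batched or optimized
independently, instead of going from the ring to an iovec in a single call.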

Thanks


>
>>>
>>>>> So packed layout should show the gain with this approach.
>>>>> That could be motivation enough to finally enable vhost packed ring
>>>>> support.
>>>>>
>>>>> Thoughts?
>>>>>