netdev - Re: thoughts stac/clac and get user for vhost

Message-ID: <bf342387-bcd3-48e1-4337-af60c6ef6575@redhat.com>
Date:   Mon, 7 Jan 2019 12:26:51 +0800
From:   Jason Wang <jasowang@...hat.com>
To:     "Michael S. Tsirkin" <mst@...hat.com>
Cc:     netdev@...r.kernel.org
Subject: Re: thoughts stac/clac and get user for vhost


On 2019/1/5 5:25 AM, Michael S. Tsirkin wrote:
> On Wed, Jan 02, 2019 at 11:25:14AM +0800, Jason Wang wrote:
>> On 2018/12/31 2:40 AM, Michael S. Tsirkin wrote:
>>> On Thu, Dec 27, 2018 at 05:55:52PM +0800, Jason Wang wrote:
>>>> On 2018/12/26 11:06 PM, Michael S. Tsirkin wrote:
>>>>> On Wed, Dec 26, 2018 at 12:03:50PM +0800, Jason Wang wrote:
>>>>>> On 2018/12/26 12:41 AM, Michael S. Tsirkin wrote:
>>>>>>> Hi!
>>>>>>> I was just wondering: packed ring batches things naturally.
>>>>>>> E.g.
>>>>>>>
>>>>>>> user_access_begin
>>>>>>> check descriptor valid
>>>>>>> smp_rmb
>>>>>>> copy descriptor
>>>>>>> user_access_end
>>>>>> But without speculating on the descriptor (which may only work for in-order,
>>>>>> or may even violate the spec), only the one or two accesses to a single
>>>>>> descriptor can be batched. For split ring, we can batch more since we know
>>>>>> how many descriptors are pending (avail_idx - last_avail_idx).
>>>>>>
>>>>>> Am I missing anything?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>> just check more descriptors in a loop:
>>>>>
>>>>>     user_access_begin
>>>>>     for (i = 0; i < 16; ++i) {
>>>>> 	 if (!descriptor valid)
>>>>> 		break;
>>>>> 	 smp_rmb
>>>>> 	 copy descriptor
>>>>>     }
>>>>>     user_access_end
>>>>>
>>>>> you don't really need to know how many there are
>>>>> ahead of time, as you still copy them one by one.
>>>> So let's see the case of split ring
>>>>
>>>>
>>>> user_access_begin
>>>> n = avail_idx - last_avail_idx (1)
>>>> n = MIN(n, 16)
>>>> smp_rmb
>>>> read n entries from avail_ring (2)
>>>> for (i = 0; i < n; i++)
>>>>       copy descriptor (3)
>>>> user_access_end
>>>>
>>>>
>>>> Consider the case of a heavy workload. For packed ring, we have 32
>>>> userspace accesses and 16 calls to smp_rmb().
>>>>
>>>> For split ring we have
>>>>
>>>> (1) 1 time
>>>>
>>>> (2) 2 times at most
>>>>
>>>> (3) 16 times
>>>>
>>>> That is 19 userspace accesses and 1 smp_rmb(). In fact, (2) could be
>>>> eliminated with in-order, and (3) could be batched completely with in-order
>>>> and partially when out of order.
>>>>
>>>> I don't see how packed ring helps here, especially considering that lfence
>>>> on x86 is more than a memory fence: it in fact prevents speculation.
>>>>
>>>> Thanks
>>> So on x86 at least RMB is free, this is why I never bothered optimizing
>>> it out. Is smp_rmb still worth optimizing out for ARM? Does it cost
>>> more than the extra indirection in the split ring?
>>
>> I don't know, but obviously RMB can hurt performance to some degree. And
>> even on architectures where RMB is free, packed ring still does not show an
>> obvious advantage.
> People do measure gains with a PMD on host+guest.
> So it's a question of optimizing the packed ring implementation in Linux.


Well, a 2%-3% difference is not a lot.

I think it's not hard to make split ring faster with some small
optimizations to the code itself.

Thanks


>
>
>>> But my point was really fundamental - if ring accesses are expensive
>>> then we should batch them.
>>
>> I don't object to batching. The reasons ring accesses are expensive could be:
>>
>> 1) unnecessary overhead caused by speculation barriers and checks like SMAP
>> 2) cache contention
>>
>> So batching does not conflict with the work I did to remove 1). My plan is:
>> for metadata, try to eliminate 1) completely. For data, we can do batch
>> copying to amortize the cost. For avail/descriptor batching, we can try to
>> do it on top.
>>
>>
>>>    Right now we have an API that gets
>>> an iovec directly. That limits the optimizations you can do.
>>>
>>> The translation works like this:
>>>
>>> ring -> valid descriptors -> iovecs
>>>
>>> We should have APIs for each step that work in batches.
>>>
>> Yes.
>>
>> Thanks
>>
>>
>>>>>>> So packed layout should show the gain with this approach.
>>>>>>> That could be motivation enough to finally enable vhost packed ring
>>>>>>> support.
>>>>>>>
>>>>>>> Thoughts?
>>>>>>>
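
The access counts traded in the thread (32 userspace accesses and 16
smp_rmb() for packed ring, versus 19 and 1 for split ring, per batch of
16) can be reproduced with a small counting model. This is only a sketch
of the two pseudocode loops quoted above, not vhost code: struct cost,
packed_ring_cost(), split_ring_cost(), check_thread_figures() and the
"wraps" flag are hypothetical names, and reading "2 times at most" for
step (2) as an avail ring read that wraps past the end of the ring is an
assumption.

```c
#include <assert.h>

#define BATCH 16u  /* batch limit used in both pseudocode loops */

/* Tally of userspace accesses and read barriers for one batch. */
struct cost {
    unsigned user_accesses;
    unsigned rmbs;
};

/* Packed ring loop: each iteration checks validity (one access); a valid
 * descriptor then costs one smp_rmb and one copy (a second access). */
struct cost packed_ring_cost(unsigned avail)
{
    struct cost c = { 0, 0 };
    for (unsigned i = 0; i < BATCH; i++) {
        c.user_accesses++;      /* check descriptor valid */
        if (i >= avail)
            break;              /* descriptor not valid: stop the batch */
        c.rmbs++;               /* smp_rmb */
        c.user_accesses++;      /* copy descriptor */
    }
    return c;
}

/* Split ring: read avail_idx once, issue one smp_rmb, read the avail ring
 * entries (assumed: two accesses only when the read wraps), then copy n
 * descriptors one by one. */
struct cost split_ring_cost(unsigned avail, int wraps)
{
    struct cost c = { 0, 0 };
    unsigned n = avail < BATCH ? avail : BATCH;   /* n = MIN(n, 16) */

    c.user_accesses += 1;                 /* (1) avail_idx */
    if (n == 0)
        return c;                         /* nothing pending */
    c.rmbs += 1;                          /* smp_rmb */
    c.user_accesses += wraps ? 2u : 1u;   /* (2) avail ring entries */
    c.user_accesses += n;                 /* (3) descriptors */
    return c;
}

/* Checks the model against the figures quoted for a heavy workload. */
void check_thread_figures(void)
{
    struct cost p = packed_ring_cost(BATCH);
    struct cost s = split_ring_cost(BATCH, 1);

    assert(p.user_accesses == 32 && p.rmbs == 16);
    assert(s.user_accesses == 19 && s.rmbs == 1);
}
```

Calling check_thread_figures() from a test main() confirms that the model
matches the numbers in the discussion. It counts accesses only; it says
nothing about the relative cost of a barrier versus the extra indirection
in the split ring, which is the open question in the thread.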