Message-ID: <3a58f172-f36f-044d-f8ac-8e24b2dc61a5@redhat.com>
Date: Wed, 2 Jan 2019 11:25:14 +0800
From: Jason Wang <jasowang@...hat.com>
To: "Michael S. Tsirkin" <mst@...hat.com>
Cc: netdev@...r.kernel.org
Subject: Re: thoughts stac/clac and get user for vhost
On 2018/12/31 2:40 AM, Michael S. Tsirkin wrote:
> On Thu, Dec 27, 2018 at 05:55:52PM +0800, Jason Wang wrote:
>> On 2018/12/26 11:06 PM, Michael S. Tsirkin wrote:
>>> On Wed, Dec 26, 2018 at 12:03:50PM +0800, Jason Wang wrote:
>>>> On 2018/12/26 12:41 AM, Michael S. Tsirkin wrote:
>>>>> Hi!
>>>>> I was just wondering: packed ring batches things naturally.
>>>>> E.g.
>>>>>
>>>>> user_access_begin
>>>>> check descriptor valid
>>>>> smp_rmb
>>>>> copy descriptor
>>>>> user_access_end
>>>> But without speculating on descriptors (which may only work for in-order,
>>>> or may even violate the spec), only one or two accesses to a single
>>>> descriptor can be batched. For the split ring, we can batch more since we
>>>> know how many descriptors are pending (avail_idx - last_avail_idx).
>>>>
>>>> Did I miss anything?
>>>>
>>>> Thanks
>>>>
>>> just check more descriptors in a loop:
>>>
>>> user_access_begin
>>> for (i = 0; i < 16; ++i) {
>>>         if (!descriptor valid)
>>>                 break;
>>>         smp_rmb
>>>         copy descriptor
>>> }
>>> user_access_end
>>>
>>> you don't really need to know how many there are
>>> ahead of time, as you still copy them one by one.
>>
>> So let's look at the split ring case:
>>
>> user_access_begin
>> n = avail_idx - last_avail_idx      (1)
>> n = MIN(n, 16)
>> smp_rmb
>> read n entries from avail_ring      (2)
>> for (i = 0; i < n; i++)
>>         copy descriptor             (3)
>> user_access_end
>>
>> Consider a heavy workload. For the packed ring, we have 32 userspace
>> accesses and 16 smp_rmb() calls.
>>
>> For the split ring we have:
>>
>> (1) 1 access
>> (2) 2 accesses at most
>> (3) 16 accesses
>>
>> That is 19 userspace accesses and 1 smp_rmb(). In fact, (2) could be
>> eliminated with in-order, and (3) could be batched completely with
>> in-order and partially when out of order.
>>
>> I don't see how the packed ring helps here, especially considering that
>> lfence on x86 is more than a memory fence: it in fact prevents speculation.
>>
>> Thanks
> So on x86 at least, RMB is free, which is why I never bothered optimizing
> it out. Is smp_rmb still worth optimizing out for ARM? Does it cost
> more than the extra indirection in the split ring?
I don't know, but obviously RMB can hurt performance to some degree.
And even on architectures where RMB is free, the packed ring still
does not show an obvious advantage.
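
To make the counting quoted above concrete, here is a rough userspace model
of it. This is only a sketch, not vhost code. The names (struct access_cost,
packed_cost, split_cost, BATCH) are made up for illustration, and it assumes
that each "check descriptor valid", each "copy descriptor", the avail_idx
read, and each contiguous read of the avail ring counts as one userspace
access:

```c
#include <assert.h>

#define BATCH 16u   /* batch limit used in the pseudocode quoted above */

/* Hypothetical counters, for illustration only. */
struct access_cost {
    unsigned user_accesses;   /* reads of userspace memory */
    unsigned rmbs;            /* smp_rmb() calls */
};

/* Packed ring loop: per descriptor, check validity, smp_rmb, copy. */
static struct access_cost packed_cost(unsigned pending)
{
    struct access_cost c = { 0, 0 };

    for (unsigned i = 0; i < BATCH; i++) {
        c.user_accesses++;          /* check descriptor valid */
        if (i >= pending)
            break;                  /* the check failed, stop the batch */
        c.rmbs++;                   /* smp_rmb */
        c.user_accesses++;          /* copy descriptor */
    }
    return c;
}

/*
 * Split ring loop. 'wraps' is 1 when the n avail entries straddle the
 * end of the avail ring and need a second read, otherwise 0.
 */
static struct access_cost split_cost(unsigned pending, unsigned wraps)
{
    struct access_cost c = { 0, 0 };
    unsigned n = pending < BATCH ? pending : BATCH;

    assert(wraps <= 1);             /* at most two reads for step (2) */
    c.user_accesses++;              /* (1) read avail_idx */
    if (n == 0)
        return c;                   /* assumption: nothing else to do */
    c.rmbs++;                       /* single smp_rmb */
    c.user_accesses += 1 + wraps;   /* (2) read n avail entries */
    c.user_accesses += n;           /* (3) copy n descriptors */
    return c;
}
```

Under these assumptions, packed_cost(16) gives 32 accesses and 16 barriers,
while split_cost(16, 1) gives 19 accesses and 1 barrier, which matches the
numbers quoted above.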
>
> But my point was really fundamental - if ring accesses are expensive
> then we should batch them.
I don't object to batching. The reasons ring accesses are expensive could be:
1) unnecessary overhead caused by speculation barriers and checks like SMAP
2) cache contention
So batching does not conflict with the effort I made to remove 1). My plan
is: for metadata, try to eliminate 1) completely. For data, we
can do batch copying to amortize the cost. For avail/descriptor
batching, we can try to do it on top.
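
For reference, the batched packed ring loop quoted earlier in this thread
could be modelled in userspace roughly as below. This is a sketch under
assumptions: struct packed_desc, packed_desc_is_avail and packed_fetch_batch
are illustrative names, the C11 acquire fence is only a stand-in for
smp_rmb(), and the flag bits follow the AVAIL (bit 7) and USED (bit 15)
positions of the virtio 1.1 packed ring layout. In the kernel the loop would
sit between user_access_begin and user_access_end, and the flags load would
need READ_ONCE-style handling:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DESC_F_AVAIL (1u << 7)    /* packed ring AVAIL flag bit */
#define DESC_F_USED  (1u << 15)   /* packed ring USED flag bit */

/* Illustrative mirror of a packed ring descriptor. */
struct packed_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/* Available to the device when AVAIL matches the wrap counter and USED does not. */
static bool packed_desc_is_avail(uint16_t flags, bool wrap)
{
    bool avail = (flags & DESC_F_AVAIL) != 0;
    bool used = (flags & DESC_F_USED) != 0;

    return avail == wrap && used != wrap;
}

/* Copy up to 'max' available descriptors; returns how many were copied. */
static size_t packed_fetch_batch(const struct packed_desc *ring, uint16_t num,
                                 uint16_t *next, bool *wrap,
                                 struct packed_desc *out, size_t max)
{
    size_t i;

    for (i = 0; i < max; i++) {
        const struct packed_desc *d = &ring[*next];

        if (!packed_desc_is_avail(d->flags, *wrap))
            break;
        /* Stand-in for smp_rmb(): read flags before the rest. */
        atomic_thread_fence(memory_order_acquire);
        out[i] = *d;

        (*next)++;
        if (*next == num) {       /* ring wrapped: flip the wrap counter */
            *next = 0;
            *wrap = !*wrap;
        }
    }
    return i;
}
```

The caller does not need to know the number of pending descriptors in
advance; the loop simply stops at the first descriptor that is not
available.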
> Right now we have an API that gets
> an iovec directly. That limits the optimizations you can do.
>
> The translation works like this:
>
> ring -> valid descriptors -> iovecs
>
> We should have APIs for each step that work in batches.
>
Yes.
Thanks
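
One possible shape for such per-step batch APIs, as a userspace sketch only:
none of these names (struct model_ring, ring_fetch_descs, descs_to_iov)
exist in vhost. The sketch assumes a split ring, ignores chained and
indirect descriptors, leaves out the barriers and user access windows, and
treats a descriptor address as a plain offset into one flat buffer instead
of going through real address translation:

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>   /* struct iovec */

/* Illustrative descriptor, laid out like a split ring descriptor. */
struct model_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Illustrative ring state; 'num' must be a power of two. */
struct model_ring {
    const struct model_desc *desc;   /* descriptor table */
    const uint16_t *avail;           /* avail ring entries */
    uint16_t avail_idx;              /* published by the driver */
    uint16_t last_avail_idx;         /* consumed so far */
    uint16_t num;
};

/* Step 1: ring -> valid descriptors, up to 'max' per call. */
static size_t ring_fetch_descs(struct model_ring *r,
                               struct model_desc *out, size_t max)
{
    uint16_t pending = (uint16_t)(r->avail_idx - r->last_avail_idx);
    size_t n = pending < max ? pending : max;
    size_t i;

    for (i = 0; i < n; i++) {
        uint16_t slot = (uint16_t)(r->last_avail_idx + i) & (r->num - 1);
        uint16_t head = r->avail[slot];

        if (head >= r->num)
            break;                   /* invalid head ends the batch */
        out[i] = r->desc[head];
    }
    r->last_avail_idx = (uint16_t)(r->last_avail_idx + i);
    return i;
}

/* Step 2: descriptors -> iovecs, stopping at the first bad range. */
static size_t descs_to_iov(const struct model_desc *d, size_t n,
                           uint8_t *mem, size_t mem_size,
                           struct iovec *iov)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (d[i].addr > mem_size || d[i].len > mem_size - d[i].addr)
            break;                   /* out of bounds for the flat buffer */
        iov[i].iov_base = mem + d[i].addr;
        iov[i].iov_len = d[i].len;
    }
    return i;
}
```

Splitting the steps this way lets each stage run over a whole batch before
the next one starts, so any per-stage fixed cost is paid once per batch
rather than once per descriptor.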
>
>>>
>>>>> So packed layout should show the gain with this approach.
>>>>> That could be motivation enough to finally enable vhost packed ring
>>>>> support.
>>>>>
>>>>> Thoughts?
>>>>>