Date:   Tue, 18 Apr 2017 10:16:57 +0800
From:   Jason Wang <jasowang@...hat.com>
To:     "Michael S. Tsirkin" <mst@...hat.com>
Cc:     linux-kernel@...r.kernel.org, brouer@...hat.com
Subject: Re: [PATCH 1/3] ptr_ring: batch ring zeroing



On 2017-04-15 05:00, Michael S. Tsirkin wrote:
> On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote:
>>
>> On 2017-04-12 16:03, Jason Wang wrote:
>>>
>>> On 2017-04-07 13:49, Michael S. Tsirkin wrote:
>>>> A known weakness in ptr_ring design is that it does not handle well the
>>>> situation when ring is almost full: as entries are consumed they are
>>>> immediately used again by the producer, so consumer and producer are
>>>> writing to a shared cache line.
>>>>
>>>> To fix this, add batching to consume calls: as entries are
>>>> consumed do not write NULL into the ring until we get
>>>> a multiple (in current implementation 2x) of cache lines
>>>> away from the producer. At that point, write them all out.
>>>>
>>>> We do the write out in the reverse order to keep
>>>> producer from sharing cache with consumer for as long
>>>> as possible.
>>>>
>>>> Writeout also triggers when ring wraps around - there's
>>>> no special reason to do this but it helps keep the code
>>>> a bit simpler.
>>>>
>>>> What should we do if getting away from the producer by 2 cache lines
>>>> would mean we are keeping the ring more than half empty?
>>>> Maybe we should reduce the batching in this case;
>>>> the current patch simply reduces the batch size.
>>>>
>>>> Notes:
>>>> - it is no longer true that a call to consume guarantees
>>>>     that the following call to produce will succeed.
>>>>     No users seem to assume that.
>>>> - batching can also in theory reduce the signalling rate:
>>>>     users that would previously send interrupts to the producer
>>>>     to wake it up after consuming each entry would now only
>>>>     need to do this once in a batch.
>>>>     Doing this would be easy by returning a flag to the caller.
>>>>     No users seem to do signalling on consume yet, so this was
>>>>     not implemented.
>>>>
>>>> Signed-off-by: Michael S. Tsirkin<mst@...hat.com>
>>>> ---
>>>>
>>>> Jason, I am curious whether the following gives you some of
>>>> the performance boost that you see with vhost batching
>>>> patches. Is vhost batching on top still helpful?
>>> The patch looks good to me, will have a test for vhost batching patches.
>>>
>>> Thanks
>> Still helpful:
>>
>> before this patch: 1.84Mpps
>> with this patch: 2.00Mpps
>> with batch dequeuing: 2.30Mpps
>>
>> Acked-by: Jason Wang <jasowang@...hat.com>
>>
>> Thanks
> Fascinating. How do we explain the gain with batch dequeue?

I counted the drop rate (pktgen on tun, counting tun TX) and maybe it
can explain it, more or less:

Before this patch:      TX xmit 1.80Mpps, TX dropped 0.23Mpps, TX total 2.04Mpps (11% dropped)
After this patch:       TX xmit 1.95Mpps, TX dropped 0.33Mpps, TX total 2.28Mpps (14% dropped)
With batched dequeuing: TX xmit 2.24Mpps, TX dropped 0.01Mpps, TX total 2.26Mpps (~0% dropped)

With this patch, we remove the cache contention more or less by blocking
the producer. With batch dequeuing, the ring is not full in 99% of the
cases, which probably means the producer is not blocked most of the time.
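For reference, the batched zeroing described in the patch boils down to something like the following (a simplified, single-threaded sketch, not the actual ptr_ring code; the `ring_*` names and the sizes are made up for illustration):

```c
#include <stddef.h>

#define RING_SIZE 16
#define BATCH      4   /* entries to accumulate before zeroing */

struct ring {
	void *queue[RING_SIZE];
	int producer;        /* next slot the producer fills */
	int consumer_head;   /* next slot the consumer reads */
	int consumer_tail;   /* first consumed slot not yet zeroed */
};

/* Producer: a slot is free for reuse only once it holds NULL. */
static int ring_produce(struct ring *r, void *p)
{
	if (r->queue[r->producer])
		return -1;              /* full, from the producer's view */
	r->queue[r->producer] = p;
	if (++r->producer >= RING_SIZE)
		r->producer = 0;
	return 0;
}

/* Consumer: read an entry but defer writing NULL until a whole batch
 * has been consumed (or the ring wraps).  The NULLs go out in reverse
 * order so the slot nearest the producer is released last. */
static void *ring_consume(struct ring *r)
{
	void *p = r->queue[r->consumer_head];
	int head;

	if (!p)
		return NULL;            /* empty */
	head = r->consumer_head++;
	if (r->consumer_head - r->consumer_tail >= BATCH ||
	    r->consumer_head >= RING_SIZE) {
		while (head >= r->consumer_tail)
			r->queue[head--] = NULL;  /* reverse write-out */
		r->consumer_tail = r->consumer_head;
	}
	if (r->consumer_head >= RING_SIZE) {
		r->consumer_head = 0;
		r->consumer_tail = 0;
	}
	return p;
}
```

Note how this also shows the documented caveat: right after a consume, the slot may still be non-NULL, so an immediate produce into it can fail until the batch is written out.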

> Is it just the lock overhead?

I removed the spinlocks for peeking and dequeuing on top of this patch.
The TX pps only increased from ~2Mpps to ~2.08Mpps, so lock overhead
alone does not explain the gain.
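To make the lock-amortization part concrete, per-entry locking versus batched dequeuing differs roughly like this (an illustrative userspace sketch using a pthread mutex, not the tun/vhost code; all names are made up):

```c
#include <pthread.h>
#include <stddef.h>

#define QSIZE 256

struct q {
	pthread_mutex_t lock;
	void *slots[QSIZE];
	int head, tail;
};

static int q_produce(struct q *q, void *p)
{
	int ret = -1;
	pthread_mutex_lock(&q->lock);
	int next = (q->tail + 1) % QSIZE;
	if (next != q->head) {
		q->slots[q->tail] = p;
		q->tail = next;
		ret = 0;
	}
	pthread_mutex_unlock(&q->lock);
	return ret;
}

/* Per-entry dequeue: one lock/unlock pair per element. */
static void *q_consume_one(struct q *q)
{
	void *p = NULL;
	pthread_mutex_lock(&q->lock);
	if (q->head != q->tail) {
		p = q->slots[q->head];
		q->head = (q->head + 1) % QSIZE;
	}
	pthread_mutex_unlock(&q->lock);
	return p;
}

/* Batched dequeue: one lock/unlock pair amortized over up to n elements. */
static int q_consume_batch(struct q *q, void **out, int n)
{
	int got = 0;
	pthread_mutex_lock(&q->lock);
	while (got < n && q->head != q->tail) {
		out[got++] = q->slots[q->head];
		q->head = (q->head + 1) % QSIZE;
	}
	pthread_mutex_unlock(&q->lock);
	return got;
}
```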

> Can you pls try to replace
> the lock with a simple non-fair atomic and see what happens?
>

Not sure I get the idea. In the uncontended case we already hit the
spinlock fast path, which is just a cmpxchg().
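If I read the suggestion right, a "simple non-fair atomic" would be something like a bare test-and-set lock with no waiter ordering (a C11 sketch for illustration; the uncontended acquire is a single atomic RMW, comparable to the spinlock fast path's cmpxchg):

```c
#include <stdatomic.h>

/* Minimal non-fair lock: whoever wins the atomic exchange gets in.
 * Unlike a ticket spinlock, waiters are not served FIFO: a newcomer
 * can always beat a thread that has been spinning for a while. */
typedef struct {
	atomic_flag locked;
} nf_lock;

#define NF_LOCK_INIT { ATOMIC_FLAG_INIT }

static inline void nf_lock_acquire(nf_lock *l)
{
	/* Uncontended case: one atomic test-and-set and we are done. */
	while (atomic_flag_test_and_set_explicit(&l->locked,
						 memory_order_acquire))
		;	/* spin */
}

static inline void nf_lock_release(nf_lock *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```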

Thanks

