Date:	Tue, 24 May 2016 23:34:14 +0300
From:	"Michael S. Tsirkin" <mst@...hat.com>
To:	Jesper Dangaard Brouer <brouer@...hat.com>
Cc:	linux-kernel@...r.kernel.org, Jason Wang <jasowang@...hat.com>,
	Eric Dumazet <eric.dumazet@...il.com>, davem@...emloft.net,
	netdev@...r.kernel.org, Steven Rostedt <rostedt@...dmis.org>
Subject: Re: [PATCH v5 2/2] skb_array: ring test

On Tue, May 24, 2016 at 07:03:20PM +0200, Jesper Dangaard Brouer wrote:
> 
> On Tue, 24 May 2016 12:28:09 +0200
> Jesper Dangaard Brouer <brouer@...hat.com> wrote:
> 
> > I do like perf, but it does not answer my questions about the
> > performance of this queue. I will code something up in my own
> > framework[2] to answer my own performance questions.
> > 
> > Like what is the minimum overhead (in cycles) achievable with this type
> > of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot)
> > for fastpath usage.
> 
> Coded it up here:
>  https://github.com/netoptimizer/prototype-kernel/commit/b16a3332184
>  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c
> 
> This is a really fake benchmark, but it sort of shows the minimum
> overhead achievable with this type of queue, where it is the same
> CPU enqueuing and dequeuing, and cache is guaranteed to be hot.
> 
> Measured on an i7-4790K CPU @ 4.00GHz, the average cost of
> enqueue+dequeue of a single object is around 102 cycles(tsc).
> 
> To compare this with the results below, where enq and deq are measured
> separately:
>  102 / 2 = 51 cycles
> 
>  
> > Then I also want to know how this performs when two CPUs are involved,
> > as this is also a primary use-case for you when sending packets into a
> > guest.
> 
> Coded it up here:
>  https://github.com/netoptimizer/prototype-kernel/commit/75fe31ef62e
>  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c
>  
> This parallel benchmark tries to keep two (or more) CPUs busy enqueuing or
> dequeuing on the same skb_array queue.  It prefills the queue,
> and stops the test as soon as the queue is empty or full, or
> when it completes a number of "loops"/cycles.
> 
> For two CPUs the results are really good:
>  enqueue: 54 cycles(tsc)
>  dequeue: 53 cycles(tsc)
> 
> Going to 4 CPUs, things break down (but it was not primary use-case?):
>  CPU(0) 927 cycles(tsc) enqueue
>  CPU(1) 921 cycles(tsc) dequeue
>  CPU(2) 927 cycles(tsc) enqueue
>  CPU(3) 898 cycles(tsc) dequeue

It's mostly the spinlock contention I guess.
Maybe we don't need fair spinlocks in this case.
Try replacing spinlocks with simple cmpxchg
and see what happens?
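
To make the suggestion concrete, here is a rough userspace sketch of an
unfair lock built on a compare-and-swap, using C11 atomics. It is not the
kernel's spinlock_t and not what skb_array does today; the names
(cas_lock, cas_trylock, and so on) are made up for illustration. Unlike a
ticket/queued spinlock it gives no fairness guarantee: whichever CPU wins
the cmpxchg gets the lock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical unfair lock; names are illustrative, not a kernel API. */
struct cas_lock {
	atomic_int locked;	/* 0 = free, 1 = held */
};

static inline void cas_lock_init(struct cas_lock *l)
{
	atomic_init(&l->locked, 0);
}

/* One cmpxchg attempt: succeeds only if the lock was free (0 -> 1). */
static inline bool cas_trylock(struct cas_lock *l)
{
	int expected = 0;

	return atomic_compare_exchange_strong_explicit(&l->locked, &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/* Spin until the cmpxchg wins. No queueing, so no fairness. */
static inline void cas_lock_acquire(struct cas_lock *l)
{
	while (!cas_trylock(l)) {
		/* Wait on plain loads so the cache line is not bounced
		 * by repeated failed cmpxchg attempts. */
		while (atomic_load_explicit(&l->locked, memory_order_relaxed))
			;
	}
}

static inline void cas_unlock(struct cas_lock *l)
{
	/* Debug check: unlocking a free lock is a caller bug. */
	assert(atomic_load_explicit(&l->locked, memory_order_relaxed) == 1);
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The inner read-only loop is the usual test-and-test-and-set trick. Whether
this actually helps in the 4 CPU case is exactly what the experiment would
have to show.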