Message-ID: <20160524224710-mutt-send-email-mst@redhat.com>
Date: Tue, 24 May 2016 23:34:14 +0300
From: "Michael S. Tsirkin" <mst@...hat.com>
To: Jesper Dangaard Brouer <brouer@...hat.com>
Cc: linux-kernel@...r.kernel.org, Jason Wang <jasowang@...hat.com>,
 Eric Dumazet <eric.dumazet@...il.com>, davem@...emloft.net,
 netdev@...r.kernel.org, Steven Rostedt <rostedt@...dmis.org>
Subject: Re: [PATCH v5 2/2] skb_array: ring test

On Tue, May 24, 2016 at 07:03:20PM +0200, Jesper Dangaard Brouer wrote:
>
> On Tue, 24 May 2016 12:28:09 +0200
> Jesper Dangaard Brouer <brouer@...hat.com> wrote:
>
> > I do like perf, but it does not answer my questions about the
> > performance of this queue. I will code something up in my own
> > framework[2] to answer my own performance questions.
> >
> > Like what is the minimum overhead (in cycles) achievable with this type
> > of queue, in the most optimal situation (e.g. same CPU enq+deq cache hot)
> > for fastpath usage.
>
> Coded it up here:
>  https://github.com/netoptimizer/prototype-kernel/commit/b16a3332184
>  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c
>
> This is a really fake benchmark, but it sort of shows the minimum
> overhead achievable with this type of queue, where it is the same
> CPU enqueuing and dequeuing, and cache is guaranteed to be hot.
>
> Measured on an i7-4790K CPU @ 4.00GHz, the average cost of
> enqueue+dequeue of a single object is around 102 cycles(tsc).
>
> To compare this with below, where enq and deq are measured separately:
>  102 / 2 = 51 cycles
>
>
> > Then I also want to know how this performs when two CPUs are involved.
> > As this is also a primary use-case, for you when sending packets into a
> > guest.
>
> Coded it up here:
>  https://github.com/netoptimizer/prototype-kernel/commit/75fe31ef62e
>  https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_parallel01.c
>
> This parallel benchmark tries to keep two (or more) CPUs busy enqueuing or
> dequeuing on the same skb_array queue. It prefills the queue,
> and stops the test as soon as the queue is empty or full, or
> completes a number of "loops"/cycles.
>
> For two CPUs the results are really good:
>  enqueue: 54 cycles(tsc)
>  dequeue: 53 cycles(tsc)
>
> Going to 4 CPUs, things break down (but it was not primary use-case?):
>  CPU(0) 927 cycles(tsc) enqueue
>  CPU(1) 921 cycles(tsc) dequeue
>  CPU(2) 927 cycles(tsc) enqueue
>  CPU(3) 898 cycles(tsc) dequeue

It's mostly spinlock contention, I guess. Maybe we don't need fair
spinlocks in this case. Try replacing the spinlocks with a simple cmpxchg
and see what happens?
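
To illustrate the cmpxchg idea, here is a rough user-space sketch of an
unfair test-and-set lock guarding a small pointer ring. It is not code from
the skb_array patch or from the benchmarks linked above. It assumes C11
atomic_compare_exchange_strong as a stand-in for the kernel's cmpxchg(), and
struct tas_lock, struct ring, RING_SIZE and all helper names are
hypothetical, made up for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical unfair lock: 0 = free, 1 = held. No ticket, no queue. */
struct tas_lock {
	atomic_int locked;
};

/* One compare-and-exchange attempt; succeeds only if the lock was free. */
static bool tas_trylock(struct tas_lock *l)
{
	int expected = 0;

	return atomic_compare_exchange_strong_explicit(&l->locked, &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}

/* Whoever wins the compare-and-exchange gets the lock; waiters are not ordered. */
static void tas_lock_acquire(struct tas_lock *l)
{
	while (!tas_trylock(l)) {
		/* Spin on a plain load so waiters do not keep writing the line. */
		while (atomic_load_explicit(&l->locked, memory_order_relaxed))
			;
	}
}

static void tas_unlock(struct tas_lock *l)
{
	assert(atomic_load_explicit(&l->locked, memory_order_relaxed) == 1);
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}

#define RING_SIZE 4	/* assumed tiny power-of-two size, for illustration */

/* Hypothetical ring; one lock covers both ends to keep the sketch short. */
struct ring {
	struct tas_lock lock;
	void *slot[RING_SIZE];
	unsigned int head;	/* total objects enqueued */
	unsigned int tail;	/* total objects dequeued */
};

/* Returns false when the ring is full. */
static bool ring_enqueue(struct ring *r, void *p)
{
	bool ok = false;

	tas_lock_acquire(&r->lock);
	if (r->head - r->tail < RING_SIZE) {
		r->slot[r->head % RING_SIZE] = p;
		r->head++;
		ok = true;
	}
	tas_unlock(&r->lock);
	return ok;
}

/* Returns NULL when the ring is empty. */
static void *ring_dequeue(struct ring *r)
{
	void *p = NULL;

	tas_lock_acquire(&r->lock);
	if (r->head != r->tail) {
		p = r->slot[r->tail % RING_SIZE];
		r->tail++;
	}
	tas_unlock(&r->lock);
	return p;
}
```

The trade-off in this sketch is that the lock takes a single atomic
operation when uncontended but gives no fairness guarantee, so a waiter can
in principle be starved. Whether that actually helps in the 4-CPU case
above would have to be measured.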