[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-id: <553F601C.6030309@samsung.com>
Date: Tue, 28 Apr 2015 12:25:32 +0200
From: Lukasz Skalski <l.skalski@...sung.com>
To: Andy Lutomirski <luto@...capital.net>
Cc: Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Linus Torvalds <torvalds@...ux-foundation.org>,
Andrew Morton <akpm@...ux-foundation.org>,
Arnd Bergmann <arnd@...db.de>,
"Eric W. Biederman" <ebiederm@...ssion.com>,
One Thousand Gnomes <gnomes@...rguk.ukuu.org.uk>,
Tom Gundersen <teg@...m.no>, Jiri Kosina <jkosina@...e.cz>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
Daniel Mack <daniel@...que.org>,
David Herrmann <dh.herrmann@...il.com>,
Djalal Harouni <tixxdz@...ndz.org>
Subject: Re: D-bus is three orders of magnitude too slow (Re: [GIT PULL] kdbus
for 4.1-rc1)
On 04/27/2015 10:08 PM, Andy Lutomirski wrote:
> On Fri, Apr 24, 2015 at 6:50 AM, Lukasz Skalski <l.skalski@...sung.com> wrote:
>> Hi All,
>>
>> On 04/23/2015 07:16 PM, Greg Kroah-Hartman wrote:
>>> On Thu, Apr 23, 2015 at 09:46:22AM -0700, Andy Lutomirski wrote:
>>>> - There's still an open performance question. Namely: is kdbus performant?
>>>
>>> Yes, I thought that was already answered. Tizen posted some numbers
>>> with a much older version of the code, before David fixed a bunch of
>>> issues that he and you found, and that averaged between 25-50% faster.
>>> Details are in this presentation:
>>> http://download.tizen.org/misc/media/conference2014/slides/tdc2014-kdbus-in-tizen3.pdf
>>>
>>> The Tizen and GENIVI developers are off running numbers with the latest
>>> code, or so they told me through emails, but I don't know when/if that
>>> will ever happen, so I can't promise more than what is already here.
>>>
>>
>> I'm working on kdbus support for GLib ([1],[2]). I saw some questions
>> about kdbus performance, so I've prepared simple benchmark. Because
>> David already has posted some comparison results between kdbus and UDS,
>> I've decided to use my GLib port with native kdbus support (it should
>> be noted, that this port is not finished yet and there are still some
>> places for improvements, thus please do not treat these test results as
>> final).
>>
>> To perform tests I've created two simple apps:
>>
>> - server: http://fpaste.org/215157/
>> - client: http://fpaste.org/215156/
>
> After some fiddling (what's this G_BUS_TYPE_USER thing?
> G_BUS_TYPE_SESSION seems to work), I got this to run.
>
As it was discussed some time ago with GLib developers, we have a plan
to create two new bus types called "user" (G_BUS_TYPE_USER) and
"machine" (G_BUS_TYPE_MACHINE). These ones try to connect to kdbus,
falling back to the old "session" and "system" buses (respectively) in
case of failure. The "session" (G_BUS_TYPE_SESSION) and "system"
(G_BUS_TYPE_SYSTEM) bus types always connect to the dbus socket, as
today.
> Can we please take a big step back from the kernel-vs-userspace debate
> here? Can we try to understand why it's so bloody slow before
> thinking about merging something that might ossify it?
>
> With a slightly improved test (it shows how many calls happened) and
> userspace dbus, I got:
>
> $ taskset -c 0 dbus-launch bash
> $ perf stat ./test.sh
> HANDLER: bus_acquired_handler
> HANDLER: name_acquired_handler
> 20000 calls in 4.978810 s = 0.248940 ms per call
>
> Performance counter stats for './test.sh':
>
> 3866.770672 task-clock (msec) # 0.738 CPUs utilized
> 300,633 context-switches # 0.078 M/sec
> 3 cpu-migrations # 0.001 K/sec
> 817 page-faults # 0.211 K/sec
> 13,572,244,134 cycles # 3.510 GHz
> [83.26%]
> 8,799,771,026 stalled-cycles-frontend # 64.84% frontend
> cycles idle [82.75%]
> 6,914,976,524 stalled-cycles-backend # 50.95% backend
> cycles idle [68.92%]
> 9,915,362,980 instructions # 0.73 insns per cycle
> # 0.89 stalled cycles
> per insn [84.82%]
> 2,343,233,242 branches # 605.992 M/sec
> [82.25%]
> 44,699,493 branch-misses # 1.91% of all
> branches [82.83%]
>
> 5.238823116 seconds time elapsed
>
>
> That's more than 15 context switches per call. Something is severely
> broken. There should be almost exactly four context switches per
> call.
>
> NB: subtract 250ms from the elapsed time. test.sh contains a sleep
> 0.25 to work around a stupid race.
>
> This performance is absolutely terrible. Kdbus should be 100-1000x
> faster, not 2x faster, so kdbus' performance is still absolutely
> terrible.
>
> Sure, we can crank the message size so high that the only thing that
> matters is the per-byte overhead, in which case anything that uses
> memfd will win, but dbus-daemon could pretty easily do that to.
>
> Anyway, here's part of the problem. The client does this for each
> message (I added the nanosleep for this test so I could see what was
> going on):
>
> [pid 29592] eventfd2(0, O_NONBLOCK|O_CLOEXEC) = 6
> [pid 29592] write(6, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid 29592] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid 29592] poll([{fd=6, events=POLLIN}], 1, 25000) = 1 ([{fd=6,
> revents=POLLIN}])
> [pid 29592] poll([{fd=6, events=POLLIN}], 1, 25000) = 1 ([{fd=6,
> revents=POLLIN}])
> [pid 29592] read(6, "\1\0\0\0\0\0\0\0", 16) = 8
> [pid 29592] poll([{fd=6, events=POLLIN}], 1, 25000 <unfinished ...>
> [pid 29593] <... poll resumed> ) = 1 ([{fd=5, revents=POLLIN}])
> [pid 29593] read(5, "\1\0\0\0\0\0\0\0", 16) = 8
> [pid 29593] sendmsg(4, {msg_name(0)=NULL,
> msg_iov(1)=[{"l\1\0\1\5\4\0\0=\4\0\0e\0\0\0\10\1g\0\1s\0\0\1\1o\0\r\0\0\0"...,
> 1149}], msg_controllen=0, msg_flags=0}, MSG_NOSIGNAL) = 1149
> [pid 29593] poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}], 2,
> 4294967295) = 1 ([{fd=4, revents=POLLIN}])
> [pid 29593] read(5, 0x7f9115edfcf0, 16) = -1 EAGAIN (Resource
> temporarily unavailable)
> [pid 29593] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid 29593] recvmsg(4, {msg_name(0)=NULL,
> msg_iov(1)=[{"l\2\1\1\7\0\0\0>\4\0\0-\0\0\0", 16}], msg_controllen=0,
> msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
> [pid 29593] poll([{fd=4, events=POLLIN}], 1, 0) = 1 ([{fd=4, revents=POLLIN}])
> [pid 29593] recvmsg(4, {msg_name(0)=NULL,
> msg_iov(1)=[{"\10\1g\0\1s\0\0\5\1u\0=\4\0\0\6\1s\0\4\0\0\0:1.1\0\0\0\0"...,
> 55}], msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC)
> = 55
> [pid 29593] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid 29593] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid 29593] poll([{fd=5, events=POLLIN}], 1, 0) = 1 ([{fd=5, revents=POLLIN}])
> [pid 29593] write(6, "\1\0\0\0\0\0\0\0", 8 <unfinished ...>
> [pid 29592] <... poll resumed> ) = 1 ([{fd=6, revents=POLLIN}])
> [pid 29592] futex(0x1f5c130, FUTEX_WAIT, 2, NULL <unfinished ...>
> [pid 29593] <... write resumed> ) = 8
> [pid 29593] futex(0x1f5c130, FUTEX_WAKE, 1 <unfinished ...>
> [pid 29592] <... futex resumed> ) = 0
> [pid 29592] read(6, "\1\0\0\0\0\0\0\0", 16) = 8
> [pid 29592] futex(0x1f5c130, FUTEX_WAKE, 1) = 0
> [pid 29592] write(6, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid 29592] futex(0x1f5c140, FUTEX_WAKE, 2147483647) = 0
> [pid 29592] nanosleep({0, 100000}, <unfinished ...>
> [pid 29593] <... futex resumed> ) = 1
> [pid 29593] close(6) = 0
> [pid 29593] poll([{fd=4, events=POLLIN}], 1, 0) = 0 (Timeout)
> [pid 29593] write(5, "\1\0\0\0\0\0\0\0", 8) = 8
> [pid 29593] poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}], 2,
> 4294967295) = 1 ([{fd=5, revents=POLLIN}])
> [pid 29593] read(5, "\4\0\0\0\0\0\0\0", 16) = 8
> [pid 29593] poll([{fd=5, events=POLLIN}, {fd=4, events=POLLIN}], 2,
> 4294967295 <unfinished ...>
> [pid 29592] <... nanosleep resumed> NULL) = 0
>
> Hmm. So you're closing an fd for each message, thereby forcing an RCU
> cleanup, and you're doing some kind of inter-thread communication
> using several poll calls and a few futex wakes.
>
> Can someone who likes kdbus please benchmark kdbus against something
> that doesn't go out of its way to be inefficient? I don't even want
> to review kernel code that has as a major claim to fame the ability to
> beat this mess by a mere factor of two.
>
> For comparison, I have some code that uses Thrift, which is dog-slow
> [1], to serialize a message, send it, and deserialize it on the other
> end. The entire process takes 28 microseconds on average. That would
> be almost exactly three orders of magnitude faster. Of course, I'm
> not comparing apples to apples here, but the species of fruit being
> compared does not justify a three order of magnitude difference.
>
> [1] For all I know, the glib serialization stuff is also dog-slow, or
> perhaps snail-slow. Nonetheless, Thrift, or at least the version I'm
> using, is not a shining example of performance. If you want
> serialization performance, use Cap'n Proto.
>
> --Andy
>
--
Lukasz Skalski
Samsung R&D Institute Poland
Samsung Electronics
l.skalski@...sung.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists