Message-ID: <20150701000358.GA32283@molukki>
Date:	Wed, 1 Jul 2015 03:03:58 +0300
From:	"Kalle A. Sandstrom" <ksandstr@....fi>
To:	linux-kernel@...r.kernel.org
Subject: Re: kdbus: to merge or not to merge?


[delurk; apparently kdbus is not receiving the architectural review it should.
i've got quite a bit of knowledge on message-passing mechanisms in general, and
kernel IPC in particular, so i'll weigh in uninvited. apologies for length.

as my "proper" review on this topic is still under construction, i'll try (and
fail) to be brief here. i started down that road only to realize that kdbus is
quite the ball of mud even if the only thing under the scope is its interface,
and that if i held off until properly ready i'd risk kdbus having already been
merged, making review moot.]


Ingo Molnar wrote:

>- I've been closely monitoring Linux kernel changes for over 20 years, and for the
>  last 10 years the linux/ipc/* code has been dormant: it works and was kept good
>  for existing usecases, but no-one was maintaining and enhancing it with the
>  future in mind.

It's my understanding that linux/ipc/* contains only SysV IPC -- shm, sem, and
message queues -- plus the POSIX message queue implementation. There are other
IPC-implementing
things in the kernel also, such as unix domain sockets, pipes, shared memory
via mmap(), signals, mappings that appear shared across fork(), and whatever
else provides either kernel-mediated multi-client buffer access or some
combination of shared memory and synchronization that lets userspace exchange
hot data across the address space boundary.

It's also my understanding that no-one in their right mind would call SysV IPC
state-of-the-art even at the level of interface; indeed its presence in the
hoariest of vendor unixes suggests it's not supposed to be even close.

However, the suggested replacement in kdbus replicates the worst[-1] of all
known user-to-user IPC mechanisms, i.e. Mach. I'm not suggesting that Linux
adopt e.g. a different microkernel IPC mechanism-- those are by and large
inapplicable to a monolithic kernel for reasons of ABI (and, well, why would
you do IPC when function calls are zomgfast already?)-- but rather, that the
existing ones either are good enough at this time or can be reworked to become
near-equivalent to the state of the art in terms of performance.


>  So there exists a technical vacuum: the kernel does not have any good, modern
>  IPC ABI at the moment that distros can rely on as a 'golden standard'. This is
>  partly technical, partly political. The technical reason is that SysV IPC is
>  ancient and cumbersome. The political reason is that SystemD could be using
>  and extending Android's existing kernel accelerated IPC subsystem (Binder)
>  that is already upstream - but does not.

I'll contend that the reason for this vacuum is that the existing kernel IPC
interfaces are fine to the point that other mechanisms may be derived from
them solely in user-space without significant performance demerit, and without
pushing ca. 10k SLOC of IPC broker and policy engine into kernel space.

Furthermore, it's my well-ruminated opinion that implementations of the
userspace ABI specified in the kdbus 4.1-rc1 version (of April this year) will
necessarily be slower than existing IPC primitives in terms of both
throughput and latency; and that the latter are directly applicable to
constructing a more convenient user-space IPC broker that implements what
kdbus seeks to provide: naming, broadcast, unidirectional signaling,
bidirectional "method calls", and a policy mechanism.

In addition I'll argue that as currently specified, the kdbus interface-- even
if tuned to its utmost-- is not only necessarily inferior to e.g. a well-tuned
version of unix domain sockets, but also fundamentally flawed in ways that
prohibit construction of robust in-system distributed programs by kdbus'
mechanisms alone (i.e. byzantine call-site workarounds notwithstanding).


For the first, compare unix domain sockets (i.e. point-to-point mode, access
control through filesystem [or fork() parentage], read/write/select) to the
kdbus message-sending ioctl. In the main data-exchanging portion, the former
requires only a connection identifier, a pointer to a buffer, and the length
of data in that buffer. By contrast, kdbus takes a complex message-sending
command structure with 0..n items of m kinds, which the ioctl must parse in an
m-way switching loop, and then another complex message-describing structure
which has its own 1..n items of another m kinds describing its contents,
destination-lookup options, negotiation of supported options, and so forth.
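
For concreteness, here's the socket side of that comparison as a minimal,
self-contained sketch; the whole per-message interface the kernel must
interpret is the (fd, pointer, length) triple. (No kdbus side is shown, since
its nested command/message item structures are exactly the complexity at
issue.)

  #include <stdio.h>
  #include <sys/socket.h>
  #include <unistd.h>

  int main(void)
  {
          int sv[2];
          char rbuf[64];
          ssize_t n;

          if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
                  return 1;

          /* sender: the kernel sees an fd, a pointer, and a length */
          write(sv[0], "hello", 5);

          /* receiver: likewise one fd, one buffer, one length */
          n = read(sv[1], rbuf, sizeof rbuf);
          printf("%.*s\n", (int)n, rbuf);
          return 0;
  }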

Consequently, a carefully optimized implementation of unix domain sockets (and
by extension all the data-carrying SysV etc. IPC primitives, optimized
similarly) will always be superior to kdbus for both message throughput and
latency, for the reason of kdbus' comparatively great interface complexity
alone.

There's an obvious caveat here, i.e. "well where is it, then?". Given the
overhead dictated by its interface, kdbus' performance is already inferior for
short messages. For long messages (> L1 cache size per Stetson-Harrison[0]) the
only performance benefit from kdbus is its claimed single-copy mode of
operation-- an equivalent to which could be had with ye olde sockets by copying
data from the writer directly into the reader while one of them blocks[1] in
the appropriate syscall. That the current Linux pipes, SysV queues, unix domain
sockets, etc. don't do this doesn't really factor in.
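
(As an aside, the existing splice machinery already gets within one copy of
that rendezvous on pipes today. A minimal sketch -- an approximation with
current syscalls, not the kernel-side rework suggested above: the writer maps
its pages into the pipe without copying, and the payload is copied exactly
once, on the reader's side.)

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/uio.h>
  #include <unistd.h>

  int main(void)
  {
          static char big[4096];
          char rbuf[4096];
          int p[2];
          struct iovec iov = { .iov_base = big, .iov_len = sizeof big };
          ssize_t n;

          if (pipe(p) < 0)
                  return 1;
          memset(big, 'x', sizeof big);

          /* writer: pages are mapped into the pipe, not copied; the
           * buffer must not be reused until the reader has consumed it */
          if (vmsplice(p[1], &iov, 1, 0) < 0)
                  return 1;

          /* reader: the one and only copy of the payload happens here */
          n = read(p[0], rbuf, sizeof rbuf);
          printf("moved %zd bytes with a single copy\n", n);
          return 0;
  }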


For the second, kdbus is fundamentally designed to buffer message data, up to
a fixed limit, in the pool associated with receivers' connections. I cannot
overstate the degree of this _outright architectural blunder_, so I'll put an
extra paragraph break here just for emphasis' sake.

A consequence of this buffering is that whenever a client sends a message with
kdbus, it must be prepared to handle an out-of-space non-delivery status.
(kdbus has two of those, one for queue length and another for buffer space.
why, i have no idea-- would clients behave differently in response to one than
to the other?) There's no option to e.g. overwrite a previous message, or to
discard queued messages oldest-first, instead of rebuffing the sender.
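
To illustrate the missing option, a toy discard-oldest mailbox -- nothing
kdbus-specific, just the queueing policy whose absence forces the error path
onto every sender:

  #include <stdio.h>

  /* when the queue is full, the oldest message is dropped and the
   * sender is never rebuffed with an out-of-space error */
  #define SLOTS 4

  struct mailbox {
          int q[SLOTS];           /* message payloads, for illustration */
          unsigned head, tail;    /* tail - head == number queued */
  };

  static void mbox_post(struct mailbox *mb, int msg)
  {
          if (mb->tail - mb->head == SLOTS)
                  mb->head++;     /* discard oldest instead of failing send */
          mb->q[mb->tail++ % SLOTS] = msg;
  }

  int main(void)
  {
          struct mailbox mb = { { 0 }, 0, 0 };
          int i;

          for (i = 0; i < 6; i++)         /* overfill: 0 and 1 get dropped */
                  mbox_post(&mb, i);
          while (mb.head != mb.tail)
                  printf("%d\n", mb.q[mb.head++ % SLOTS]);
          return 0;
  }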

For broadcast messaging, a recipient may observe that messages were dropped by
looking at a `dropped_msgs' field delivered (and then reset) as part of the
message reception ioctl. Its value is the number of messages dropped since the
last read, so arguably a client could achieve the equivalent of the condition
never arising by, whenever the value is >0, explicitly resynchronizing with all
signal-senders on its current bus whose protocol it knows. This method could in
principle apply to 1-to-1 unidirectional messaging as well[2].
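
In code, that client-side burden looks something like the following. Only the
`dropped_msgs' field is from the kdbus docs; recv_one() and
resync_with_senders() are hypothetical stand-ins for the reception ioctl and
the protocol-specific recovery:

  #include <stdio.h>

  /* `dropped_msgs' is the one field the kdbus docs actually name;
   * everything else below is a fake stand-in for illustration */
  struct kmsg { unsigned dropped_msgs; int payload; };

  static struct kmsg *recv_one(void)
  {
          /* stand-in for the message reception ioctl */
          static struct kmsg m = { 1, 42 };
          return &m;
  }

  static void resync_with_senders(void)
  {
          /* our view of every signal-sender's state is stale: replay
           * the protocol-level state queries we know how to make */
          puts("resync: replaying state queries");
  }

  int main(void)
  {
          struct kmsg *m = recv_one();

          if (m->dropped_msgs > 0)
                  resync_with_senders();
          printf("dispatch payload %d\n", m->payload);
          return 0;
  }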

Looking at the kdbus "send message, wait for tagged reply" feature in
conjunction with these details appears to reveal two holes in its state graph.
The first is that if replies are delivered through the requestor's buffer,
concurrent sends into that same buffer may cause it to become full (or the
queue to grow too long, whichever) before the service gets a chance to reply.
If this condition causes a reply to fall out of the IPC flow, the requestor
will hang until either its specified timeout expires or it gets interrupted by
a signal.
If replies are delivered outside the shm pool, the requestor must be prepared
to pick them up using a different means from the "in your pool w/ offset X,
length Y" format the main-line kdbus interface provides. [i've seen no such
thing in the kdbus docs so far.]

As far as alternative solutions go, preallocation of space for a reply message
is an incomplete fix unless every reply's size has a known upper bound (e.g.
with use of an IDL compiler); in this scheme it'd be necessary for the
requestor to specify this bound, suffer the consequences if the number is too
low, and be prepared to handle a "not enough buffer space for a reply"
condition at send time.
The kdbus docs specify no such condition.
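
For shape's sake, the scheme would look something like this;
send_with_reply_reservation() and its -ENOBUFS result are invented for
illustration -- as said, the kdbus docs specify no such call or condition:

  #include <errno.h>
  #include <stdio.h>

  /* everything here is hypothetical; it only gives the scheme a
   * concrete shape */
  #define MAX_REPLY_LEN 256       /* upper bound, e.g. from an IDL compiler */

  static int send_with_reply_reservation(const void *req, int len, int bound)
  {
          (void)req; (void)len; (void)bound;
          return -ENOBUFS;        /* pretend the pool couldn't take it */
  }

  int main(void)
  {
          int r = send_with_reply_reservation("req", 3, MAX_REPLY_LEN);

          if (r == -ENOBUFS) {
                  /* the requestor fails at send time, instead of hanging
                   * in the receive path after the service has done work */
                  fputs("no space reserved for the reply\n", stderr);
                  return 1;
          }
          return 0;
  }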

The second problem is that given how there can be a timeout or interrupt on the
receive side of a "method call" transaction, it's possible for the requestor to
bow out of the IPC flow _while the service is processing its request_. This
results either in the reply message being lost, or in its ending up in the
requestor's buffer, to appear later in a receive loop where it may not be
expected. Either
way, the client must at that point resynchronize wrt all objects related to the
request's side effects, or abandon the IPC flow entirely and start over.
(services need only confirm delivery of their replies before committing e.g. a
chardev-like "destructively read N bytes from buffer" operation's outcome,
which is slightly less ugly.)


Tying this back into the first point: to prevent this type of denial-of-service
against sanguinely-written software, it's necessary for kdbus to invoke the
policy engine to determine that an unrelated participant isn't allowed to
consume a peer's buffer space. As this operation is absent in unix-domain
sockets, an ideal implementation of kdbus 4.1-rc1 will be slower in
point-to-point communication even if the particulars of its message-descriptor
format get reworked to a light-weight alternative. In addition, its API ends up
requiring highly involved state-tracking wrappers or inversion-of-control
machinery in its clients, to the point where just using unix domain sockets
with a heavyweight user-space broker would be nicer.


It's my opinionated conclusion that merging kdbus as-is would be the sort of
cock-up which we'll look back at, point a finger, giggle a bit, and wonder only
half-jokingly if there was something besides horse bones in that glue. Its docs
betray an absence of careful analysis, and the spec of its interface is so
loose as to make programs written for kdbus 4.1-rc1 subtly incompatible with
any later implementation, through deeply-baked design consequences stemming
from quirks of the current one.

I'm not a Linux kernel developer. But if I were, this would be where I'd put
my NAK.


Sincerely,
  -KS

[-1] author's opinion
[0] no bunny rabbits were harmed
[1] the case where both use non-blocking I/O requires either a buffer or
    support from the scheduler. the former is no optimization at all, and the
    latter may be _quite involved indeed_.
[2] as for whether freedesktop.org programs will be designed and built to such
    a standard, i suspend judgement.