Date:	Tue, 7 Jul 2015 00:18:14 +0300
From:	"Kalle A. Sandstrom" <ksandstr@....fi>
To:	David Herrmann <dh.herrmann@...il.com>
Cc:	linux-kernel <linux-kernel@...r.kernel.org>
Subject: Re: kdbus: to merge or not to merge?

On Wed, Jul 01, 2015 at 06:51:41PM +0200, David Herrmann wrote:
> Hi
> 

Thanks for the answers; in response I've got some further questions. Again,
apologies for length -- I apparently don't know how to discuss IPC tersely.

> On Wed, Jul 1, 2015 at 2:03 AM, Kalle A. Sandstrom <ksandstr@....fi> wrote:
> > For the first, compare unix domain sockets (i.e. point-to-point mode, access
> > control through filesystem [or fork() parentage], read/write/select) to the
> > kdbus message-sending ioctl. In the main data-exchanging portion, the former
> > requires only a connection identifier, a pointer to a buffer, and the length
> > of data in that buffer. To contrast, kdbus takes a complex message-sending
> > command structure with 0..n items of m kinds that the ioctl must parse in a
> > m-way switching loop, and then another complex message-describing structure
> > which has its own 1..n items of another m kinds describing its contents,
> > destination-lookup options, negotiation of supported options, and so forth.
> 
> sendmsg(2) uses a very similar payload to kdbus. send(2) is a shortcut
> to simplify the most common use-case. I'd be more than glad to accept
> patches adding such shortcuts to kdbus, if accompanied by benchmark
> numbers and reasoning why this is a common path for dbus/etc. clients.
> 

A shortcut special case (e.g. only iovec-like payload items, only to a
numerically designated peer, and only RPC forms) should be an immediate
gain, since the reduced functionality would lower the number of
instructions executed, of unpredictable branches encountered, and of
possibly-cold cache lines accessed.
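
For concreteness, a minimal sketch of the argument block such a shortcut
might take (purely hypothetical; the struct and its fields are mine, not
anything in the kdbus ABI):

  #include <stdint.h>
  #include <sys/uio.h>

  /* Hypothetical reduced send command: numeric destination only, iovec
   * payload only, optional synchronous reply. */
  struct kdbus_cmd_send_fast {
          uint64_t dst_id;          /* numeric peer id, no name lookup */
          uint64_t cookie;          /* matches the eventual reply */
          uint64_t flags;           /* e.g. a hypothetical SYNC_REPLY bit */
          uint64_t iov_count;
          const struct iovec *iov;  /* plain payload, no item parsing */
          uint64_t reply_offset;    /* out: reply offset in recv pool */
  };

The kernel side would then copy in one fixed-size block and walk a
single iovec array, instead of parsing two levels of variable-length
items.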

The difference in raw cycles should be significant when set against the
handful of kernel exits avoided during a client's RPC to a service and
the associated reply. Assuming that such RPCs are the bulk of what kdbus
will do, and that context-switch avoidance is central to the performance
argument in its design, it seems silly not to have such a fast path,
even if it is initially implemented as a simple wrapper around the full
send ioctl.

It would also put the basic send operation on par with sendmsg(2) over a
connected socket in terms of interface complexity, and simplify any future
"exit into peer without scheduler latency" shenanigans.

However, these gains would go unobserved in code written to the current
kdbus ABI, and bridging from the full interface onto such a fast path
would eliminate most of its benefits while penalizing its legitimate
callers.

That said, considering that the eventual de-facto user API to kdbus is a
library with explicit deserialization, endianness conversion, and the
like, I could see how the difference would get lost in the noise anyway.


> The kdbus API is kept generic and extendable, while trying to keep
> runtime overhead minimal. If this overhead turns out to be a
> significant runtime slowdown (which none of my benchmarks showed), we
> should consider adding shortcuts. Until then, I prefer an API that is
> consistent, easy to extend and flexible.
> 

Out of curiosity, what payload item types do you see being added in the
near future, say over the next year? UDS knows only simple buffers,
scatter/gather iovecs, and inter-process dup(2); and recent Linux adds
sourcing from a file descriptor. Perhaps a "pass this previously-received
message on" item?
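
For reference, the "inter-process dup(2)" item over UDS is plain
SCM_RIGHTS ancillary data on sendmsg(2); a minimal sketch:

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  /* Pass one file descriptor over a connected unix domain socket. */
  static int send_fd(int sock, int fd)
  {
          char dummy = 0;
          struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
          union {
                  struct cmsghdr align;
                  char buf[CMSG_SPACE(sizeof(int))];
          } u;
          struct msghdr msg = {
                  .msg_iov = &iov, .msg_iovlen = 1,
                  .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
          };
          struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);

          cm->cmsg_level = SOL_SOCKET;
          cm->cmsg_type = SCM_RIGHTS;
          cm->cmsg_len = CMSG_LEN(sizeof(int));
          memcpy(CMSG_DATA(cm), &fd, sizeof(int));
          return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
  }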


> > Consequently, a carefully optimized implementation of unix domain sockets (and
> > by extension all the data-carrying SysV etc. IPC primitives, optimized
> > similarly) will always be superior to kdbus for both message throughput and
> > latency, [...]
> 
> Yes, that's due to the point-to-point nature of UDS.
> 

Does this change for broadcast, unassociated, or doubly-addressed[0]
operation? For the first, kdbus must already cause allocation of cache lines
in proportion to msg_length * n_recvrs, which mutes the broker's single-copy
advantage as the number of receivers grows. For the second, name lookup from
(say) a hash table only adds to required processing, though the resulting
identifier could be re-used immediately afterward; and the third mode would
prohibit that optimization altogether.
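
(Illustrative arithmetic: a 4 KiB signal broadcast to 32 subscribers
already dirties 4096 * 32 = 128 KiB of receiver-pool cache lines,
several times a typical 32 KiB L1d, and that footprint is the same
whether the copies are made by the kernel or by a user-space broker.)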

Relatedly, is there publicly available data on the distribution of the
various dbus IPC modalities? For example, a desktop booting under
systemd, running for a decent while, and shutting down; or the
automotive industry's (presumably signaling-heavy) use cases, for which
I've heard a figure of 600k transactions before steady state quoted.


> > [...] For long messages (> L1 cache size per Stetson-Harrison[0]) the
> > only performance benefit from kdbus is its claimed single-copy mode of
> > operation-- an equivalent to which could be had with ye olde sockets by copying
> > data from the writer directly into the reader while one of them blocks[1] in
> > the appropriate syscall. That the current Linux pipes, SysV queues, unix domain
> > sockets, etc. don't do this doesn't really factor in.
> 
> Parts of the network subsystem have supported single-copy (mmap'ed IO)
> for quite some time. kdbus mandates it, but otherwise is not special
> in that regard.
> 

I'm not intending to discuss mmap() tricks, but rather that, with the
existing system calls, a pending write(2) would be made to substitute
for the in-kernel buffer that the corresponding read(2) grabs its bytes
from, or vice versa. This would make conventional IPC single-copy while
permitting the receiver to designate an arbitrary location for its data:
e.g. an IPC daemon first reading a message header from a sender's
socket, figuring out its routing and allocation, and then receiving the
body directly into the destination shm pool.
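
In sketch form, with a made-up wire header and pool allocator standing
in for the daemon's real bookkeeping:

  #include <stdint.h>
  #include <sys/socket.h>

  /* Hypothetical wire header; the daemon reads this first, picks the
   * destination, then receives the body straight into that pool. */
  struct msg_hdr {
          uint32_t dst;       /* input to the routing decision */
          uint32_t body_len;
  };

  /* made up: finds space in the receiver's mmap()'d pool */
  void *pool_alloc(uint32_t dst, uint32_t len);

  static int route_one(int sock)
  {
          struct msg_hdr hdr;
          void *body;

          if (recv(sock, &hdr, sizeof(hdr), MSG_WAITALL) != sizeof(hdr))
                  return -1;
          body = pool_alloc(hdr.dst, hdr.body_len);
          if (body == NULL)
                  return -1;
          /* With the blocking-peer substitution described above, this
           * recv() would take its bytes directly from the sender's
           * pending write(2) instead of an in-kernel buffer. */
          if (recv(sock, body, hdr.body_len, MSG_WAITALL) != hdr.body_len)
                  return -1;
          return 0;
  }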

That's not directly related to kdbus, except as a hypothetical [transparent]
speed-up for a strictly POSIX user-space reimplementation using the same
mmap()'d shm-pool receive semantics.


[snipped the bit about sender's buffer-full handling]

> > For broadcast messaging, a recipient may observe that messages were dropped by
> > looking at a `dropped_msgs' field delivered (and then reset) as part of the
> > message reception ioctl. Its value is the number of messages dropped since last
> > read, so arguably a client could achieve the equivalent of the condition's
> > absence by resynchronizing explicitly with all signal-senders on its current
> > bus wrt which it knows the protocol, when the value is >0. This method could in
> > principle apply to 1-to-1 unidirectional messaging as well[2].
> 
> Correct.
> 

As a follow-up question, does the synchronous RPC mode also return
`dropped_msgs'? If so, does it reset the counter? Such behaviour would
seem to complicate every RPC call site, and I didn't find it discussed
in the in-tree kdbus documentation.


[case of RPC client timeout/interrupt before service reply]
> If sending a reply fails, the kdbus_reply state is destructed and the
> caller must be woken up. We do that for sync-calls just fine, but the
> async case does indeed lack a wake-up in the error path. I noted this
> down and will fix it.
> 

What's the reply sender's error code in this case, i.e. failure due to
the caller having bowed out? The spec suggests either ENXIO (for a
disappeared client), ECONNRESET (for a deactivated client connection),
or EPIPE (by reflection back to the service from how it's described in
kdbus.message.xml).

(Also, I'm baffled by the difference between ENXIO and ECONNRESET. Is
there a circumstance where a kdbus application would care about the
difference between a peer's never having been there to begin with, and
its connection no longer being active?)
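
From the application's side I'd expect the reply path to collapse them
anyway; a sketch, with the request tracking and cleanup helper made up:

  #include <errno.h>

  struct pending_req;                                /* hypothetical */
  void discard_reply_state(struct pending_req *rq);  /* hypothetical */

  /* How I'd expect a service's reply-send error path to treat the three
   * codes, i.e. identically. */
  static void reply_send_failed(struct pending_req *rq)
  {
          switch (errno) {
          case ENXIO:        /* peer never existed */
          case ECONNRESET:   /* peer's connection no longer active */
          case EPIPE:        /* caller gave up on the reply */
                  discard_reply_state(rq);
                  break;
          default:
                  /* anything else would be a genuine error */
                  break;
          }
  }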


[snipped speculation about out-of-pool reply delivery, which doesn't happen]

> > The second problem is that given how there can be a timeout or interrupt on the
> > receive side of a "method call" transaction, it's possible for the requestor to
> > bow out of the IPC flow _while the service is processing its request_. This
> > results either in the reply message being lost, or its ending up in the
> > requestor's buffer to appear in a loop where it may not be expected. Either
> 
> (for completeness: we properly support resuming interrupted sync-calls)
> 

How does the client do this? A quick grep through the docs didn't show any
hits for "resume".

Moreover, can the client resume an interrupted sync-call equivalently both
before and after the service has picked the request up, including the
service's call to release the shm-pool allocation (and possibly also the
kdbus tracking structures)? That's to say: is there any case where
non-idempotent RPCs, or their replies, would end up duplicated due to
interrupts or timeouts?
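
For what it's worth, the resume I'd naively write is below; the struct
and ioctl names are as I read them in the kdbus tree, but the assumption
that re-issuing the identical ioctl continues the wait (rather than
emitting a duplicate request) is exactly what I'm asking about:

  #include <errno.h>
  #include <sys/ioctl.h>
  #include <linux/kdbus.h>

  static int sync_call_resume(int conn_fd, struct kdbus_cmd_send *cmd)
  {
          int r;

          do
                  r = ioctl(conn_fd, KDBUS_CMD_SEND, cmd);
          while (r < 0 && errno == EINTR);
          return r;
  }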


> > way, the client must at that point resynchronize wrt all objects related to the
> > request's side effects, or abandon the IPC flow entirely and start over.
> > (services need only confirm their replies before effecting e.g. a chardev-like
> > "destructively read N bytes from buffer" operation's outcome, which is slightly
> > less ugly.)
> 
> Correct. If you time-out, or refuse to resume, a sync-call, you have
> to treat this transaction as failed.
> 

More generally speaking, what's the use case for having a timeout? What's a
client to do?

Given that an RPC client can, having timed out, either resume (having
e.g. approximated coöperative multitasking in between? [surely not?]) or
abort (to try again?), I don't understand why this API is there in the
first place. Unless it's (say) a reheating of the poor man's distributed
deadlock detection, but that's a bit of a nasty way of looking at it --
RPC dependencies should be acyclic anyhow.

For example, let's assume a client and two services, one being what the
client calls into and the other being what the first uses to implement
some portion of its interface. The first does a few things and then
calls into the second before replying, and the second's service code
ends up incurring a delay from sleeping on a mutex, waiting on disk or
network I/O, or plain system load. The first service does not specify a
timeout, but the chain-initiating client does, and that timeout ends up
being reached because of the delay in the second service. (Notably, the
timeout will not have occurred due to a deadlock, so it cannot be
resolved by the intermediate chain releasing its mutex'd primitives and
starting over.)

How does the client recover from the timeout? Are intermediary services
required to exhibit composable idempotence? Is there a greater transaction
bracket around a client RPC, so that rollback/commit can happen regardless
of intermediaries?


> > Tying this back into the first point: to prevent this type of denial-of-service
> > against sanguinely-written software it's necessary for kdbus to invoke the
> > policy engine to determine that an unrelated participant isn't allowed to
> > consume a peer's buffer space.
> 
> It's not the policy engine, but quota-handling, but otherwise correct.
> 

I have some further questions on the topic of shm pool semantics. (I'm
trying to figure out what a robust RPC client call-site would look like, as
that is part of kdbus' end-to-end performance.)

Does the client receive a status code if a reply fails due to the quota
mechanism? This, again, I didn't find in the spec.

Is there some way for an RPC peer to know that replies below a certain
size will be delivered regardless of the state of the client's shm pool
at the time of the reply? For example, a per-connection parameter (i.e.
one that's implicitly part of the protocol), or a per-RPC field that a
client may set, so as to achieve reliable operation without "no reply
because buffer full" handling even in the face of concurrent senders.

Does the client receive scattered data if its reception pool has enough
room for a reply message overall, but the largest contiguous piece is
smaller than the reply payload? If so, is there some method by which a
sender could indicate legitimate breaks in the message contents, e.g.
between strings or integers, so that middleware (an IDL compiler,
basically) could wrap that data into a function call's out-parameters
without an extra gather stage [copy] in userspace? If not, must a client
process call into the message-handling part of its main loop (to release
shm-pool space by handling messages) whenever a reply fails for this
reason?
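
To make those questions concrete, this is roughly the call-site shape
I'm trying to pin down; everything here is illustrative, and the
commented branches are exactly the points left open above:

  #include <errno.h>
  #include <sys/ioctl.h>
  #include <linux/kdbus.h>

  int handle_reply(int conn_fd, struct kdbus_cmd_send *cmd);  /* made up */
  void run_main_loop_once(int conn_fd);                       /* made up */

  static int rpc_call(int conn_fd, struct kdbus_cmd_send *cmd)
  {
          for (;;) {
                  if (ioctl(conn_fd, KDBUS_CMD_SEND, cmd) == 0)
                          return handle_reply(conn_fd, cmd);
                  if (errno == EINTR)
                          continue;    /* resume, assuming that's safe */
                  if (errno == ENOBUFS) {
                          /* guessed code for "reply didn't fit"; must the
                           * main loop run here to release pool space, and
                           * is the reply then still recoverable? */
                          run_main_loop_once(conn_fd);
                          continue;
                  }
                  return -errno;       /* ENXIO, ECONNRESET, EPIPE, ... */
          }
  }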


Interestedly,
 -KS

[0] I'm referring to the part where a send operation (or the message
    itself) may specify both a numeric recipient ID and a name, in which
    case kdbus either matches the two against each other or rejects the
    message.