Message-ID: <mhng-97fc5874-29d0-4d9e-8c92-d3704a482f28@palmerdabbelt-glaptop1>
Date: Mon, 07 Dec 2020 10:55:57 -0800 (PST)
From: Palmer Dabbelt <palmer@...belt.com>
To: Christoph Hellwig <hch@...radead.org>
CC: dm-devel@...hat.com, agk@...hat.com, snitzer@...hat.com,
corbet@....net, song@...nel.org, shuah@...nel.org,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-raid@...r.kernel.org, linux-kselftest@...r.kernel.org,
kernel-team@...roid.com
Subject: Re: [PATCH v1 0/5] dm: dm-user: New target that proxies BIOs to userspace
On Fri, 04 Dec 2020 02:33:36 PST (-0800), Christoph Hellwig wrote:
> What is the advantage over simply using nbd?
There's a short bit about that in the cover letter (and in some talks), but
I'll expand on it here -- I suppose my most important question is "is this
interesting enough to take upstream?", so there should be at least a bit of a
description of what it actually enables:
I don't think there are any deep fundamental advantages to doing this as opposed
to nbd/iscsi over localhost/unix (or by just writing a kernel implementation,
for that matter), at least in terms of anything that was previously impossible
now becoming possible. There are a handful of things that are easier and/or
faster, though.
dm-user looks a lot like NBD without the networking. The major difference is
which side initiates messages: in NBD the kernel initiates messages, while in
dm-user userspace initiates messages (via a read that will block if there is no
message, but presumably we'd want to add support for non-blocking userspace
implementations eventually). The NBD approach certainly makes sense for a
networked system, as one generally wants a single storage server handling
multiple clients, but inverting that makes some things simpler in dm-user.
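To make that concrete, the simplest possible daemon is just a loop around a
blocking read and a write on the target's character device. The sketch below
is illustrative only: error handling is elided and the message struct here is
a simplification of whatever the patch's uapi header actually defines.

    #include <stdint.h>
    #include <unistd.h>

    /* Simplified stand-in for the patch's real message layout. */
    struct msg {
            uint64_t seq;           /* echoed back so the kernel can match the reply */
            uint64_t type;          /* read/write request, or success/failure reply */
            uint64_t flags;
            uint64_t sector;
            uint64_t len;
            uint8_t  buf[];         /* write payload on requests, read payload on replies */
    };

    static void serve(int fd)       /* fd: the target's character device, already open */
    {
            static char raw[sizeof(struct msg) + (1 << 20)];
            struct msg *m = (struct msg *)raw;

            for (;;) {
                    /* Blocks until the kernel has a BIO for us to service. */
                    if (read(fd, raw, sizeof(raw)) < 0)
                            break;

                    /* ... service m->sector/m->len against the backing store,
                     * flip m->type to the matching reply type ... */

                    /* Same seq goes back; for reads the payload follows the header. */
                    write(fd, m, sizeof(*m) /* + payload length for reads */);
            }
    }

That's the whole transport: no sockets, no handshake, just a request read and a
reply written back on the same descriptor.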
One specific advantage of this change is that a dm-user target can be
transitioned from one daemon to another without any IO errors: just spin up the
second daemon, signal the first to stop requesting new messages, and let it
exit. We're using that mechanism to replace the daemon launched by early init
(which runs before the security subsystem is up, as in our use case dm-user
provides the root filesystem) with one that's properly sandboxed (which can
only be launched after the root filesystem has come up). There are ways around
this (replacing the DM table, for example), but they don't fit as cleanly.
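Hypothetically, building on the loop above, the handoff on the outgoing
daemon's side is just a flag check in front of the blocking read:

    #include <signal.h>
    #include <unistd.h>

    static volatile sig_atomic_t stop;

    static void on_term(int sig) { stop = 1; }

    static void serve_until_replaced(int fd, char *raw, size_t raw_len)
    {
            /* No SA_RESTART: a read that's blocked when SIGTERM lands returns EINTR. */
            struct sigaction sa = { .sa_handler = on_term };

            sigaction(SIGTERM, &sa, NULL);

            while (!stop) {
                    if (read(fd, raw, raw_len) < 0)
                            continue;       /* EINTR comes back around to the stop check */
                    /* ... service the message and write the reply, as above ... */
            }
            /*
             * Just return and exit here: the replacement daemon is already
             * blocked in read() on the same target, so it picks up the next
             * BIO and nothing ever sees an IO error.
             */
    }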
Unless I'm missing something, NBD servers aren't capable of that style of
transition: soft disconnects can only be initiated by the client (the kernel,
in this case), which leaves no way for the server to transition while
guaranteeing that no IOs error out. It's usually possible to shoehorn this
sort of direction-reversing concept into network protocols, but it's also
usually ugly (I'm thinking of IMAP IDLE, for example). I didn't try to actually
do it, but my guess would be that adding a way for the server to ask the client
to stop sending messages until a new server shows up would be at least as much
work as doing this.
There are also a handful of possible performance advantages, but I haven't gone
through the work to prove any of them out yet as performance isn't all that
important for our first use case. For example:
* Cutting out the network stack is unlikely to hurt performance. I'm not sure
if it will help performance, though. I think if we really had a workload where
the extra copy was likely to be an issue we'd want an explicit ring buffer,
but I have a theory that it would be possible to get very good performance out
of a stream-style API by using multiple channels and relying on io_uring to
plumb through multiple ops per channel (a rough sketch of that idea follows
this list).
* There's a comment in the implementation about allowing userspace to insert
itself into user_map(), likely by uploading a BPF fragment. There's a whole
class of interesting block devices that could be written in this fashion:
essentially you keep a cache on a regular block device and handle the common
case by remapping BIOs and passing them along, leaving the more complicated
logic (fetching cache misses, and watching some subset of the access stream
where necessary) to userspace. A rough sketch of that fast path follows the
Android example below.
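Here's the multi-channel/io_uring idea sketched out, purely speculatively:
multi-channel support doesn't exist in the patch, so the per-channel device
nodes below are made up, and liburing is only doing the obvious thing of
keeping one read outstanding per channel.

    /*
     * Speculative sketch only.  The point is just that io_uring lets one
     * thread keep a read in flight on every channel, so the stream-style
     * API doesn't serialize on a single blocking syscall.
     */
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>

    #define NR_CHANNELS 4
    #define MSG_MAX (1 << 20)

    int main(void)
    {
            static char bufs[NR_CHANNELS][MSG_MAX];
            int fds[NR_CHANNELS];
            struct io_uring ring;

            io_uring_queue_init(2 * NR_CHANNELS, &ring, 0);

            for (int i = 0; i < NR_CHANNELS; i++) {
                    char path[64];

                    /* Hypothetical per-channel nodes for one target. */
                    snprintf(path, sizeof(path), "/dev/dm-user/example%d", i);
                    fds[i] = open(path, O_RDWR);

                    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
                    io_uring_prep_read(sqe, fds[i], bufs[i], MSG_MAX, 0);
                    io_uring_sqe_set_data(sqe, (void *)(long)i);
            }
            io_uring_submit(&ring);

            for (;;) {
                    struct io_uring_cqe *cqe;

                    io_uring_wait_cqe(&ring, &cqe);
                    int chan = (int)(long)io_uring_cqe_get_data(cqe);
                    io_uring_cqe_seen(&ring, cqe);

                    /* ... service the message in bufs[chan] and write the
                     * reply (that write could be an SQE too) ... */

                    /* Re-arm the read on this channel. */
                    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
                    io_uring_prep_read(sqe, fds[chan], bufs[chan], MSG_MAX, 0);
                    io_uring_sqe_set_data(sqe, (void *)(long)chan);
                    io_uring_submit(&ring);
            }
    }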
We have a use case for that kind of remapping cache in Android, where we
opportunistically store backups in a portion of the TRIM'd space on devices.
It's currently implemented entirely in the kernel by the dm-bow target, but
IIUC that was deemed too Android-specific to merge. Assuming we could get good
enough performance we could move that logic to userspace, which would let us
shrink our diff against upstream. It feels like some other interesting block
devices could be written in a similar fashion.
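And here's the fast path from the second bullet made concrete. Nothing below
is in the patch; it's just the shape of the lookup that would want to run as
early as possible (ideally in user_map() itself via something like an uploaded
BPF fragment), with everything that misses falling through to the daemon.

    /*
     * Hypothetical fast path for a caching/remapping target: a hit just
     * rewrites the sector and forwards the request to the cache device; a
     * miss falls back to the daemon's slow path (fetching the data from
     * wherever the authoritative copy lives).
     */
    #include <stddef.h>
    #include <stdint.h>

    struct extent {
            uint64_t virt_sector;   /* sector as the dm-user target sees it */
            uint64_t cache_sector;  /* where that range lives on the cache device */
            uint64_t len;           /* length in sectors */
    };

    /* Returns the remapped sector on a hit, or -1 on a miss. */
    static int64_t remap(const struct extent *map, size_t n, uint64_t sector)
    {
            for (size_t i = 0; i < n; i++) {
                    if (sector >= map[i].virt_sector &&
                        sector <  map[i].virt_sector + map[i].len)
                            return map[i].cache_sector +
                                   (sector - map[i].virt_sector);
            }
            return -1;      /* miss: let the daemon do the complicated part */
    }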
All in all, I've found it a bit hard to figure out what sort of interest people
have in dm-user: when I bring this up I seem to run into people who've done
similar things before and are vaguely interested, but certainly nobody is
chomping at the bit. I'm sending it out in this early state to try and figure
out if it's interesting enough to keep going.