Message-ID: <CAJfpegvQefgKOKMWC8qGTDAY=qRmxPvWkg2QKzNUiag1+q5L+Q@mail.gmail.com>
Date: Wed, 22 May 2024 10:58:01 +0200
From: Miklos Szeredi <miklos@...redi.hu>
To: John Groves <John@...ves.net>
Cc: Amir Goldstein <amir73il@...il.com>, John Groves <jgroves@...ron.com>,
Jonathan Corbet <corbet@....net>, Dan Williams <dan.j.williams@...el.com>,
Vishal Verma <vishal.l.verma@...el.com>, Dave Jiang <dave.jiang@...el.com>,
Alexander Viro <viro@...iv.linux.org.uk>, Christian Brauner <brauner@...nel.org>, Jan Kara <jack@...e.cz>,
Matthew Wilcox <willy@...radead.org>, linux-cxl@...r.kernel.org,
linux-fsdevel@...r.kernel.org, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, nvdimm@...ts.linux.dev, john@...alactic.com,
Dave Chinner <david@...morbit.com>, Christoph Hellwig <hch@...radead.org>, dave.hansen@...ux.intel.com,
gregory.price@...verge.com, Vivek Goyal <vgoyal@...hat.com>
Subject: Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system
On Wed, 22 May 2024 at 04:05, John Groves <John@...ves.net> wrote:
> I'm happy to help with that if you care - ping me if so; getting a VM running
> in EFI mode is not necessary if you reserve the dax memory via memmap=, or
> via libvirt xml.
Could you please give an example?
I use a raw qemu command line with a -kernel option and a root fs
image (not a disk image with a bootloader).
> More generally, a famfs file extent is [daxdev, offset, len]; there may
> be multiple extents per file, and in the future this definitely needs to
> generalize to multiple daxdev's.
>
> Disclaimer: I'm still coming up to speed on fuse (slowly and ignorantly,
> I think)...
>
> A single backing device (daxdev) will contain extents of many famfs
> files (plus metadata - currently a superblock and a log). I'm not sure
> it's realistic to have a backing daxdev "open" per famfs file.
That's exactly what I was saying.

The passthrough interface was deliberately done in a way to separate
the mapping into two steps:

 1) registering the backing file (which could be a device)
 2) mapping from a fuse file to a registered backing file

Step 1 can happen at any time, while step 2 currently happens at open,
but for various other purposes like metadata passthrough it makes
sense to allow the mapping to happen at lookup time and be cached for
the lifetime of the inode.
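
To make the two-step split concrete, here is a minimal userspace model in
C. It is only a sketch and not the kernel's passthrough code: the names
backing_table, model_inode, backing_register() and inode_map_backing()
are hypothetical, and the table size is arbitrary. It shows one
registered backing device being shared by several inodes, with the
mapping cached on the inode once it has been set.

```c
#include <stddef.h>

#define MAX_BACKING 8            /* hypothetical table size */

/* Hypothetical registry slot; not the kernel's real data structure. */
struct backing_entry {
    int in_use;
    int fd;                      /* fd of the registered backing file or device */
};

struct backing_table {
    struct backing_entry slots[MAX_BACKING];
};

/* Hypothetical per-inode state: 0 means "no backing mapped yet". */
struct model_inode {
    int backing_id;
};

/* Step 1: register a backing file. Returns a backing id (> 0) or -1 if full. */
static int backing_register(struct backing_table *t, int fd)
{
    for (int i = 0; i < MAX_BACKING; i++) {
        if (!t->slots[i].in_use) {
            t->slots[i].in_use = 1;
            t->slots[i].fd = fd;
            return i + 1;        /* id 0 is reserved for "not mapped" */
        }
    }
    return -1;
}

/*
 * Step 2: map an inode to a registered backing id. Once set, the mapping
 * is kept for the lifetime of the inode, so later calls return the cached
 * id and ignore the argument. Returns the effective id, or -1 if the
 * requested id was never registered.
 */
static int inode_map_backing(const struct backing_table *t,
                             struct model_inode *ino, int backing_id)
{
    if (ino->backing_id != 0)
        return ino->backing_id;
    if (backing_id < 1 || backing_id > MAX_BACKING ||
        !t->slots[backing_id - 1].in_use)
        return -1;
    ino->backing_id = backing_id;
    return backing_id;
}
```

In this model a single backing_register() call for the daxdev is enough;
every famfs file then only needs the cheap second step.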
> In addition there is:
>
> - struct dax_holder_operations - to allow a notify_failure() upcall
> from dax. This provides the critical capability to shut down famfs
> if there are memory errors. This is filesystem- (or technically daxdev-
> wide)
This can be hooked into fuse_is_bad().
> - The pmem or devdax iomap_ops - to allow the fsdax file system (famfs,
> and [soon] famfs_fuse) to call dax_iomap_rw() and dax_iomap_fault().
> I strongly suspect that famfs_fuse can't be correct unless it uses
> this path rather than just the idea of a single backing file.
Agreed.
> - the dev_dax_iomap portion of the famfs patchsets adds iomap_ops to
> character devdax.
You'll need to channel those patches through the respective
maintainers, preferably before the fuse parts are merged.
> - Note that dax devices, unlike files, don't support read/write - only
> mmap(). I suspect (though I'm still pretty ignorant) that this means
> we can't just treat the dax device as an extent-based backing file.
Doesn't matter, it'll use the iomap infrastructure instead of the
passthrough infrastructure.

But the interfaces for regular passthrough and fsdax could be shared.
Conceptually they are very similar: there's a backing store indexable
with byte offsets.

What's currently missing from the API is an extent list in
fuse_open_out. The format could be:

  [ {backing_id, offset, length}, ... ]

allowing each extent to map to a different backing device.
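
As an illustration of how such a list could be consumed, the following C
sketch resolves a file offset to a backing device and a device offset.
This is not an existing fuse structure: model_extent, model_target and
extent_resolve() are hypothetical names, and the sketch assumes that
extents are laid out back to back in file order, since the proposed
tuple carries no explicit file offset.

```c
#include <stddef.h>
#include <stdint.h>

/* One entry of the proposed list; field widths are an assumption. */
struct model_extent {
    uint32_t backing_id;   /* which registered backing device */
    uint64_t offset;       /* byte offset within that device */
    uint64_t length;       /* extent length in bytes */
};

/* Result of resolving a file offset. */
struct model_target {
    uint32_t backing_id;
    uint64_t dev_offset;   /* byte offset within the backing device */
    uint64_t remaining;    /* bytes left in this extent from dev_offset */
};

/*
 * Walk the extent list, treating extents as consecutive ranges of the
 * file. Returns 0 and fills *out on success, or -1 if file_off lies
 * beyond the last extent.
 */
static int extent_resolve(const struct model_extent *ext, size_t n,
                          uint64_t file_off, struct model_target *out)
{
    uint64_t start = 0;    /* file offset where ext[i] begins */

    for (size_t i = 0; i < n; i++) {
        uint64_t delta = file_off - start;   /* file_off >= start here */

        if (delta < ext[i].length) {
            out->backing_id = ext[i].backing_id;
            out->dev_offset = ext[i].offset + delta;
            out->remaining = ext[i].length - delta;
            return 0;
        }
        start += ext[i].length;
    }
    return -1;
}
```

A linear walk is enough for short lists; a file with many extents would
probably want a sorted array with binary search or a tree instead.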
> A dax device to famfs is a lot more like a backing device for a "filesystem"
> than a backing file for another file. And, as previously mentioned, there
> is the iomap_ops interface and the holder_ops interface that deal with
> multiple file tenants on a dax device (plus error notification,
> respectively)
>
> Probably doable, but important distinctions...
Yeah, that's why I suggested creating a new source file for this
within fs/fuse. Alternatively, we could try splitting fuse up into
modules (core, virtiofs, cuse, fsdax), but I think that can be left as
a later cleanup step.
> First question: can you suggest an example fuse file pass-through
> file system that I might use as a jumping-off point? Something that
> gets the basic pass-through capability from which to start hacking
> in famfs/dax capabilities?
An example is in Amir's libfuse repo at
https://github.com/libfuse/libfuse
> I'm confused by the last item. I would think there would be a fuse
> inode per famfs file, and that multiple of those would map to separate
> extent lists of one or more backing dax devices.
Yeah.
> Or maybe I misunderstand the meaning of "fuse inode". Feel free to
> assign reading...
I think Amir meant that each open file could in theory have a
different mapping. This is allowed by the fuse interface, but is
disallowed in practice.
I'm in favor of caching the extent map so it only has to be given on
the first open (or lookup).
Thanks,
Miklos