Message-ID: <r4njpmpw4mnkzt6msn6k523dcagoi7gulhbvanpht26b3lpvtm@7oroy3y2dr2c>
Date: Tue, 22 Apr 2025 06:50:25 -0500
From: John Groves <John@...ves.net>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: Dan Williams <dan.j.williams@...el.com>,
Miklos Szeredi <miklos@...redi.hu>, Bernd Schubert <bschubert@....com>,
John Groves <jgroves@...ron.com>, Jonathan Corbet <corbet@....net>,
Vishal Verma <vishal.l.verma@...el.com>, Dave Jiang <dave.jiang@...el.com>,
Matthew Wilcox <willy@...radead.org>, Jan Kara <jack@...e.cz>,
Alexander Viro <viro@...iv.linux.org.uk>, Christian Brauner <brauner@...nel.org>,
Luis Henriques <luis@...lia.com>, Randy Dunlap <rdunlap@...radead.org>,
Jeff Layton <jlayton@...nel.org>, Kent Overstreet <kent.overstreet@...ux.dev>,
Petr Vorel <pvorel@...e.cz>, Brian Foster <bfoster@...hat.com>, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, nvdimm@...ts.linux.dev, linux-cxl@...r.kernel.org,
linux-fsdevel@...r.kernel.org, Amir Goldstein <amir73il@...il.com>,
Jonathan Cameron <Jonathan.Cameron@...wei.com>, Stefan Hajnoczi <shajnocz@...hat.com>,
Joanne Koong <joannelkoong@...il.com>, Josef Bacik <josef@...icpanda.com>,
Aravind Ramesh <arramesh@...ron.com>, Ajay Joshi <ajayjoshi@...ron.com>
Subject: Re: [RFC PATCH 00/19] famfs: port into fuse

On 25/04/21 06:25PM, Darrick J. Wong wrote:
> On Mon, Apr 21, 2025 at 05:00:35PM -0500, John Groves wrote:
> > On 25/04/21 11:27AM, Darrick J. Wong wrote:
> > > On Sun, Apr 20, 2025 at 08:33:27PM -0500, John Groves wrote:
> > > > Subject: famfs: port into fuse
> > > >
> > > > This is the initial RFC for the fabric-attached memory file system (famfs)
> > > > integration into fuse. In order to function, this requires a related patch
> > > > to libfuse [1] and the famfs user space [2].
> > > >
> > > > This RFC is mainly intended to socialize the approach and get feedback from
> > > > the fuse developers and maintainers. There is some dax work that needs to
> > > > be done before this should be merged (see the "poisoned page|folio problem"
> > > > below).
> > >
> > > Note that I'm only looking at the fuse and iomap aspects of this
> > > patchset. I don't know the devdax code at all.
> > >
> > > > This patch set fully works with Linux 6.14 -- passing all existing famfs
> > > > smoke and unit tests -- and I encourage existing famfs users to test it.
> > > >
> > > > This is really two patch sets mashed up:
> > > >
> > > > * The patches with the dev_dax_iomap: prefix fill in missing functionality for
> > > > devdax to host an fs-dax file system.
> > > > * The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
> > > > unchanged since last year.
> > > >
> > > > Because this is not ready to merge yet, I have felt free to leave some debug
> > > > prints in place because we still find them useful; those will be cleaned up
> > > > in a subsequent revision.
> > > >
> > > > Famfs Overview
> > > >
> > > > Famfs exposes shared memory as a file system. Famfs consumes shared memory
> > > > from dax devices, and provides memory-mappable files that map directly to
> > > > the memory - no page cache involvement. Famfs differs from conventional
> > > > file systems in fs-dax mode, in that it handles in-memory metadata in a
> > > > sharable way (which begins with never caching dirty shared metadata).
> > > >
> > > > Famfs started as a standalone file system [3,4], but the consensus at LSFMM
> > > > 2024 [5] was that it should be ported into fuse - and this RFC is the first
> > > > public evidence that I've been working on that.
> > >
> > > This is very timely, as I just started looking into how I might connect
> > > iomap to fuse so that most of the hot IO path continues to run in the
> > > kernel, and userspace block device filesystem drivers merely supply the
> > > file mappings to the kernel. In other words, we kick the metadata
> > > parsing craziness out of the kernel.
> >
> > Coool!
> >
> > >
> > > > The key performance requirement is that famfs must resolve mapping faults
> > > > without upcalls. This is achieved by fully caching the file-to-devdax
> > > > metadata for all active files. This is done via two fuse client/server
> > > > message/response pairs: GET_FMAP and GET_DAXDEV.
> > >
> > > Heh, just last week I finally got around to laying out how I think I'd
> > > want to expose iomap through fuse to allow ->iomap_begin/->iomap_end
> > > upcalls to a fuse server. Note that I've done zero prototyping but
> > > "upload all the mappings at open time" seems like a reasonable place for
> > > me to start looking, especially for a filesystem with static mappings.
> > >
> > > I think what I want to try to build is an in-kernel mapping cache (sort
> > > of like the one you built), only with upcalls to the fuse server when
> > > there is no mapping information for a given IO. I'd probably want to
> > > have a means for the fuse server to put new mappings into the cache, or
> > > invalidate existing mappings.
> > >
> > > (famfs obviously is a simple corner-case of that grandiose vision, but I
> > > still have a long way to get to my larger vision so don't take my words
> > > as any kind of requirement.)
> > >
> > > > Famfs remains the first fs-dax file system that is backed by devdax rather
> > > > than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).
> > > >
> > > > Notes
> > > >
> > > > * Once the dev_dax_iomap patches land, I suspect it may make sense for
> > > > virtiofs to update to use the improved interface.
> > > >
> > > > * I'm currently maintaining compatibility between the famfs user space and
> > > > both the standalone famfs kernel file system and this new fuse
> > > > implementation. In the near future I'll be running performance comparisons
> > > > and sharing them - but there is no reason to expect significant degradation
> > > > with fuse, since famfs caches entire "fmaps" in the kernel to resolve
> > >
> > > I'm curious to hear what you find, performance-wise. :)
> > >
> > > > faults with no upcalls. This patch has a bit too much debug turned on to
> > > > > do that testing quite yet. A branch
> > >
> > > A branch ... what?
> >
> > I trail off sometimes... ;)
> >
> > >
> > > > * Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.
> > > >
> > > > * When a file is looked up in a famfs mount, the LOOKUP is followed by a
> > > > GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
> > > > allowing the fuse/famfs kernel code to handle read/write/fault without any
> > > > upcalls.
> > >
> > > Huh, I'd have thought you'd wait until FUSE_OPEN to start preloading
> > > mappings into the kernel.
> >
> > That may be a better approach. Miklos and I discussed it during LPC last year,
> > and thought both were options. Having implemented it at LOOKUP time, I think
> > moving it to open might avoid my READDIRPLUS problem (which is that RDP is a
> > mashup of READDIR and LOOKUP, and therefore might need to carry the GET_FMAP
> > payload). Moving GET_FMAP to open time would break that connection in a good
> > way, I think.
>
> I wonder if we could just add a couple new "notification" types so that
> the fuse server can initiate uploads of mappings whenever it feels like
> it. For your usage model I don't think it'll make much difference since
> they seem pretty static, but the ability to do that would open up some
> flexibility for famfs. The more general filesystems will need it
> anyway, and someone's going to want to truncate a famfs file. They
> always do. ;)
>
> > >
> > > > * After each GET_FMAP, the fmap is checked for extents that reference
> > > > > previously-unknown daxdevs. Each such occurrence is handled with a
> > > > GET_DAXDEV message and response.
> > >
> > > I hadn't figured out how this part would work for my silly prototype.
> > > Just out of curiosity, does the famfs fuse server hold an open fd to the
> > > storage, in which case the fmap(ping) could just contain the open fd?
> > >
> > > Where are the mappings that are sent from the fuse server? Is that
> > > struct fuse_famfs_simple_ext?
> >
> > See patch 17 or fs/fuse/famfs_kfmap.h for the fmap metadata explanation.
> > Famfs currently supports either simple extents (daxdev, offset, length) or
> > interleaved ones (which describe each "strip" as a simple extent). I think
> > the explanation in famfs_kfmap.h is pretty clear.
> >
> > A key question is whether any additional basic metadata abstractions would
> > be needed - because the kernel needs to understand the full scheme.
> >
> > With disaggregated memory, the interleave approach is nice because it gets
> > aggregated performance and resolving a file offset to daxdev offset is order
> > 1.
> >
> > Oh, and there are two fmap formats (ok, more, but the others are legacy ;).
> > The fmaps-in-messages structs are currently in the famfs section of
> > include/uapi/linux/fuse.h. And the in-memory version is in
> > fs/fuse/famfs_kfmap.h. The former will need to be a versioned interface.
> > (ugh...)
>
> Ok, will take a look tomorrow morning.
>
> > >
> > > > * Daxdevs are stored in a table (which might become an xarray at some point).
> > > > When entries are added to the table, we acquire exclusive access to the
> > > > daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
> > > > with pmem devices). famfs provides holder_operations to devdax, providing
> > > > a notification path in the event of memory errors.
> > > >
> > > > * If devdax notifies famfs of memory errors on a dax device, famfs currently
> > > > > blocks all subsequent accesses to data on that device. The recovery is to
> > > > re-initialize the memory and file system. Famfs is memory, not storage...
> > >
> > > Ouch. :)
> >
> > Cautious initial approach (i.e. I'm trying not to scare people too much ;)
> >
> > >
> > > > * Because famfs uses backing (devdax) devices, only privileged mounts are
> > > > supported.
> > > >
> > > > * The famfs kernel code never accesses the memory directly - it only
> > > > facilitates read, write and mmap on behalf of user processes. As such,
> > > > the RAS of the shared memory affects applications, but not the kernel.
> > > >
> > > > * Famfs has backing device(s), but they are devdax (char) rather than
> > > > block. Right now there is no way to tell the vfs layer that famfs has a
> > > > char backing device (unless we say it's block, but it's not). Currently
> > > > we use the standard anonymous fuse fs_type - but I'm not sure that's
> > > > ultimately optimal (thoughts?)
> > >
> > > Does it work if the fusefs server adds "-o fsname=<devdax cdev>" to the
> > > fuse_args object? fuse2fs does that, though I don't recall if that's a
> > > reasonable thing to do.
> >
> > The kernel needs to "own" the dax devices. fs-dax on pmem/block calls
> > fs_dax_get_by_bdev() and passes in holder_operations - which are used for
> > error upcalls, but also effect exclusive ownership.
> >
> > I added fs_dax_get() since the bdev version wasn't really right for char
> > devdax. But same holder_operations.
> >
> > I had originally intended to pass in "-o daxdev=<cdev>", but famfs needs to
> > span multiple daxdevs, in order to interleave for performance. The approach
> > of retrieving them with GET_DAXDEV handles the generalized case, so "-o"
> > just amounts to a second way to do the same thing.
>
> Oh, hah, it's a multi-device filesystem. Hee hee hee...

Hee hee indeed. The thing about memory, and dax devices, is that there
isn't anything like device mapper that can make compound or interleaved
devices. There's not a "stop while dma happens" point for swizzling
addresses. I'm down for a discussion about whether there is a viable way
to have a mapper layer, but I also think constructing interleaved objects
as files is quite good - and might be the best solution.

Interleaving is essential to memory performance in general. System-ram is
pretty much never not interleaved. And there are some reasons why programming
the hardware to do the interleaving is gonna be a problem for non-static
setups. I'll save going down that rathole for a different time...

John
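
For readers following the interleave discussion, here is a minimal sketch of
how an interleaved fmap could resolve a file offset to a (daxdev, offset) pair
in constant time. It assumes a round-robin layout with a fixed chunk size, and
every struct, field, and function name below is hypothetical, chosen for
illustration only; the actual definitions live in fs/fuse/famfs_kfmap.h and
may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; not the structs in famfs_kfmap.h. */
struct simple_ext {
	uint32_t dev_index;   /* index into the daxdev table */
	uint64_t dev_offset;  /* byte offset of this extent within the daxdev */
	uint64_t len;         /* extent length in bytes */
};

struct interleaved_ext {
	uint64_t chunk_size;              /* bytes placed on one strip per turn */
	size_t nstrips;                   /* number of strips */
	const struct simple_ext *strips;  /* each strip is a simple extent */
};

struct dev_addr {
	uint32_t dev_index;
	uint64_t dev_offset;
};

/*
 * Resolve file_off within an interleaved extent (assumed round-robin layout).
 * Returns 0 on success, -1 if the offset falls beyond the selected strip.
 * The work is a fixed number of divides and adds, independent of file size.
 */
int resolve_interleaved(const struct interleaved_ext *ie, uint64_t file_off,
			struct dev_addr *out)
{
	assert(ie->nstrips > 0 && ie->chunk_size > 0);

	uint64_t chunk = file_off / ie->chunk_size;     /* global chunk number */
	uint64_t in_chunk = file_off % ie->chunk_size;  /* offset inside chunk */
	size_t strip = (size_t)(chunk % ie->nstrips);   /* strip holding chunk */
	/* Full stripes before this chunk, times chunk size, plus the remainder. */
	uint64_t strip_off = (chunk / ie->nstrips) * ie->chunk_size + in_chunk;
	const struct simple_ext *se = &ie->strips[strip];

	if (strip_off >= se->len)
		return -1;

	out->dev_index = se->dev_index;
	out->dev_offset = se->dev_offset + strip_off;
	return 0;
}
```

With two strips and a 4 KiB chunk, consecutive 4 KiB chunks of the file
alternate between the two daxdevs, which is where the aggregated bandwidth
comes from. A simple (non-interleaved) extent is the degenerate case of one
strip.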