Message-ID: <37ss7esexgblholq5wc5caeizhcjpjhjxsghqjtkxjqri4uxjp@gixtdlggap5i>
Date: Mon, 21 Apr 2025 17:00:35 -0500
From: John Groves <John@...ves.net>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: Dan Williams <dan.j.williams@...el.com>,
Miklos Szeredi <miklos@...redi.hu>, Bernd Schubert <bschubert@....com>,
John Groves <jgroves@...ron.com>, Jonathan Corbet <corbet@....net>,
Vishal Verma <vishal.l.verma@...el.com>, Dave Jiang <dave.jiang@...el.com>,
Matthew Wilcox <willy@...radead.org>, Jan Kara <jack@...e.cz>,
Alexander Viro <viro@...iv.linux.org.uk>, Christian Brauner <brauner@...nel.org>,
Luis Henriques <luis@...lia.com>, Randy Dunlap <rdunlap@...radead.org>,
Jeff Layton <jlayton@...nel.org>, Kent Overstreet <kent.overstreet@...ux.dev>,
Petr Vorel <pvorel@...e.cz>, Brian Foster <bfoster@...hat.com>, linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org, nvdimm@...ts.linux.dev, linux-cxl@...r.kernel.org,
linux-fsdevel@...r.kernel.org, Amir Goldstein <amir73il@...il.com>,
Jonathan Cameron <Jonathan.Cameron@...wei.com>, Stefan Hajnoczi <shajnocz@...hat.com>,
Joanne Koong <joannelkoong@...il.com>, Josef Bacik <josef@...icpanda.com>,
Aravind Ramesh <arramesh@...ron.com>, Ajay Joshi <ajayjoshi@...ron.com>
Subject: Re: [RFC PATCH 00/19] famfs: port into fuse
On 25/04/21 11:27AM, Darrick J. Wong wrote:
> On Sun, Apr 20, 2025 at 08:33:27PM -0500, John Groves wrote:
> > Subject: famfs: port into fuse
> >
> > This is the initial RFC for the fabric-attached memory file system (famfs)
> > integration into fuse. In order to function, this requires a related patch
> > to libfuse [1] and the famfs user space [2].
> >
> > This RFC is mainly intended to socialize the approach and get feedback from
> > the fuse developers and maintainers. There is some dax work that needs to
> > be done before this should be merged (see the "poisoned page|folio problem"
> > below).
>
> Note that I'm only looking at the fuse and iomap aspects of this
> patchset. I don't know the devdax code at all.
>
> > This patch set fully works with Linux 6.14 -- passing all existing famfs
> > smoke and unit tests -- and I encourage existing famfs users to test it.
> >
> > This is really two patch sets mashed up:
> >
> > * The patches with the dev_dax_iomap: prefix fill in missing functionality for
> > devdax to host an fs-dax file system.
> > * The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
> > unchanged since last year.
> >
> > Because this is not ready to merge yet, I have felt free to leave some debug
> > prints in place because we still find them useful; those will be cleaned up
> > in a subsequent revision.
> >
> > Famfs Overview
> >
> > Famfs exposes shared memory as a file system. Famfs consumes shared memory
> > from dax devices, and provides memory-mappable files that map directly to
> > the memory - no page cache involvement. Famfs differs from conventional
> > file systems in fs-dax mode, in that it handles in-memory metadata in a
> > sharable way (which begins with never caching dirty shared metadata).
> >
> > Famfs started as a standalone file system [3,4], but the consensus at LSFMM
> > 2024 [5] was that it should be ported into fuse - and this RFC is the first
> > public evidence that I've been working on that.
>
> This is very timely, as I just started looking into how I might connect
> iomap to fuse so that most of the hot IO path continues to run in the
> kernel, and userspace block device filesystem drivers merely supply the
> file mappings to the kernel. In other words, we kick the metadata
> parsing craziness out of the kernel.
Coool!
>
> > The key performance requirement is that famfs must resolve mapping faults
> > without upcalls. This is achieved by fully caching the file-to-devdax
> > metadata for all active files. This is done via two fuse client/server
> > message/response pairs: GET_FMAP and GET_DAXDEV.
>
> Heh, just last week I finally got around to laying out how I think I'd
> want to expose iomap through fuse to allow ->iomap_begin/->iomap_end
> upcalls to a fuse server. Note that I've done zero prototyping but
> "upload all the mappings at open time" seems like a reasonable place for
> me to start looking, especially for a filesystem with static mappings.
>
> I think what I want to try to build is an in-kernel mapping cache (sort
> of like the one you built), only with upcalls to the fuse server when
> there is no mapping information for a given IO. I'd probably want to
> have a means for the fuse server to put new mappings into the cache, or
> invalidate existing mappings.
>
> (famfs obviously is a simple corner-case of that grandiose vision, but I
> still have a long way to get to my larger vision so don't take my words
> as any kind of requirement.)
>
> > Famfs remains the first fs-dax file system that is backed by devdax rather
> > than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).
> >
> > Notes
> >
> > * Once the dev_dax_iomap patches land, I suspect it may make sense for
> > virtiofs to update to use the improved interface.
> >
> > * I'm currently maintaining compatibility between the famfs user space and
> > both the standalone famfs kernel file system and this new fuse
> > implementation. In the near future I'll be running performance comparisons
> > and sharing them - but there is no reason to expect significant degradation
> > with fuse, since famfs caches entire "fmaps" in the kernel to resolve
>
> I'm curious to hear what you find, performance-wise. :)
>
> > faults with no upcalls. This patch has a bit too much debug turned on to
> > do that testing quite yet. A branch
>
> A branch ... what?
I trail off sometimes... ;)
>
> > * Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.
> >
> > * When a file is looked up in a famfs mount, the LOOKUP is followed by a
> > GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
> > allowing the fuse/famfs kernel code to handle read/write/fault without any
> > upcalls.
>
> Huh, I'd have thought you'd wait until FUSE_OPEN to start preloading
> mappings into the kernel.
That may be a better approach. Miklos and I discussed it during LPC last year,
and thought both were options. Having implemented it at LOOKUP time, I think
moving it to open might avoid my READDIRPLUS problem: RDP is a mashup of
READDIR and LOOKUP, so it might also need to carry the GET_FMAP payload.
Moving GET_FMAP to open time would break that connection in a good way, I
think.
>
> > * After each GET_FMAP, the fmap is checked for extents that reference
> > previously-unknown daxdevs. Each such occurrence is handled with a
> > GET_DAXDEV message and response.
>
> I hadn't figured out how this part would work for my silly prototype.
> Just out of curiosity, does the famfs fuse server hold an open fd to the
> storage, in which case the fmap(ping) could just contain the open fd?
>
> Where are the mappings that are sent from the fuse server? Is that
> struct fuse_famfs_simple_ext?
See patch 17 or fs/fuse/famfs_kfmap.h for the fmap metadata explanation.
Famfs currently supports either simple extents (daxdev, offset, length) or
interleaved ones (which describe each "strip" as a simple extent). I think
the explanation in famfs_kfmap.h is pretty clear.
A key question is whether any additional basic metadata abstractions would
be needed - because the kernel needs to understand the full scheme.
With disaggregated memory, the interleave approach is nice because it gets
aggregated performance, and resolving a file offset to a daxdev offset is
O(1).
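
To make the constant-time claim concrete, here is a minimal user-space sketch
of resolving a file offset within one interleaved extent. The struct and
function names are hypothetical and are not the ones in
fs/fuse/famfs_kfmap.h; the layout is an assumption based only on the
description above (each strip is a simple extent, and fixed-size chunks
rotate across the strips).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types for illustration; NOT the famfs_kfmap.h structs. */
struct sketch_simple_ext {
	unsigned int dev_index;  /* which daxdev this strip lives on */
	uint64_t dev_offset;     /* start of the strip on that daxdev */
	uint64_t len;            /* bytes available in this strip */
};

struct sketch_interleaved_ext {
	uint64_t chunk_size;     /* bytes placed on one strip before rotating */
	size_t nstrips;          /* number of strips in the interleave set */
	const struct sketch_simple_ext *strips;
};

struct sketch_target {
	unsigned int dev_index;
	uint64_t dev_offset;
};

/*
 * Resolve a file offset to (daxdev, device offset) with a fixed number of
 * arithmetic operations, independent of file size or strip count.
 * Returns 0 on success, -1 if the offset is beyond the strip.
 */
static int sketch_resolve(const struct sketch_interleaved_ext *ie,
			  uint64_t file_off, struct sketch_target *out)
{
	assert(ie->chunk_size > 0 && ie->nstrips > 0);

	uint64_t chunk_num = file_off / ie->chunk_size;   /* global chunk */
	uint64_t chunk_off = file_off % ie->chunk_size;   /* offset in chunk */
	size_t strip_num = chunk_num % ie->nstrips;       /* which strip */
	uint64_t stripe_num = chunk_num / ie->nstrips;    /* row in the set */
	uint64_t strip_off = stripe_num * ie->chunk_size + chunk_off;
	const struct sketch_simple_ext *s = &ie->strips[strip_num];

	if (strip_off >= s->len)
		return -1;

	out->dev_index = s->dev_index;
	out->dev_offset = s->dev_offset + strip_off;
	return 0;
}
```

For instance, with two strips and a 4096-byte chunk, file offset 12298 falls
in chunk 3, which lands on strip 1 at strip offset 4106.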
Oh, and there are two fmap formats (ok, more, but the others are legacy ;).
The fmaps-in-messages structs are currently in the famfs section of
include/uapi/linux/fuse.h. And the in-memory version is in
fs/fuse/famfs_kfmap.h. The former will need to be a versioned interface.
(ugh...)
>
> > * Daxdevs are stored in a table (which might become an xarray at some point).
> > When entries are added to the table, we acquire exclusive access to the
> > daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
> > with pmem devices). famfs provides holder_operations to devdax, providing
> > a notification path in the event of memory errors.
> >
> > * If devdax notifies famfs of memory errors on a dax device, famfs currently
> > blocks all subsequent accesses to data on that device. The recovery is to
> > re-initialize the memory and file system. Famfs is memory, not storage...
>
> Ouch. :)
Cautious initial approach (i.e. I'm trying not to scare people too much ;)
>
> > * Because famfs uses backing (devdax) devices, only privileged mounts are
> > supported.
> >
> > * The famfs kernel code never accesses the memory directly - it only
> > facilitates read, write and mmap on behalf of user processes. As such,
> > the RAS of the shared memory affects applications, but not the kernel.
> >
> > * Famfs has backing device(s), but they are devdax (char) rather than
> > block. Right now there is no way to tell the vfs layer that famfs has a
> > char backing device (unless we say it's block, but it's not). Currently
> > we use the standard anonymous fuse fs_type - but I'm not sure that's
> > ultimately optimal (thoughts?)
>
> Does it work if the fusefs server adds "-o fsname=<devdax cdev>" to the
> fuse_args object? fuse2fs does that, though I don't recall if that's a
> reasonable thing to do.
The kernel needs to "own" the dax devices. fs-dax on pmem/block calls
fs_dax_get_by_bdev() and passes in holder_operations - which are used for
error upcalls, but also effect exclusive ownership.
I added fs_dax_get() since the bdev version wasn't really right for char
devdax. But same holder_operations.
I had originally intended to pass in "-o daxdev=<cdev>", but famfs needs to
span multiple daxdevs, in order to interleave for performance. The approach
of retrieving them with GET_DAXDEV handles the generalized case, so "-o"
just amounts to a second way to do the same thing.
"But wait"... I thought. Doesn't the "-o" approach get the primary daxdev
locked up sooner, which might be good? Well, no, because famfs creates a
couple of meta files during mount (.meta/.superblock and .meta/.log), and
those are guaranteed to reference the primary daxdev. So I concluded the -o
approach wasn't worth the trouble (though it's not *much* trouble).
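
As a rough illustration of why GET_DAXDEV covers the general case, the
following user-space sketch walks the daxdev indices referenced by a newly
received fmap and requests each device that is not yet in the table. All
names are hypothetical, the callback merely stands in for the GET_DAXDEV
message/response, and the real kernel code (including the fs_dax_get() call
and holder_operations) is not modeled here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_DAXDEVS 8   /* arbitrary table size for this sketch */

/* Hypothetical table: one "known" flag per daxdev index. */
struct sketch_daxdev_table {
	bool known[SKETCH_MAX_DAXDEVS];
};

/* Stand-in for a GET_DAXDEV round trip; returns 0 on success. */
typedef int (*sketch_get_daxdev_fn)(unsigned int dev_index, void *ctx);

/* Example stand-in that only counts how many requests were issued. */
static int sketch_count_requests(unsigned int dev_index, void *ctx)
{
	(void)dev_index;
	(*(int *)ctx)++;
	return 0;
}

/*
 * Walk the daxdev index of every extent in an fmap and fetch each
 * previously-unknown device exactly once.
 * Returns the number of devices fetched, or -1 on error.
 */
static int sketch_check_fmap(struct sketch_daxdev_table *t,
			     const unsigned int *ext_dev_index, size_t next,
			     sketch_get_daxdev_fn get_daxdev, void *ctx)
{
	int fetched = 0;

	for (size_t i = 0; i < next; i++) {
		unsigned int idx = ext_dev_index[i];

		if (idx >= SKETCH_MAX_DAXDEVS)
			return -1;
		if (t->known[idx])
			continue;          /* already in the table */
		if (get_daxdev(idx, ctx) != 0)
			return -1;
		t->known[idx] = true;      /* record it for later fmaps */
		fetched++;
	}
	return fetched;
}
```

Under these assumptions the primary daxdev is simply the first index seen
(via the meta files created at mount), and any additional devices used for
interleaving are picked up the same way without a mount option.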
>
> > The "poisoned page|folio problem"
> >
> > * Background: before doing a kernel mount, the famfs user space [2] validates
> > the superblock and log. This is done via raw mmap of the primary devdax
> > device. If valid, the file system is mounted, and the superblock and log
> > get exposed through a pair of files (.meta/.superblock and .meta/.log) -
> > because we can't be using raw device mmap when a file system is mounted
> > on the device. But this exposes a devdax bug and warning...
> >
> > * Pages that have been memory mapped via devdax are left in a permanently
> > problematic state. Devdax sets page|folio->mapping when a page is accessed
> > via raw devdax mmap (as famfs does before mount), but never cleans it up.
> > When the pages of the famfs superblock and log are accessed via the "meta"
> > files after mount, we see a WARN_ONCE() in dax_insert_entry(), which
> > notices that page|folio->mapping is still set. I intend to address this
> > prior to asking for the famfs patches to be merged.
> >
> > * Alistair Popple's recent dax patch series [6], which has been merged
> > for 6.15, addresses some dax issues, but sadly does not fix the poisoned
> > page|folio problem - its enhanced refcount checking turns the warning into
> > an error.
> >
> > * This 6.14 patch set disables the warning; a proper fix will be required for
> > famfs to work at all in 6.15. Dan W. and I are actively discussing how to do
> > this properly...
> >
> > * In terms of the correct functionality of famfs, the warning can be ignored.
> >
> > References
> >
> > [1] - https://github.com/libfuse/libfuse/pull/1200
> > [2] - https://github.com/cxl-micron-reskit/famfs
>
> Thanks for posting links, I'll have a look there too.
>
> --D
>
I'm happy to talk if you wanna kick ideas around.
Cheers,
John