Message-Id: <20250421013346.32530-1-john@groves.net>
Date: Sun, 20 Apr 2025 20:33:27 -0500
From: John Groves <John@...ves.net>
To: John Groves <John@...ves.net>,
Dan Williams <dan.j.williams@...el.com>,
Miklos Szeredi <miklos@...redb.hu>,
Bernd Schubert <bschubert@....com>
Cc: John Groves <jgroves@...ron.com>,
Jonathan Corbet <corbet@....net>,
Vishal Verma <vishal.l.verma@...el.com>,
Dave Jiang <dave.jiang@...el.com>,
Matthew Wilcox <willy@...radead.org>,
Jan Kara <jack@...e.cz>,
Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>,
"Darrick J . Wong" <djwong@...nel.org>,
Luis Henriques <luis@...lia.com>,
Randy Dunlap <rdunlap@...radead.org>,
Jeff Layton <jlayton@...nel.org>,
Kent Overstreet <kent.overstreet@...ux.dev>,
Petr Vorel <pvorel@...e.cz>,
Brian Foster <bfoster@...hat.com>,
linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org,
nvdimm@...ts.linux.dev,
linux-cxl@...r.kernel.org,
linux-fsdevel@...r.kernel.org,
Amir Goldstein <amir73il@...il.com>,
Jonathan Cameron <Jonathan.Cameron@...wei.com>,
Stefan Hajnoczi <shajnocz@...hat.com>,
Joanne Koong <joannelkoong@...il.com>,
Josef Bacik <josef@...icpanda.com>,
Aravind Ramesh <arramesh@...ron.com>,
Ajay Joshi <ajayjoshi@...ron.com>,
John Groves <john@...ves.net>
Subject: [RFC PATCH 00/19] famfs: port into fuse
This is the initial RFC for the fabric-attached memory file system (famfs)
integration into fuse. In order to function, this requires a related patch
to libfuse [1] and the famfs user space [2].
This RFC is mainly intended to socialize the approach and get feedback from
the fuse developers and maintainers. There is some dax work that needs to
be done before this should be merged (see the "poisoned page|folio problem"
below).
This patch set fully works with Linux 6.14 -- passing all existing famfs
smoke and unit tests -- and I encourage existing famfs users to test it.
This is really two patch sets mashed up:
* The patches with the dev_dax_iomap: prefix fill in missing functionality for
devdax to host an fs-dax file system.
* The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
unchanged since last year.
Because this is not ready to merge yet, I have felt free to leave some debug
prints in place because we still find them useful; those will be cleaned up
in a subsequent revision.
Famfs Overview
Famfs exposes shared memory as a file system. Famfs consumes shared memory
from dax devices, and provides memory-mappable files that map directly to
the memory - no page cache involvement. Famfs differs from conventional
file systems in fs-dax mode, in that it handles in-memory metadata in a
sharable way (which begins with never caching dirty shared metadata).
Famfs started as a standalone file system [3,4], but the consensus at LSFMM
2024 [5] was that it should be ported into fuse - and this RFC is the first
public evidence that I've been working on that.
The key performance requirement is that famfs must resolve mapping faults
without upcalls. This is achieved by fully caching the file-to-devdax
metadata for all active files. This is done via two fuse client/server
message/response pairs: GET_FMAP and GET_DAXDEV.
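To illustrate why a fully cached fmap removes the need for upcalls, here is a minimal user-space C sketch that resolves a file offset to a (daxdev index, device offset) pair from an in-memory extent list. The names toy_extent, toy_fmap and toy_fmap_resolve are hypothetical and are not taken from the patches; the actual fmap layout in fs/fuse/famfs_kfmap.h may well differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical simple extent: one contiguous range on one dax device. */
struct toy_extent {
	uint32_t dev_index;  /* index into a daxdev table */
	uint64_t dev_offset; /* byte offset of the extent on that device */
	uint64_t len;        /* extent length in bytes */
};

/* Hypothetical cached file-to-dax map: extents in file-offset order. */
struct toy_fmap {
	size_t nextents;
	const struct toy_extent *ext;
};

/*
 * Resolve file_off using only the cached extent list (no upcall).
 * Returns 0 and fills dev_index/dev_off on success, or -1 if the
 * offset lies beyond the mapped extents.
 */
static int toy_fmap_resolve(const struct toy_fmap *fm, uint64_t file_off,
			    uint32_t *dev_index, uint64_t *dev_off)
{
	uint64_t start = 0; /* file offset at which the current extent begins */

	assert(fm != NULL && dev_index != NULL && dev_off != NULL);

	for (size_t i = 0; i < fm->nextents; i++) {
		const struct toy_extent *e = &fm->ext[i];

		/* file_off >= start always holds here, so this cannot wrap. */
		if (file_off - start < e->len) {
			*dev_index = e->dev_index;
			*dev_off = e->dev_offset + (file_off - start);
			return 0;
		}
		start += e->len;
	}
	return -1;
}
```

A fault handler built along these lines only walks data already held in memory, which is the property the performance requirement above depends on.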
Famfs remains the first fs-dax file system that is backed by devdax rather
than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).
Notes
* Once the dev_dax_iomap patches land, I suspect it may make sense for
virtiofs to update to use the improved interface.
* I'm currently maintaining compatibility between the famfs user space and
both the standalone famfs kernel file system and this new fuse
implementation. In the near future I'll be running performance comparisons
and sharing them - but there is no reason to expect significant degradation
with fuse, since famfs caches entire "fmaps" in the kernel to resolve
faults with no upcalls. This patch has a bit too much debug turned on to
do that testing quite yet. A branch
* Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.
* When a file is looked up in a famfs mount, the LOOKUP is followed by a
GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
allowing the fuse/famfs kernel code to handle read/write/fault without any
upcalls.
* After each GET_FMAP, the fmap is checked for extents that reference
previously-unknown daxdevs. Each such occurrence is handled with a
GET_DAXDEV message and response.
* Daxdevs are stored in a table (which might become an xarray at some point).
When entries are added to the table, we acquire exclusive access to the
daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
with pmem devices). famfs provides holder_operations to devdax, providing
a notification path in the event of memory errors.
* If devdax notifies famfs of memory errors on a dax device, famfs currently
blocks all subsequent accesses to data on that device. The recovery is to
re-initialize the memory and file system. Famfs is memory, not storage...
* Because famfs uses backing (devdax) devices, only privileged mounts are
supported.
* The famfs kernel code never accesses the memory directly - it only
facilitates read, write and mmap on behalf of user processes. As such,
the RAS of the shared memory affects applications, but not the kernel.
* Famfs has backing device(s), but they are devdax (char) rather than
block. Right now there is no way to tell the vfs layer that famfs has a
char backing device (unless we say it's block, but it's not). Currently
we use the standard anonymous fuse fs_type - but I'm not sure that's
ultimately optimal (thoughts?)
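The daxdev handling described in the notes above can be sketched in user-space C roughly as follows. This is a toy model under stated assumptions, not the code in the patches: toy_daxdev_table, toy_get_daxdev, toy_fmap_check_devs, toy_notify_failure and toy_access_allowed are all hypothetical names, toy_get_daxdev merely stands in for the GET_DAXDEV message and response, and the real code additionally acquires the device via fs_dax_get() and registers holder_operations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_DAXDEVS 8 /* arbitrary table size for this sketch */

struct toy_daxdev {
	bool valid;  /* entry was filled in by a (simulated) GET_DAXDEV */
	bool failed; /* set after a memory-error notification */
};

struct toy_daxdev_table {
	struct toy_daxdev dev[TOY_MAX_DAXDEVS];
	unsigned int get_daxdev_calls; /* counts simulated upcalls */
};

/* Stand-in for the GET_DAXDEV message/response pair. */
static int toy_get_daxdev(struct toy_daxdev_table *t, uint32_t idx)
{
	t->get_daxdev_calls++;
	t->dev[idx].valid = true;
	t->dev[idx].failed = false;
	return 0;
}

/*
 * After a (simulated) GET_FMAP: check every device index referenced by
 * the fmap's extents and fetch only the ones not yet in the table.
 */
static int toy_fmap_check_devs(struct toy_daxdev_table *t,
			       const uint32_t *dev_indexes, size_t n)
{
	assert(t != NULL);

	for (size_t i = 0; i < n; i++) {
		uint32_t idx = dev_indexes[i];

		if (idx >= TOY_MAX_DAXDEVS)
			return -1;
		if (!t->dev[idx].valid && toy_get_daxdev(t, idx) != 0)
			return -1;
	}
	return 0;
}

/* Memory-error notification: block all later access to this device. */
static void toy_notify_failure(struct toy_daxdev_table *t, uint32_t idx)
{
	if (idx < TOY_MAX_DAXDEVS && t->dev[idx].valid)
		t->dev[idx].failed = true;
}

/* Gate for read/write/fault paths. */
static bool toy_access_allowed(const struct toy_daxdev_table *t, uint32_t idx)
{
	return idx < TOY_MAX_DAXDEVS && t->dev[idx].valid && !t->dev[idx].failed;
}
```

In this model a device index that appears in several extents triggers only one simulated GET_DAXDEV, and a failure on one device leaves access to the others untouched.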
The "poisoned page|folio problem"
* Background: before doing a kernel mount, the famfs user space [2] validates
the superblock and log. This is done via raw mmap of the primary devdax
device. If valid, the file system is mounted, and the superblock and log
get exposed through a pair of files (.meta/.superblock and .meta/.log) -
because we can't be using raw device mmap when a file system is mounted
on the device. But this exposes a devdax bug and warning...
* Pages that have been memory mapped via devdax are left in a permanently
problematic state. Devdax sets page|folio->mapping when a page is accessed
via raw devdax mmap (as famfs does before mount), but never cleans it up.
When the pages of the famfs superblock and log are accessed via the "meta"
files after mount, we see a WARN_ONCE() in dax_insert_entry(), which
notices that page|folio->mapping is still set. I intend to address this
prior to asking for the famfs patches to be merged.
* Alistair Popple's recent dax patch series [6], which has been merged
for 6.15, addresses some dax issues, but sadly does not fix the poisoned
page|folio problem - its enhanced refcount checking turns the warning into
an error.
* This 6.14 patch set disables the warning; a proper fix will be required for
famfs to work at all in 6.15. Dan W. and I are actively discussing how to do
this properly...
* In terms of the correct functionality of famfs, the warning can be ignored.
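For readers unfamiliar with the issue, the sequence above can be modeled in a few lines of user-space C. This is purely a toy model: toy_page, toy_devdax_raw_access and toy_fsdax_insert are hypothetical names, and the real logic lives in the devdax fault path and in dax_insert_entry() in fs/dax.c, which this sketch does not reproduce.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for the page|folio->mapping field. */
struct toy_page {
	const void *mapping;
};

/* Raw devdax mmap access: sets mapping and, as described, never clears it. */
static void toy_devdax_raw_access(struct toy_page *pg, const void *devdax_mapping)
{
	assert(pg != NULL);
	pg->mapping = devdax_mapping;
}

/*
 * Later fs-dax style insert of the same page for a file mapping.
 * Returns true if a leftover mapping was found, which corresponds to
 * the condition that triggers the warning described above.
 */
static bool toy_fsdax_insert(struct toy_page *pg, const void *file_mapping)
{
	bool stale;

	assert(pg != NULL);
	stale = (pg->mapping != NULL);
	pg->mapping = file_mapping;
	return stale;
}
```

A page that was never touched through raw devdax mmap passes the check; the superblock and log pages, which user space maps before mount, do not.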
References
[1] - https://github.com/libfuse/libfuse/pull/1200
[2] - https://github.com/cxl-micron-reskit/famfs
[3] - https://lore.kernel.org/linux-cxl/cover.1708709155.git.john@groves.net/
[4] - https://lore.kernel.org/linux-cxl/cover.1714409084.git.john@groves.net/
[5] - https://lwn.net/Articles/983105/
[6] - https://lore.kernel.org/linux-cxl/cover.8068ad144a7eea4a813670301f4d2a86a8e68ec4.1740713401.git-series.apopple@nvidia.com/
John Groves (19):
dev_dax_iomap: Move dax_pgoff_to_phys() from device.c to bus.c
dev_dax_iomap: Add fs_dax_get() func to prepare dax for fs-dax usage
dev_dax_iomap: Save the kva from memremap
dev_dax_iomap: Add dax_operations for use by fs-dax on devdax
dev_dax_iomap: export dax_dev_get()
dev_dax_iomap: (ignore!) Drop poisoned page warning in fs/dax.c
famfs_fuse: magic.h: Add famfs magic numbers
famfs_fuse: Kconfig
famfs_fuse: Update macro s/FUSE_IS_DAX/FUSE_IS_VIRTIO_DAX/
famfs_fuse: Basic fuse kernel ABI enablement for famfs
famfs_fuse: Basic famfs mount opts
famfs_fuse: Plumb the GET_FMAP message/response
famfs_fuse: Create files with famfs fmaps
famfs_fuse: GET_DAXDEV message and daxdev_table
famfs_fuse: Plumb dax iomap and fuse read/write/mmap
famfs_fuse: Add holder_operations for dax notify_failure()
famfs_fuse: Add famfs metadata documentation
famfs_fuse: Add documentation
famfs_fuse: (ignore) debug cruft
Documentation/filesystems/famfs.rst | 142 ++++
Documentation/filesystems/index.rst | 1 +
MAINTAINERS | 10 +
drivers/dax/Kconfig | 6 +
drivers/dax/bus.c | 144 +++-
drivers/dax/dax-private.h | 1 +
drivers/dax/device.c | 38 +-
drivers/dax/super.c | 33 +-
fs/dax.c | 1 -
fs/fuse/Kconfig | 13 +
fs/fuse/Makefile | 4 +-
fs/fuse/dev.c | 61 ++
fs/fuse/dir.c | 74 +-
fs/fuse/famfs.c | 1105 +++++++++++++++++++++++++++
fs/fuse/famfs_kfmap.h | 166 ++++
fs/fuse/file.c | 27 +-
fs/fuse/fuse_i.h | 67 +-
fs/fuse/inode.c | 49 +-
fs/fuse/iomode.c | 2 +-
fs/namei.c | 1 +
include/linux/dax.h | 6 +
include/uapi/linux/fuse.h | 63 ++
include/uapi/linux/magic.h | 2 +
23 files changed, 1973 insertions(+), 43 deletions(-)
create mode 100644 Documentation/filesystems/famfs.rst
create mode 100644 fs/fuse/famfs.c
create mode 100644 fs/fuse/famfs_kfmap.h
base-commit: 38fec10eb60d687e30c8c6b5420d86e8149f7557
--
2.49.0