Message-ID: <37ss7esexgblholq5wc5caeizhcjpjhjxsghqjtkxjqri4uxjp@gixtdlggap5i>
Date: Mon, 21 Apr 2025 17:00:35 -0500
From: John Groves <John@...ves.net>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: Dan Williams <dan.j.williams@...el.com>, 
	Miklos Szeredi <miklos@...redi.hu>, Bernd Schubert <bschubert@....com>, 
	John Groves <jgroves@...ron.com>, Jonathan Corbet <corbet@....net>, 
	Vishal Verma <vishal.l.verma@...el.com>, Dave Jiang <dave.jiang@...el.com>, 
	Matthew Wilcox <willy@...radead.org>, Jan Kara <jack@...e.cz>, 
	Alexander Viro <viro@...iv.linux.org.uk>, Christian Brauner <brauner@...nel.org>, 
	Luis Henriques <luis@...lia.com>, Randy Dunlap <rdunlap@...radead.org>, 
	Jeff Layton <jlayton@...nel.org>, Kent Overstreet <kent.overstreet@...ux.dev>, 
	Petr Vorel <pvorel@...e.cz>, Brian Foster <bfoster@...hat.com>, linux-doc@...r.kernel.org, 
	linux-kernel@...r.kernel.org, nvdimm@...ts.linux.dev, linux-cxl@...r.kernel.org, 
	linux-fsdevel@...r.kernel.org, Amir Goldstein <amir73il@...il.com>, 
	Jonathan Cameron <Jonathan.Cameron@...wei.com>, Stefan Hajnoczi <shajnocz@...hat.com>, 
	Joanne Koong <joannelkoong@...il.com>, Josef Bacik <josef@...icpanda.com>, 
	Aravind Ramesh <arramesh@...ron.com>, Ajay Joshi <ajayjoshi@...ron.com>
Subject: Re: [RFC PATCH 00/19] famfs: port into fuse

On 25/04/21 11:27AM, Darrick J. Wong wrote:
> On Sun, Apr 20, 2025 at 08:33:27PM -0500, John Groves wrote:
> > Subject: famfs: port into fuse
> > 
> > This is the initial RFC for the fabric-attached memory file system (famfs)
> > integration into fuse. In order to function, this requires a related patch
> > to libfuse [1] and the famfs user space [2]. 
> > 
> > This RFC is mainly intended to socialize the approach and get feedback from
> > the fuse developers and maintainers. There is some dax work that needs to
> > be done before this should be merged (see the "poisoned page|folio problem"
> > below).
> 
> Note that I'm only looking at the fuse and iomap aspects of this
> patchset.  I don't know the devdax code at all.
> 
> > This patch set fully works with Linux 6.14 -- passing all existing famfs
> > smoke and unit tests -- and I encourage existing famfs users to test it.
> > 
> > This is really two patch sets mashed up:
> > 
> > * The patches with the dev_dax_iomap: prefix fill in missing functionality for
> >   devdax to host an fs-dax file system.
> > * The famfs_fuse: patches add famfs into fs/fuse/. These are effectively
> >   unchanged since last year.
> > 
> > Because this is not ready to merge yet, I have felt free to leave some debug
> > prints in place because we still find them useful; those will be cleaned up
> > in a subsequent revision.
> > 
> > Famfs Overview
> > 
> > Famfs exposes shared memory as a file system. Famfs consumes shared memory
> > from dax devices, and provides memory-mappable files that map directly to
> > the memory - no page cache involvement. Famfs differs from conventional
> > file systems in fs-dax mode, in that it handles in-memory metadata in a
> > sharable way (which begins with never caching dirty shared metadata).
> > 
> > Famfs started as a standalone file system [3,4], but the consensus at LSFMM
> > 2024 [5] was that it should be ported into fuse - and this RFC is the first
> > public evidence that I've been working on that.
> 
> This is very timely, as I just started looking into how I might connect
> iomap to fuse so that most of the hot IO path continues to run in the
> kernel, and userspace block device filesystem drivers merely supply the
> file mappings to the kernel.  In other words, we kick the metadata
> parsing craziness out of the kernel.

Coool!

> 
> > The key performance requirement is that famfs must resolve mapping faults
> > without upcalls. This is achieved by fully caching the file-to-devdax
> > metadata for all active files. This is done via two fuse client/server
> > message/response pairs: GET_FMAP and GET_DAXDEV.
> 
> Heh, just last week I finally got around to laying out how I think I'd
> want to expose iomap through fuse to allow ->iomap_begin/->iomap_end
> upcalls to a fuse server.  Note that I've done zero prototyping but
> "upload all the mappings at open time" seems like a reasonable place for
> me to start looking, especially for a filesystem with static mappings.
> 
> I think what I want to try to build is an in-kernel mapping cache (sort
> of like the one you built), only with upcalls to the fuse server when
> there is no mapping information for a given IO.  I'd probably want to
> have a means for the fuse server to put new mappings into the cache, or
> invalidate existing mappings.
> 
> (famfs obviously is a simple corner-case of that grandiose vision, but I
> still have a long way to get to my larger vision so don't take my words
> as any kind of requirement.)
> 
> > Famfs remains the first fs-dax file system that is backed by devdax rather
> > than pmem in fs-dax mode (hence the need for the dev_dax_iomap fixups).
> > 
> > Notes
> > 
> > * Once the dev_dax_iomap patches land, I suspect it may make sense for
> >   virtiofs to update to use the improved interface.
> > 
> > * I'm currently maintaining compatibility between the famfs user space and
> >   both the standalone famfs kernel file system and this new fuse
> >   implementation. In the near future I'll be running performance comparisons
> >   and sharing them - but there is no reason to expect significant degradation
> >   with fuse, since famfs caches entire "fmaps" in the kernel to resolve
> 
> I'm curious to hear what you find, performance-wise. :)
> 
> >   faults with no upcalls. This patch has a bit too much debug turned on to
> >   do that testing quite yet. A branch 
> 
> A branch ... what?

I trail off sometimes... ;)

> 
> > * Two new fuse messages / responses are added: GET_FMAP and GET_DAXDEV.
> > 
> > * When a file is looked up in a famfs mount, the LOOKUP is followed by a
> >   GET_FMAP message and response. The "fmap" is the full file-to-dax mapping,
> >   allowing the fuse/famfs kernel code to handle read/write/fault without any
> >   upcalls.
> 
> Huh, I'd have thought you'd wait until FUSE_OPEN to start preloading
> mappings into the kernel.

That may be a better approach. Miklos and I discussed it during LPC last year,
and thought both were options. Having implemented it at LOOKUP time, I think
moving it to open might avoid my READDIRPLUS problem: RDP is a mashup of
READDIR and LOOKUP, and therefore might need to carry the GET_FMAP payload
too. Moving GET_FMAP to open time would break that connection in a good way,
I think.

> 
> > * After each GET_FMAP, the fmap is checked for extents that reference
> >   previously-unknown daxdevs. Each such occurrence is handled with a
> >   GET_DAXDEV message and response.
> 
> I hadn't figured out how this part would work for my silly prototype.
> Just out of curiosity, does the famfs fuse server hold an open fd to the
> storage, in which case the fmap(ping) could just contain the open fd?
> 
> Where are the mappings that are sent from the fuse server?  Is that
> struct fuse_famfs_simple_ext?

See patch 17 or fs/fuse/famfs_kfmap.h for the fmap metadata explanation. 
Famfs currently supports either simple extents (daxdev, offset, length) or 
interleaved ones (which describe each "strip" as a simple extent). I think 
the explanation in famfs_kfmap.h is pretty clear.

A key question is whether any additional basic metadata abstractions would
be needed - because the kernel needs to understand the full scheme.

With disaggregated memory, the interleave approach is nice because it gets
aggregated performance, and resolving a file offset to a daxdev offset is
O(1).
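
As a rough illustration of why the interleaved lookup is constant time, here
is a minimal Python sketch. It is not the famfs kernel code: the class names,
the chunk_size field, and the round-robin chunk layout are assumptions made
for illustration, and the real layout is the one described in
fs/fuse/famfs_kfmap.h.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SimpleExtent:
    """One contiguous range on a dax device: (daxdev index, offset, length)."""
    daxdev: int
    offset: int
    length: int


@dataclass(frozen=True)
class InterleavedExtent:
    """Hypothetical interleaved extent.

    Assumption: fixed-size chunks are dealt round-robin across the strips,
    and each strip is itself a simple extent.
    """
    chunk_size: int
    strips: List[SimpleExtent]


def resolve(ext: InterleavedExtent, file_offset: int) -> Tuple[int, int]:
    """Map a file offset to (daxdev, device offset) using only arithmetic."""
    if file_offset < 0:
        raise ValueError("negative file offset")
    # Which chunk of the file, and how far into that chunk?
    chunk_num, chunk_off = divmod(file_offset, ext.chunk_size)
    # Which full stripe, and which strip within that stripe?
    stripe_num, strip_idx = divmod(chunk_num, len(ext.strips))
    strip = ext.strips[strip_idx]
    # Offset within the chosen strip.
    strip_off = stripe_num * ext.chunk_size + chunk_off
    if strip_off >= strip.length:
        raise ValueError("file offset beyond the end of the extent")
    return strip.daxdev, strip.offset + strip_off
```

The lookup is two divisions and an index into the strip list, with no search
over extents, so its cost does not depend on the file size or on the number
of strips.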

Oh, and there are two fmap formats (ok, more, but the others are legacy ;).
The fmaps-in-messages structs are currently in the famfs section of
include/uapi/linux/fuse.h. And the in-memory version is in 
fs/fuse/famfs_kfmap.h. The former will need to be a versioned interface.
(ugh...)

> 
> > * Daxdevs are stored in a table (which might become an xarray at some point).
> >   When entries are added to the table, we acquire exclusive access to the
> >   daxdev via the fs_dax_get() call (modeled after how fs-dax handles this
> >   with pmem devices). famfs provides holder_operations to devdax, providing
> >   a notification path in the event of memory errors.
> > 
> > * If devdax notifies famfs of memory errors on a dax device, famfs currently
> >   blocks all subsequent accesses to data on that device. The recovery is to
> >   re-initialize the memory and file system. Famfs is memory, not storage...
> 
> Ouch. :)

Cautious initial approach (i.e. I'm trying not to scare people too much ;) 

> 
> > * Because famfs uses backing (devdax) devices, only privileged mounts are
> >   supported.
> > 
> > * The famfs kernel code never accesses the memory directly - it only
> >   facilitates read, write and mmap on behalf of user processes. As such,
> >   the RAS of the shared memory affects applications, but not the kernel.
> > 
> > * Famfs has backing device(s), but they are devdax (char) rather than
> >   block. Right now there is no way to tell the vfs layer that famfs has a
> >   char backing device (unless we say it's block, but it's not). Currently
> >   we use the standard anonymous fuse fs_type - but I'm not sure that's
> >   ultimately optimal (thoughts?)
> 
> Does it work if the fusefs server adds "-o fsname=<devdax cdev>" to the
> fuse_args object?  fuse2fs does that, though I don't recall if that's a
> reasonable thing to do.

The kernel needs to "own" the dax devices. fs-dax on pmem/block calls
fs_dax_get_by_bdev() and passes in holder_operations - which are used for
error upcalls, but also effect exclusive ownership. 

I added fs_dax_get() since the bdev version wasn't really right for char
devdax. But same holder_operations.

I had originally intended to pass in "-o daxdev=<cdev>", but famfs needs to
span multiple daxdevs, in order to interleave for performance. The approach
of retrieving them with GET_DAXDEV handles the generalized case, so "-o"
just amounts to a second way to do the same thing.

"But wait"... I thought. Doesn't the "-o" approach get the primary daxdev
locked up sooner, which might be good? Well, no, because famfs creates a
couple of meta files during mount (.meta/.superblock and .meta/.log), and
those are guaranteed to reference the primary daxdev. So I concluded the -o
approach wasn't worth the trouble (though it's not *much* trouble).
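
The "fetch each previously-unknown daxdev once" behavior can be modeled
roughly as below. This is a toy Python sketch under stated assumptions, not
the kernel implementation: DaxdevTable, absorb_fmap, and the get_daxdev
callback are hypothetical names, the callback merely stands in for a
GET_DAXDEV message and response, and taking exclusive ownership via
fs_dax_get() is not modeled.

```python
from typing import Callable, Dict, Iterable, List


class DaxdevTable:
    """Hypothetical table of dax devices already known on the kernel side."""

    def __init__(self, get_daxdev: Callable[[int], str]) -> None:
        # Stand-in for one GET_DAXDEV round trip to the fuse server.
        self._get_daxdev = get_daxdev
        self._devs: Dict[int, str] = {}

    def absorb_fmap(self, extent_daxdevs: Iterable[int]) -> List[int]:
        """Check every daxdev index referenced by an fmap.

        Unknown indices trigger exactly one fetch each; known ones are
        skipped. Returns the indices fetched for this fmap, in order.
        """
        fetched: List[int] = []
        for idx in extent_daxdevs:
            if idx not in self._devs:
                self._devs[idx] = self._get_daxdev(idx)
                fetched.append(idx)
        return fetched
```

In this model, the fmaps for the .meta files would be the first ones
absorbed during mount, so the primary daxdev gets fetched early without any
mount option.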

> 
> > The "poisoned page|folio problem"
> > 
> > * Background: before doing a kernel mount, the famfs user space [2] validates
> >   the superblock and log. This is done via raw mmap of the primary devdax
> >   device. If valid, the file system is mounted, and the superblock and log
> >   get exposed through a pair of files (.meta/.superblock and .meta/.log) -
> >   because we can't be using raw device mmap when a file system is mounted
> >   on the device. But this exposes a devdax bug and warning...
> > 
> > * Pages that have been memory mapped via devdax are left in a permanently
> >   problematic state. Devdax sets page|folio->mapping when a page is accessed
> >   via raw devdax mmap (as famfs does before mount), but never cleans it up.
> >   When the pages of the famfs superblock and log are accessed via the "meta"
> >   files after mount, we see a WARN_ONCE() in dax_insert_entry(), which
> >   notices that page|folio->mapping is still set. I intend to address this
> >   prior to asking for the famfs patches to be merged.
> > 
> > * Alistair Popple's recent dax patch series [6], which has been merged
> >   for 6.15, addresses some dax issues, but sadly does not fix the poisoned
> >   page|folio problem - its enhanced refcount checking turns the warning into
> >   an error.
> > 
> > * This 6.14 patch set disables the warning; a proper fix will be required for
> >   famfs to work at all in 6.15. Dan W. and I are actively discussing how to do
> >   this properly...
> > 
> > * In terms of the correct functionality of famfs, the warning can be ignored.
> > 
> > References
> > 
> > [1] - https://github.com/libfuse/libfuse/pull/1200
> > [2] - https://github.com/cxl-micron-reskit/famfs
> 
> Thanks for posting links, I'll have a look there too.
> 
> --D
> 

I'm happy to talk if you wanna kick ideas around.

Cheers,
John

