Message-ID: <20250428190010.GB1035866@frogsfrogsfrogs>
Date: Mon, 28 Apr 2025 12:00:10 -0700
From: "Darrick J. Wong" <djwong@...nel.org>
To: John Groves <John@...ves.net>
Cc: Dan Williams <dan.j.williams@...el.com>,
	Miklos Szeredi <miklos@...redi.hu>,
	Bernd Schubert <bschubert@....com>,
	John Groves <jgroves@...ron.com>, Jonathan Corbet <corbet@....net>,
	Vishal Verma <vishal.l.verma@...el.com>,
	Dave Jiang <dave.jiang@...el.com>,
	Matthew Wilcox <willy@...radead.org>, Jan Kara <jack@...e.cz>,
	Alexander Viro <viro@...iv.linux.org.uk>,
	Christian Brauner <brauner@...nel.org>,
	Luis Henriques <luis@...lia.com>,
	Randy Dunlap <rdunlap@...radead.org>,
	Jeff Layton <jlayton@...nel.org>,
	Kent Overstreet <kent.overstreet@...ux.dev>,
	Petr Vorel <pvorel@...e.cz>, Brian Foster <bfoster@...hat.com>,
	linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
	nvdimm@...ts.linux.dev, linux-cxl@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, Amir Goldstein <amir73il@...il.com>,
	Jonathan Cameron <Jonathan.Cameron@...wei.com>,
	Stefan Hajnoczi <shajnocz@...hat.com>,
	Joanne Koong <joannelkoong@...il.com>,
	Josef Bacik <josef@...icpanda.com>,
	Aravind Ramesh <arramesh@...ron.com>,
	Ajay Joshi <ajayjoshi@...ron.com>, 0@...ves.net
Subject: Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps

On Sun, Apr 27, 2025 at 08:48:30PM -0500, John Groves wrote:
> On 25/04/24 07:38AM, Darrick J. Wong wrote:
> > On Thu, Apr 24, 2025 at 08:43:33AM -0500, John Groves wrote:
> > > On 25/04/20 08:33PM, John Groves wrote:
> > > > On completion of GET_FMAP message/response, setup the full famfs
> > > > metadata such that it's possible to handle read/write/mmap directly to
> > > > dax. Note that the devdax_iomap plumbing is not in yet...
> > > > 
> > > > Update MAINTAINERS for the new files.
> > > > 
> > > > Signed-off-by: John Groves <john@...ves.net>
> > > > ---
> > > >  MAINTAINERS               |   9 +
> > > >  fs/fuse/Makefile          |   2 +-
> > > >  fs/fuse/dir.c             |   3 +
> > > >  fs/fuse/famfs.c           | 344 ++++++++++++++++++++++++++++++++++++++
> > > >  fs/fuse/famfs_kfmap.h     |  63 +++++++
> > > >  fs/fuse/fuse_i.h          |  16 +-
> > > >  fs/fuse/inode.c           |   2 +-
> > > >  include/uapi/linux/fuse.h |  42 +++++
> > > >  8 files changed, 477 insertions(+), 4 deletions(-)
> > > >  create mode 100644 fs/fuse/famfs.c
> > > >  create mode 100644 fs/fuse/famfs_kfmap.h
> > > > 
> > 
> > <snip>
> > 
> > > > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > > > index d85fb692cf3b..0f6ff1ffb23d 100644
> > > > --- a/include/uapi/linux/fuse.h
> > > > +++ b/include/uapi/linux/fuse.h
> > > > @@ -1286,4 +1286,46 @@ struct fuse_uring_cmd_req {
> > > >  	uint8_t padding[6];
> > > >  };
> > > >  
> > > > +/* Famfs fmap message components */
> > > > +
> > > > +#define FAMFS_FMAP_VERSION 1
> > > > +
> > > > +#define FUSE_FAMFS_MAX_EXTENTS 2
> > > > +#define FUSE_FAMFS_MAX_STRIPS 16
> > > 
> > > FYI, after thinking through the conversation with Darrick, I'm planning
> > > to drop FUSE_FAMFS_MAX_(EXTENTS|STRIPS) in the next version.  The
> > > response to GET_FMAP is just the structures below serialized into a
> > > message buffer.  If it fits, it's good - and if not, it's invalid.  When
> > > the in-memory metadata (defined in famfs_kfmap.h) gets assembled, limits
> > > can be applied there if there's a reason to - but I don't currently see a
> > > reason to do that (so if I'm currently enforcing limits there, I'll
> > > probably drop that).
> > 
> > You could also define GET_FMAP to have an offset in the request buffer,
> > and have the famfs daemon send back the next offset at the end of its
> > reply (or -1ULL to stop).  Then the kernel can call GET_FMAP again with
> > that new offset to get more mappings.
> > 
> > Though at this point maybe it should go the /other/ way, where the fuse
> > server can send a "notification" to the kernel to populate its mapping
> > data?  fuse already defines a handful of notifications for invalidating
> > pagecache and directory links.
> > 
> > (Ugly wart: notifications aren't yet implemented for the iouring channel)
> 
> I don't have fully-formed thoughts about notifications yet; thinking...

Me neither.  The existing ones seem like they /could/ be useful for
invalidating cached mappings, too.

> If the fmap stuff may be shared by more than one use case (as has always
> seemed possible), it's a good idea to think through a couple of things: 
> 1) is there anything important missing from this general approach, and 

Well, for general iomap caching, I think we'd need to pull in a lot more
of the iomap fields:

struct fuse_iomap {
	u64		addr;	/* disk offset of mapping, bytes */
	loff_t		offset;	/* file offset of mapping, bytes */
	u64		length;	/* length of mapping, bytes */
	u16		type;	/* type of mapping */
	u16		flags;	/* flags for mapping */
	u32		devindex;	/* selects the backing block/dax device */
	u64		validity_cookie; /* used with .iomap_valid() */
};

fuse would use devindex to find the block_device/dax_device, but
otherwise the fields are exactly the same as struct iomap.  Given that
this is exposed to userspace we'd probably want to add some padding.
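
To make the devindex idea concrete, here's a rough sketch of the
kernel-side translation, assuming the fuse connection carries a device
table set up at init time (struct fuse_iomap_dev, fc->iomap_devs and
fc->iomap_nr_devs are invented names, not anything in fs/fuse today):

/* Hypothetical sketch only -- none of these fuse_* names exist. */
struct fuse_iomap_dev {
	struct block_device	*bdev;		/* NULL for a DAX-only slot */
	struct dax_device	*dax_dev;
};

static int fuse_iomap_to_iomap(struct fuse_conn *fc,
			       const struct fuse_iomap *fm,
			       struct iomap *iomap)
{
	const struct fuse_iomap_dev *dev;

	/* devindex picks a backing device from a table the server registered */
	if (fm->devindex >= fc->iomap_nr_devs)
		return -EINVAL;
	dev = &fc->iomap_devs[fm->devindex];

	iomap->addr		= fm->addr;
	iomap->offset		= fm->offset;
	iomap->length		= fm->length;
	iomap->type		= fm->type;
	iomap->flags		= fm->flags;
	iomap->bdev		= dev->bdev;
	iomap->dax_dev		= dev->dax_dev;
	iomap->validity_cookie	= fm->validity_cookie;
	return 0;
}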

The validity cookie I'm not 100% sure about -- buffered IO uses it to
detect stale iomappings after we've locked a folio for write, having
dropped whatever locks protect the iomappings.  The ->iomap_valid
function compares the iomap::validity_cookie against some internal magic
value (which in this case would have to live in the iomap cache) to
decide if revalidation is needed.
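
Concretely the hook could be as small as this (a sketch only;
fuse_iomap_cache() and fic_cookie are invented names for a cache that
doesn't exist, wired up through whichever iomap ops structure carries
->iomap_valid):

static bool fuse_iomap_valid(struct inode *inode, const struct iomap *iomap)
{
	struct fuse_iomap_cache *cache = fuse_iomap_cache(inode);

	/* Still valid only if the cache hasn't changed since this iomap was sampled. */
	return iomap->validity_cookie == READ_ONCE(cache->fic_cookie);
}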

One way to make this work is to implement the cookie entirely within the
fuse-iomap cache itself -- every time a new mapping comes in (or a range
gets invalidated) the cache bumps its cookie.  The fuse server doesn't
have to implement the cookie itself, but it will have to push a new
mapping or invalidate something every time the mappings change.
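
Something like this, keeping the cookie entirely kernel-side (again a
sketch with made-up names; the helpers stand in for whatever tree or
list the cache ends up using):

static void fuse_iomap_cache_insert(struct fuse_iomap_cache *cache,
				    const struct fuse_iomap *fm)
{
	spin_lock(&cache->fic_lock);
	__fuse_iomap_cache_add(cache, fm);	/* hypothetical insert helper */
	cache->fic_cookie++;			/* stale out in-flight iomaps */
	spin_unlock(&cache->fic_lock);
}

static void fuse_iomap_cache_invalidate(struct fuse_iomap_cache *cache,
					loff_t start, u64 len)
{
	spin_lock(&cache->fic_lock);
	__fuse_iomap_cache_remove_range(cache, start, len);	/* hypothetical */
	cache->fic_cookie++;
	spin_unlock(&cache->fic_lock);
}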

Another way would be to have the fuse server implement the cookie
itself, but now we have to find a way to have the kernel and userspace
share a piece of memory where the cookie lives.  I don't like this
option, but it does give the fuse server direct control over when the
cookie value changes.

> 2) do you need to *partially* cache fmaps? (or is the "offset" idea above 
> just to deal with an fmap that might otherwise overflow a response size?)

It's mostly to cap the amount of mapping data being copied into the
kernel in a specific GET_FMAP call.  For famfs I don't think you have
that many mappings, but for (say) an XFS filesystem there could be
billions of them.

Though at that point it might make more sense to populate the cache
piecemeal as file IO actually happens.

I wouldn't split an existing mapping, FWIW.  Think "I have 1,000,000
mappings and I'm only going to upload them 1,000 at a time", not "I'm
going to upload mappings for 100MB worth of file range at a time".
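
For illustration, the offset-cursor variant I mentioned above could look
roughly like this on the wire (not proposed uapi, and not the famfs
GET_FMAP format from this series; the struct and field names are made
up):

#define FUSE_IOMAP_CURSOR_DONE	(~0ULL)

struct fuse_getmaps_in {
	uint64_t	cursor;		/* 0 on the first call */
	uint32_t	max_entries;	/* e.g. 1000 mappings per reply */
	uint32_t	padding;
};

struct fuse_getmaps_out {
	uint64_t	next_cursor;	/* resend this, or CURSOR_DONE */
	uint32_t	nr_entries;	/* fuse_iomap records that follow */
	uint32_t	padding;
	/* followed by nr_entries * struct fuse_iomap */
};

The kernel keeps calling with next_cursor until the server hands back
FUSE_IOMAP_CURSOR_DONE, so a file with a million mappings never costs
more than max_entries' worth of copying per request.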

> The current approach lets the kernel retrieve and cache simple and 
> interleaved fmaps (and BTW interleaved can be multi-dev or single-dev - 
> there are currently weird cases where that's useful). Also, FWIW, everything
> that can be done with simple ext list fmaps can be done with a collection
> of interleaved extents, each with strip count = 1. But I think there is a
> worthwhile clarity to having both.

<nod> I don't know what Miklos' opinion is about having multiple
fusecmds that do similar things -- on the one hand, keeping your efforts
and mine separate explodes the amount of userspace ABI that everyone must
maintain, but on the other hand it then doesn't couple our projects
together, which might be a good thing if it turns out that our domain
models are /really/ actually quite different.

(Especially because I suspect that interleaving is the norm for memory,
whereas we try to avoid that for disk filesystems.)

> But the current implementation does not contemplate partially cached fmaps.
> 
> Adding notifications could address revoking them post-haste (is that why
> you're thinking about notifications? And if not can you elaborate on what
> you're after there?).

Yeah, invalidating the mapping cache at random places.  If, say, you
implement a clustered filesystem with iomap, the metadata server could
inform the fuse server on the local node that a certain range of inode X
has been written to, at which point you need to revoke any local leases,
invalidate the pagecache, and invalidate the iomapping cache to force
the client to requery the server.
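
As a sketch, such a notification could be modeled on the existing
FUSE_NOTIFY_INVAL_INODE payload; neither the opcode nor the struct below
exists anywhere, it's only meant to show what "invalidate the mapping
cache for a range of inode X" might carry:

struct fuse_notify_inval_iomap_out {
	uint64_t	ino;	/* inode whose mappings went stale */
	uint64_t	offset;	/* start of the stale byte range */
	uint64_t	len;	/* length, or ~0ULL for "through EOF" */
};

On receipt the kernel would bump the cache cookie for that range (as in
the sketch above), so any IO already holding an iomap fails
->iomap_valid and requeries the server.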

Or if your fuse server wants to implement its own weird operations (e.g.
XFS EXCHANGE-RANGE) this would make that possible without needing to
add a bunch of code to fs/fuse/ for the benefit of a single fuse driver.

--D

> 
> > 
> > --D
> 
> Cheers,
> John
> 
> 
