Message-ID: <20250513040321.GO1035866@frogsfrogsfrogs>
Date: Mon, 12 May 2025 21:03:21 -0700
From: "Darrick J. Wong" <djwong@...nel.org>
To: John Groves <John@...ves.net>
Cc: Miklos Szeredi <miklos@...redi.hu>,
Dan Williams <dan.j.williams@...el.com>,
Bernd Schubert <bschubert@....com>,
John Groves <jgroves@...ron.com>, Jonathan Corbet <corbet@....net>,
Vishal Verma <vishal.l.verma@...el.com>,
Dave Jiang <dave.jiang@...el.com>,
Matthew Wilcox <willy@...radead.org>, Jan Kara <jack@...e.cz>,
Alexander Viro <viro@...iv.linux.org.uk>,
Christian Brauner <brauner@...nel.org>,
Luis Henriques <luis@...lia.com>,
Randy Dunlap <rdunlap@...radead.org>,
Jeff Layton <jlayton@...nel.org>,
Kent Overstreet <kent.overstreet@...ux.dev>,
Petr Vorel <pvorel@...e.cz>, Brian Foster <bfoster@...hat.com>,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
nvdimm@...ts.linux.dev, linux-cxl@...r.kernel.org,
linux-fsdevel@...r.kernel.org, Amir Goldstein <amir73il@...il.com>,
Jonathan Cameron <Jonathan.Cameron@...wei.com>,
Stefan Hajnoczi <shajnocz@...hat.com>,
Joanne Koong <joannelkoong@...il.com>,
Josef Bacik <josef@...icpanda.com>,
Aravind Ramesh <arramesh@...ron.com>,
Ajay Joshi <ajayjoshi@...ron.com>
Subject: Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps
On Mon, May 12, 2025 at 02:51:45PM -0500, John Groves wrote:
> On 25/05/06 06:56PM, Miklos Szeredi wrote:
> > On Mon, 28 Apr 2025 at 21:00, Darrick J. Wong <djwong@...nel.org> wrote:
> >
> > > <nod> I don't know what Miklos' opinion is about having multiple
> > > fusecmds that do similar things -- on the one hand keeping yours and my
> > > efforts separate explodes the amount of userspace abi that everyone must
> > > maintain, but on the other hand it then doesn't couple our projects
> > > together, which might be a good thing if it turns out that our domain
> > > models are /really/ actually quite different.
> >
> > Sharing the interface at least would definitely be worthwhile, as
> > there does not seem to be a great deal of difference between the
> > generic one and the famfs specific one. Only implementing part of the
> > functionality that the generic one provides would be fine.
>
> Agreed. I'm coming around to thinking the most practical approach would be
> to share the GET_FMAP message/response, but to add a separate response
> format for Darrick's use case - when the time comes. In this patch set,
> that starts with 'struct fuse_famfs_fmap_header' and is followed by the
> appropriate extent structures, serialized in the message. Collectively
> that's an fmap in message format.
Well in that case I might as well just plumb in the pieces I need as
separate fuse commands. fuse_args::opcode is a u32, so there's plenty of
space left.
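
(Mostly for my own notes, here's roughly how I read the fmap-in-message
layout described above.  The header name comes from the RFC, but the field
names and the extent record below are my guesses, not the actual wire
format.)

#include <stdint.h>

/*
 * Sketch of a GET_FMAP reply as I understand it: a fixed header
 * followed by a variable number of extent records.  Field names
 * here are illustrative, not copied from the patch set.
 */
struct fuse_famfs_fmap_header {
	uint8_t		fmap_version;	/* layout version of what follows */
	uint8_t		ext_type;	/* simple vs. interleaved extents */
	uint16_t	reserved;
	uint32_t	nextents;	/* number of extent records below */
	uint64_t	file_size;	/* logical bytes covered by the fmap */
};

struct fuse_famfs_simple_ext {
	uint64_t	dev_index;	/* which backing daxdev */
	uint64_t	dev_offset;	/* byte offset within that daxdev */
	uint64_t	length;		/* extent length in bytes */
};

/* Reply payload = one header, then header.nextents extent records. */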
> Side note: the current patch set sends back the logically-variable-sized
> fmap in a fixed-size message, but V2 of the series will address that;
> I got some help from Bernd there, but haven't finished it yet.
>
> So the next version of the patch set would, say, add a more generic first
> 'struct fmap_header' that would indicate whether the next item would be
> 'struct fuse_famfs_fmap_header' (i.e. my/famfs metadata) or some other
> to-be-codified metadata format. I'm going here because I'm dubious that
> we even *can* do grand-unified-fmap-metadata (or that we should try).
>
> This will require versioning the affected structures, unless we think
> the fmap-in-message structure can be opaque to the rest of fuse. @miklos,
> is there an example to follow regarding struct versioning in
> already-existing fuse structures?
/me is a n00b, but isn't that a simple matter of making sure that new
revisions change the structure size, and then you can key off of that?
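
Something like this is what I have in mind -- not real fuse code, just a
sketch of dispatching on the size the server actually sent, with made-up
v1/v2 struct names:

#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical v1 and v2 header layouts; v2 only appends fields. */
struct fmap_header_v1 {
	uint32_t	version;
	uint32_t	nextents;
};

struct fmap_header_v2 {
	uint32_t	version;
	uint32_t	nextents;
	uint64_t	flags;		/* new in v2 */
};

/* Pick a parser revision based on how many bytes the server sent. */
static int parse_fmap_header(const void *buf, size_t len)
{
	(void)buf;			/* a real parser would decode from here */

	if (len >= sizeof(struct fmap_header_v2))
		return 2;		/* parse the extra v2 fields */
	if (len >= sizeof(struct fmap_header_v1))
		return 1;		/* original layout only */
	return -EINVAL;			/* too short to be any known rev */
}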
> > > (Especially because I suspect that interleaving is the norm for memory,
> > > whereas we try to avoid that for disk filesystems.)
> >
> > So interleaved extents are just like normal ones except they repeat,
> > right? What about adding a special "repeat last N extent
> > descriptions" type of extent?
>
> It's a bit more than that. The comment at [1] makes it possible to understand
> the scheme, but I'd be happy to talk through it with you on a call if that
> seems helpful.
>
> An interleaved extent stripes data spread across N memory devices in raid 0
> format; the space from each device is described by a single simple extent
> (so it's contiguous), but it's not consumed contiguously - it's consumed in
> fixed-sized chunks that precess across the devices. Notwithstanding that I
> couldn't explain it very well when we talked about it at LPC, I think I
> could make it pretty clear in a pretty brief call now.
>
> In any case, you have my word that it's actually quite elegant :D
> (seriously, but also with a smile...)
Admittedly the more I think about the interleaving in famfs vs straight
block mappings for disk filesystems, the more I think they ought to be
separate interfaces for code that solves different problems. Then both
our codebases will remain relatively cohesive.
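
That said, just to check my own understanding of the striping (the names
below are mine, not famfs's): each device contributes one contiguous
extent, and the file is consumed in fixed-size chunks round-robin across
those devices, so the offset math would look roughly like:

#include <stdint.h>

/* Sketch of the raid0-style mapping I think is being described here;
 * chunk_size, ndevs, etc. are my names, not the famfs fields. */
struct interleaved_ext {
	uint64_t	chunk_size;	/* bytes per device before moving on */
	uint32_t	ndevs;		/* devices participating in the stripe */
};

/* Map a byte offset within the extent to (device index, device offset). */
static void famfs_map_offset(const struct interleaved_ext *ie, uint64_t pos,
			     uint32_t *dev, uint64_t *dev_off)
{
	uint64_t chunk = pos / ie->chunk_size;		/* which stripe chunk */

	*dev = chunk % ie->ndevs;			/* round-robin device */
	*dev_off = (chunk / ie->ndevs) * ie->chunk_size + /* whole chunks before */
		   pos % ie->chunk_size;		/* offset within chunk */
}

If that's approximately right then I'll grant the elegance claim, but it's
still a different shape of problem than a plain block mapping.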
> > > > But the current implementation does not contemplate partially cached fmaps.
> > > >
> > > > Adding notification could address revoking them post-haste (is that why
> > > > you're thinking about notifications? And if not can you elaborate on what
> > > > you're after there?).
> > >
> > > Yeah, invalidating the mapping cache at random places. If, say, you
> > > implement a clustered filesystem with iomap, the metadata server could
> > > inform the fuse server on the local node that a certain range of inode X
> > > has been written to, at which point you need to revoke any local leases,
> > > invalidate the pagecache, and invalidate the iomapping cache to force
> > > the client to requery the server.
> > >
> > > Or if your fuse server wants to implement its own weird operations (e.g.
> > > XFS EXCHANGE-RANGE) this would make that possible without needing to
> > > add a bunch of code to fs/fuse/ for the benefit of a single fuse driver.
> >
> > Wouldn't the existing invalidation framework be sufficient?
> >
> > Thanks,
> > Miklos
>
> My current thinking is that Darrick's use case doesn't need GET_DAXDEV, but
> famfs does. I think Darrick's use case has one backing device, and that should
> be passed in at mount time. Correct me if you think that might be wrong.
Technically speaking iomap can operate on /any/ block or dax device as
long as you have a reference to them. Once I get more of the plumbing
sorted out I'll start thinking about how to handle multi-device
filesystems like XFS which can put file data on more than 1 block
device.
I was thinking that the fuse server could just send a REGISTER_DEVICE
notification to the fuse driver (I know, again with the notifications
:)), the kernel replies with a magic cookie, and that's what gets passed
in the {read,write,map}_dev field.
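
Very roughly, and purely as a sketch (none of these structs exist; the
names are made up), something like:

#include <stdint.h>

/*
 * Made-up message pair for the REGISTER_DEVICE idea: the server names a
 * device, the kernel hands back an opaque cookie, and later mapping
 * replies refer to devices by cookie in the {read,write,map}_dev fields.
 */
struct fuse_register_device_in {
	uint32_t	flags;		/* e.g. block device vs. dax device */
	uint32_t	padding;
	char		path[256];	/* device node the server wants used */
};

struct fuse_register_device_out {
	uint64_t	cookie;		/* kernel-assigned device handle */
};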
Right now I reconfigured fuse2fs to present itself as a "fuseblk" driver
so that at least we know that inode->i_sb->s_bdev is a valid pointer.
It turns out to be useful because the kernel sends FUSE_DESTROY commands
synchronously during unmount, which avoids the situation where umount
exits but the block device still can't be opened O_EXCL because the fuse
server program is still exiting. It might also be useful someday for
wiring up some of the block device ops to fuse servers, though I think
that could conflict with CONFIG_BLK_DEV_WRITE_MOUNTED=y.
I just barely got directio writes and pagecache read/write working
through iomap today, though I'm still getting used to the fuse inode
locking model and sorting through the bugs. :)
(I wonder how nasty it would be to pass fds to the fuse kernel driver
from fuseblk servers?)
> Famfs doesn't necessarily have just one backing dev, which means that famfs
> could pass in the *primary* backing dev at mount time, but it would still
> need GET_DAXDEV to get the rest. But if I just use GET_FMAP every time, I
> only need one way to do this.
>
> I'll add a few more responses to Darrick's reply...
Hehhe onto that message go I.
--D
>
> Thanks,
> John
>
> [1] https://github.com/cxl-micron-reskit/famfs-linux/blob/c57553c4ca91f0634f137285840ab25be8a87c30/fs/fuse/famfs_kfmap.h#L13
>
>