[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20260122003401.GL5966@frogsfrogsfrogs>
Date: Wed, 21 Jan 2026 16:34:01 -0800
From: "Darrick J. Wong" <djwong@...nel.org>
To: Joanne Koong <joannelkoong@...il.com>
Cc: miklos@...redi.hu, bernd@...ernd.com, neal@...pa.dev,
linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH 01/31] fuse: implement the basic iomap mechanisms
On Wed, Jan 21, 2026 at 04:06:39PM -0800, Joanne Koong wrote:
> On Wed, Jan 21, 2026 at 2:45 PM Darrick J. Wong <djwong@...nel.org> wrote:
> >
> > On Wed, Jan 21, 2026 at 11:34:24AM -0800, Joanne Koong wrote:
> > > On Tue, Oct 28, 2025 at 5:45 PM Darrick J. Wong <djwong@...nel.org> wrote:
> > > >
> > > > From: Darrick J. Wong <djwong@...nel.org>
> > > >
> > > > Implement functions to enable upcalling of iomap_begin and iomap_end to
> > > > userspace fuse servers.
> > > >
> > > > Signed-off-by: "Darrick J. Wong" <djwong@...nel.org>
> > > > ---
> > > > fs/fuse/fuse_i.h | 22 ++
> > > > fs/fuse/iomap_i.h | 36 ++++
> > > > include/uapi/linux/fuse.h | 90 +++++++++
> > > > fs/fuse/Kconfig | 32 +++
> > > > fs/fuse/Makefile | 1
> > > > fs/fuse/file_iomap.c | 434 +++++++++++++++++++++++++++++++++++++++++++++
> > > > fs/fuse/inode.c | 8 +
> > > > 7 files changed, 621 insertions(+), 2 deletions(-)
> > > > create mode 100644 fs/fuse/iomap_i.h
> > > > create mode 100644 fs/fuse/file_iomap.c
> > > >
> > > >
> > > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > > index 7c7d255d817f1e..45be59df7ae592 100644
> > > > --- a/fs/fuse/fuse_i.h
> > > > +++ b/fs/fuse/fuse_i.h
> > > > diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
> > > > index 18713cfaf09171..7d709cf12b41a7 100644
> > > > --- a/include/uapi/linux/fuse.h
> > > > +++ b/include/uapi/linux/fuse.h
> > > > @@ -240,6 +240,9 @@
> > > > * - add FUSE_COPY_FILE_RANGE_64
> > > > * - add struct fuse_copy_file_range_out
> > > > * - add FUSE_NOTIFY_PRUNE
> > > > + *
> > > > + * 7.99
> > >
> > > Should this be changed to something like 7.46 now that this patch is
> > > submitted for merging into the tree?
> >
> > When review of this patchset nears completion I'll change the 99s to
> > 46 or whatever the fuse/libfuse minor version happens to be at that
> > point.
>
> Sounds good.
I'll add another XXX comment here to increase the likelihood it doesn't
get missed.
> >
> > Nobody's touched this series since 29 October (during 6.19 development)
> > and I've been busy with xfs_healer so I'm not submitting this for 7.0
> > either.
> >
> > > > + * - add FUSE_IOMAP and iomap_{begin,end,ioend} for regular file operations
> > > > */
> > > >
> > > > +/* fuse-specific mapping type indicating that writes use the read mapping */
> > > > +#define FUSE_IOMAP_TYPE_PURE_OVERWRITE (255)
> > > > +
> > > > +#define FUSE_IOMAP_DEV_NULL (0U) /* null device cookie */
> > > > +
> > > > +/* mapping flags passed back from iomap_begin; see corresponding IOMAP_F_ */
> > > > +#define FUSE_IOMAP_F_NEW (1U << 0)
> > > > +#define FUSE_IOMAP_F_DIRTY (1U << 1)
> > > > +#define FUSE_IOMAP_F_SHARED (1U << 2)
> > > > +#define FUSE_IOMAP_F_MERGED (1U << 3)
> > > > +#define FUSE_IOMAP_F_BOUNDARY (1U << 4)
> > > > +#define FUSE_IOMAP_F_ANON_WRITE (1U << 5)
> > > > +#define FUSE_IOMAP_F_ATOMIC_BIO (1U << 6)
> > >
> > > Do you think it makes sense to have the fuse iomap constants mirror
> > > the in-kernel iomap ones? Maybe I'm mistaken but it seems like the
> > > fuse iomap capabilities won't diverge too much from fs/iomap ones? I
> > > like that if they're mirrored, then it makes it simpler instead of
> > > needing to convert back and forth.
> >
> > "Mirrored"? As in, having the define use a symbol:
> >
> > #define FUSE_IOMAP_F_NEW IOMAP_F_NEW
> >
> > instead of defining it to be a specific numerical constant like it is
> > here?
>
> I was thinking keeping it like it is with defining it to a specific
> numerical constant, but having the number correspond to the number
> iomap.h uses and having static asserts to ensure they match, and then
> being able to just pass struct fuse_iomap_io's flags directly to
> iomap->flags and vice versa. But I guess the iomap constants could
> change at any time since it's not a uapi.
Yep. iomap's api stability is only guaranteed until the mtime changes
on include/linux/iomap.h.
I actually /did/ do the static assert thing earlier in the lifetime of
this patchset, but then I godbolted what the conversion functions were
actually doing and observed that gcc and clang are smart enough to
collapse all the C code into the appropriate masking if you compile with
-O2.
<snip>
> > > > +struct fuse_iomap_io {
> > > > + uint64_t offset; /* file offset of mapping, bytes */
> > > > + uint64_t length; /* length of mapping, bytes */
> > > > + uint64_t addr; /* disk offset of mapping, bytes */
> > > > + uint16_t type; /* FUSE_IOMAP_TYPE_* */
> > > > + uint16_t flags; /* FUSE_IOMAP_F_* */
> > > > + uint32_t dev; /* device cookie */
> > >
> > > Do you think it's a good idea to add a reserved field here in case we
> > > end up needing it in the future?
> >
> > I'm open to the idea of pre-padding the structs, though that's extra
> > copy overhead until they get used for something.
>
> Bernd would know better than me on this, but iirc, fuse generally
> tries to prepad structs to avoid having to deal with backwards
> compatibility issues if future fields get added.
<nod> for xfs I've generally added one u64 unless two would round us up
to a cacheline... or just defined the struct size to be something insane
like 512 bytes.
> >
> > Does that fuse-iouring-zerocopy patchset that you're working on enable
> > the kernel to avoid copying fuse command data around? I haven't read it
> > in sufficient (or any) detail to know the answer to that question.
>
> No, only the payload bypasses the copy. All the header stuff would
> have to get copied out to the ring.
D'oh! :/
> >
> > Second: how easy is it to send a variable sized fuse command to
> > userspace? It looks like some commands like FUSE_WRITE do things like:
> >
> > if (ff->fm->fc->minor < 9)
> > args->in_args[0].size = FUSE_COMPAT_WRITE_IN_SIZE;
> > else
> > args->in_args[0].size = sizeof(ia->write.in);
> > args->in_args[0].value = &ia->write.in;
> > args->in_args[1].size = count;
> >
> > Which means that future expansion can (in theory) bump the minor version
> > and send larer commands.
> >
> > It also looks like the kernel can support receiving variable-sized
> > responses, like FUSE_READ does:
> >
> > args->out_argvar = true;
> > args->out_numargs = 1;
> > args->out_args[0].size = count;
> >
> > I think this means that if we ever needed to expand the _out struct to
> > allow the fuse server to send back a more lengthy response, we could
> > potentially do that without needing a minor protocol version bump.
>
> I'm not sure, Bernd or Miklos would know more, but my general
> impression has been that we try to avoid doing the FUSE_COMPAT_ stuff
> if we can.
<nod> revving the minor protocol version will take time to propagate.
<snip>
> > > > +};
> > > > +
> > > > +struct fuse_iomap_end_in {
> > > > + uint32_t opflags; /* FUSE_IOMAP_OP_* */
> > > > + uint32_t reserved; /* zero */
> > > > + uint64_t attr_ino; /* matches fuse_attr:ino */
> > > > + uint64_t pos; /* file position, in bytes */
> > > > + uint64_t count; /* operation length, in bytes */
> > > > + int64_t written; /* bytes processed */
> > >
> > > On the fs/iomap side, I see that written is passed through by
> > > iomap_iter() to ->iomap_end through 'ssize_t advanced' but it's not
> > > clear to me why advanced needs to be signed. I think it used to also
> > > represent the error status, but it looks like now that's represented
> > > through iter->status and 'advanced' strictly reflects the number of
> > > bytes written. As such, do you think it makes sense to change
> > > 'advanced' to loff_t and have written be uint64_t instead?
> >
> > Not quite -- back in the bad old days, iomap_iter::processed was a s64
> > value that the iteration loop had to set to one of:
> >
> > * a positive number for positive progress
> > * zero to stop the iteration
> > * a negative errno to fail out
> >
> > Nowadays we just move iomap_iter::pos forward via iomap_iter_advance or
> > set status to a negative number to end the iteration.
Slight inaccuracy: one sets iter->status to a negative number to fail
out of the iteration. To end early, they should call iomap_iter without
calling iomap_iter_advance.
> > So yes, I think @advanced should be widened to 64-bits since iomap
> > operations can jump more than 2GB per iter step. Practically speaking I
> > think this hasn't yet been a problem because the only operations that
> > can do that (fiemap, seek, swap) also don't have any client filesystems
> > that implement iomap_end; or they do but never send mappings large
> > enough to cause problems.
> >
> > iomap iters can't go backwards so @advanced could be u64 as well.
> >
> > Also the name of the ->iomap_end parameter could be changed to
> > "advanced" because iomap_end could in theory be called for any
> > operation, not just writes. That's a throwback to the days when the
> > iomap code was just part of xfs. It also is an unsigned quantity.
>
> That makes sense, thanks for the context.
<nod>
<snip>
> > > > +/* Convert a mapping from the server into something the kernel can use */
> > > > +static inline void fuse_iomap_from_server(struct inode *inode,
> > >
> > > Maybe worth adding a const in front of struct inode?
> >
> > It can go away in a patch or two when we wire up bdev support.
> >
> > Though considering that fuse_iomap_enabled returns false all the way to
> > the end of the patchset I guess I could just set bdev to null and skip
> > passing in the inode at all.
Done.
> > > > + struct iomap *iomap,
> > > > + const struct fuse_iomap_io *fmap)
> > > > +{
> > > > + iomap->addr = fmap->addr;
> > > > + iomap->offset = fmap->offset;
> > > > + iomap->length = fmap->length;
> > > > + iomap->type = fuse_iomap_type_from_server(fmap->type);
> > > > + iomap->flags = fuse_iomap_flags_from_server(fmap->flags);
> > > > + iomap->bdev = inode->i_sb->s_bdev; /* XXX */
> > > > +}
> > > > +
> > > > +/* Convert a mapping from the kernel into something the server can use */
> > > > +static inline void fuse_iomap_to_server(struct fuse_iomap_io *fmap,
> > > > + const struct iomap *iomap)
> > > > +{
> > > > + fmap->addr = FUSE_IOMAP_NULL_ADDR; /* XXX */
> > > > + fmap->offset = iomap->offset;
> > > > + fmap->length = iomap->length;
> > > > + fmap->type = fuse_iomap_type_to_server(iomap->type);
> > > > + fmap->flags = fuse_iomap_flags_to_server(iomap->flags);
> > > > + fmap->dev = FUSE_IOMAP_DEV_NULL; /* XXX */
> > >
> > > AFAICT, this only gets used for sending the FUSE_IOMAP_END request. Is
> > > passing the iomap->addr to fmap->addr and inode->i_sb->s_bdev to
> > > fmap->dev not useful to the server here?
> >
> > So far the only fields I've needed in fuse4fs are the
> > offset/count/written fields as provided by iomap_iter, and the flags
> > field from the mapping. The addr field isn't necessary for fuse4fs
> > because the fuse server would know if the mapping had changed. OTOH
> > it's probably harmless to send it along.
> >
> > Hrm. I probably need a way to look up the backing_id from the iomap
> > bdev.
> >
> > Looking further ahead at the ioend patch, I just realized that iomap
> > ioends can tell you the new address of a write-append operation but they
> > don't tell you which device. I guess you can read that from the
> > ioend->io_bio.bi_bdev.
> >
> > > Also, did you mean to leave in the /* XXX */ comments?
> >
> > Yes, because they're a reminder to come back and check if I /ever/
> > needed them.
>
> Makes sense, seems like you're planning to remove them when the patch
> is ready to merge, if I understand correctly.
Yeah. I also fixed this fuse_iomap_to_server to set fmap->dev.
<snip>
> > > > +static int fuse_iomap_end(struct inode *inode, loff_t pos, loff_t count,
> > > > + ssize_t written, unsigned opflags,
> > > > + struct iomap *iomap)
> > > > +{
> > > > + struct fuse_inode *fi = get_fuse_inode(inode);
> > > > + struct fuse_mount *fm = get_fuse_mount(inode);
> > > > + int err = 0;
> > > > +
> > > > + if (fuse_should_send_iomap_end(iomap, opflags, count, written)) {
> > > > + struct fuse_iomap_end_in inarg = {
> > > > + .opflags = fuse_iomap_op_to_server(opflags),
> > > > + .attr_ino = fi->orig_ino,
> > > > + .pos = pos,
> > > > + .count = count,
> > > > + .written = written,
> > > > + };
> > > > + FUSE_ARGS(args);
> > > > +
> > > > + fuse_iomap_to_server(&inarg.map, iomap);
> > > > +
> > > > + args.opcode = FUSE_IOMAP_END;
> > > > + args.nodeid = get_node_id(inode);
> > >
> > > Just curious about this - does it make sense to set args.force here
> > > for this opcode? It seems like it serves the same sort of purpose a
> > > flush request (which sets args.force) does?
> >
> > What does args.force do? There's no documentation of what behaviors
> > these fields are supposed to trigger.
>
> The args.force forces the request to be sent even if it gets
> interrupted by a signal. It'll also bypass the fuse_block_alloc()
> check when sending the request, but I don't think that's too relevant
> to this case.
Hrm. For iomap_begin I think it's ok if a signal kills the IO
operation. For iomap_end ... I guess we really should force the command
out to the server in case it needs to clean up, even if the user is
hammering on kill -9.
For iomap_ioend the same probably applies, but it's called from
workqueue context so there's not going to be a fatal signal. But maybe
we should do that, just in case someone develops motivation to make
directio completions run in the caller's context or something.
--D
Powered by blists - more mailing lists