Message-ID: <CAJnrk1bSVy4=c=N_FfOajs1FE4o8T=Br=jFm7gBDaCGvRpgGVA@mail.gmail.com>
Date: Tue, 27 Jan 2026 11:47:31 -0800
From: Joanne Koong <joannelkoong@...il.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: miklos@...redi.hu, bernd@...ernd.com, neal@...pa.dev,
linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCHSET v6 4/8] fuse: allow servers to use iomap for better file IO performance

On Mon, Jan 26, 2026 at 6:22 PM Darrick J. Wong <djwong@...nel.org> wrote:
>
> On Mon, Jan 26, 2026 at 04:59:16PM -0800, Joanne Koong wrote:
> > On Tue, Oct 28, 2025 at 5:38 PM Darrick J. Wong <djwong@...nel.org> wrote:
> > >
> > > Hi all,
> > >
> > > This series connects fuse (the userspace filesystem layer) to fs-iomap
> > > to get fuse servers out of the business of handling file I/O themselves.
> > > By keeping the IO path mostly within the kernel, we can dramatically
> > > improve the speed of disk-based filesystems. This enables us to move
> > > all the filesystem metadata parsing code out of the kernel and into
> > > userspace, which means that we can containerize them for security
> > > without losing a lot of performance.
> >
> > I haven't looked through how the fuse2fs or fuse4fs servers are
> > implemented yet (also, could you explain the difference between the
> > two? Which one should we look at to see how it all ties together?),
>
> fuse4fs is a lowlevel fuse server; fuse2fs is a high(?) level fuse
> server. fuse4fs is the successor to fuse2fs, at least on Linux and BSD.

Ah, I see, thanks for the explanation. In that case, I'll just look at
fuse4fs.

>
> > but I wonder if having bpf infrastructure hooked up to fuse would be
> > especially helpful for what you're doing here with fuse iomap. afaict,
> > every read/write whether it's buffered or direct will incur at least 1
> > call to ->iomap_begin() to get the mapping metadata, which will be 2
> > context-switches (and if the server has ->iomap_end() implemented,
> > then 2 more context-switches).
>
> Yes, I agree that's a lot of context switching for file IO...
>
> > But it seems like the logic for retrieving mapping
> > offsets/lengths/metadata should be pretty straightforward?
>
> ...but it gets very cheap if the fuse server can cache mappings in the
> kernel to avoid all that. That is, incidentally, what patchset #7
> implements.
>
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-cache_2026-01-22
>
> > If the extent lookups are table lookups or tree
> > traversals without complex side effects, then having
> > ->iomap_begin()/->iomap_end() be executed as a bpf program would avoid
> > the context switches and allow all the caching logic to be moved from
> > the kernel to the server-side (eg using bpf maps).
>
> Hrmm. Now that /is/ an interesting proposal. Does BPF have a data
> structure that supports interval mappings? I think the existing bpf map

Not yet, but I don't see why a B+-tree-like data structure couldn't be
added. In the meantime, one workaround might be to keep the extents in a
sorted array map and binary-search it until interval mappings are
natively supported?

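To make the sorted-array idea concrete, here is a minimal sketch of the
lookup in plain userspace C. It is not a BPF program: struct extent and
extent_lookup() are hypothetical names for illustration, not the fuse
iomap uapi or any existing bpf map type. A real BPF version would also
need a verifier-friendly loop bound and map accessors, which are left
out here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record; NOT the fuse iomap uapi layout. */
struct extent {
	uint64_t file_off;	/* first file byte covered by this extent */
	uint64_t len;		/* length in bytes, assumed to be > 0 */
	uint64_t dev_addr;	/* device byte address backing file_off */
};

/*
 * Look up the extent covering @pos in @tbl, which is assumed to be
 * sorted by file_off and to contain no overlapping extents.
 * Returns NULL if @pos falls into a hole or the table is empty.
 */
static const struct extent *
extent_lookup(const struct extent *tbl, size_t nr, uint64_t pos)
{
	const struct extent *e;
	size_t lo = 0, hi = nr;

	/* Binary search for the first entry whose file_off is > pos. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (tbl[mid].file_off <= pos)
			lo = mid + 1;
		else
			hi = mid;
	}

	/* No entry starts at or before pos. */
	if (lo == 0)
		return NULL;

	/* The only candidate is the last entry starting at or before pos. */
	e = &tbl[lo - 1];
	return (pos - e->file_off < e->len) ? e : NULL;
}
```

The search finds the last extent that starts at or before the offset and
then checks that the offset is inside its length, so holes between
extents come back as NULL. The cost is O(log n) per lookup, but inserting
into the middle of a sorted array is O(n), which is the main reason a
tree would still be preferable for files whose mappings change often.
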
> only does key -> value. Also, is there an upper limit on the size of a
> map? You could have hundreds of millions of maps for a very fragmented
> regular file.

If I'm remembering correctly, there's an upper limit on the number of
map entries, which is bounded by a u32.

>
> At one point I suggested to the famfs maintainer that it might be
> easier/better to implement the interleaved mapping lookups as bpf
> programs instead of being stuck with a fixed format in the fuse
> userspace abi, but I don't know if he ever implemented that.

This seems like a good use case for it too.

>
> > Is this your
> > assessment of it as well or do you think the server-side logic for
> > iomap_begin()/iomap_end() is too complicated to make this realistic?
> > Asking because I'm curious whether this direction makes sense, not
> > because I think it would be a blocker for your series.
>
> For disk-based filesystems I think it would be difficult to model a bpf
> program to do mappings, since they can basically point anywhere and be
> of any size.

Hmm, I'm not familiar enough with disk-based filesystems to know what
"point anywhere and be of any size" means. For the mapping stuff,
doesn't it just point to a block number? Or are you saying the problem
would be that there are too many mappings, since a mapping could be any
size?

I was thinking the issue would be more that there might be other logic
inside ->iomap_begin()/->iomap_end(), besides the mapping stuff, that
would be too far out of scope for bpf. But I think I need to read
through the fuse4fs code to better understand what it's doing in those
functions.

Thanks,
Joanne
>
> OTOH it would be enormously hilarious to me if one could load a file
> mapping predictive model into the kernel as a bpf program and use that
> as a first tier before checking the in-memory btree mapping cache from
> patchset 7. Quite a few years ago now there was a FAST paper
> establishing that even a stupid linear regression model could in theory
> beat a disk btree lookup.
>
> --D
>
> > Thanks,
> > Joanne
> >
> > >
> > > If you're going to start using this code, I strongly recommend pulling
> > > from my git trees, which are linked below.
> > >
> > > This has been running on the djcloud for months with no problems. Enjoy!
> > > Comments and questions are, as always, welcome.
> > >
> > > --D
> > >
> > > kernel git tree:
> > > https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-fileio
> > > ---
> > > Commits in this patchset:
> > > * fuse: implement the basic iomap mechanisms
> > > * fuse_trace: implement the basic iomap mechanisms
> > > * fuse: make debugging configurable at runtime
> > > * fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
> > > * fuse_trace: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
> > > * fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount
> > > * fuse: create a per-inode flag for toggling iomap
> > > * fuse_trace: create a per-inode flag for toggling iomap
> > > * fuse: isolate the other regular file IO paths from iomap
> > > * fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
> > > * fuse_trace: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
> > > * fuse: implement direct IO with iomap
> > > * fuse_trace: implement direct IO with iomap
> > > * fuse: implement buffered IO with iomap
> > > * fuse_trace: implement buffered IO with iomap
> > > * fuse: implement large folios for iomap pagecache files
> > > * fuse: use an unrestricted backing device with iomap pagecache io
> > > * fuse: advertise support for iomap
> > > * fuse: query filesystem geometry when using iomap
> > > * fuse_trace: query filesystem geometry when using iomap
> > > * fuse: implement fadvise for iomap files
> > > * fuse: invalidate ranges of block devices being used for iomap
> > > * fuse_trace: invalidate ranges of block devices being used for iomap
> > > * fuse: implement inline data file IO via iomap
> > > * fuse_trace: implement inline data file IO via iomap
> > > * fuse: allow more statx fields
> > > * fuse: support atomic writes with iomap
> > > * fuse_trace: support atomic writes with iomap
> > > * fuse: disable direct reclaim for any fuse server that uses iomap
> > > * fuse: enable swapfile activation on iomap
> > > * fuse: implement freeze and shutdowns for iomap filesystems
> > > ---
> > > fs/fuse/fuse_i.h | 161 +++
> > > fs/fuse/fuse_trace.h | 939 +++++++++++++++++++
> > > fs/fuse/iomap_i.h | 52 +
> > > include/uapi/linux/fuse.h | 219 ++++
> > > fs/fuse/Kconfig | 48 +
> > > fs/fuse/Makefile | 1
> > > fs/fuse/backing.c | 12
> > > fs/fuse/dev.c | 30 +
> > > fs/fuse/dir.c | 120 ++
> > > fs/fuse/file.c | 133 ++-
> > > fs/fuse/file_iomap.c | 2230 +++++++++++++++++++++++++++++++++++++++++++++
> > > fs/fuse/inode.c | 162 +++
> > > fs/fuse/iomode.c | 2
> > > fs/fuse/trace.c | 2
> > > 14 files changed, 4056 insertions(+), 55 deletions(-)
> > > create mode 100644 fs/fuse/iomap_i.h
> > > create mode 100644 fs/fuse/file_iomap.c
> > >
> >