Message-ID: <20260128003431.GX5910@frogsfrogsfrogs>
Date: Tue, 27 Jan 2026 16:34:31 -0800
From: "Darrick J. Wong" <djwong@...nel.org>
To: Joanne Koong <joannelkoong@...il.com>
Cc: miklos@...redi.hu, bernd@...ernd.com, neal@...pa.dev,
	linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCHSET v6 4/8] fuse: allow servers to use iomap for better
	file IO performance

On Tue, Jan 27, 2026 at 04:10:43PM -0800, Joanne Koong wrote:
> On Tue, Jan 27, 2026 at 3:21 PM Darrick J. Wong <djwong@...nel.org> wrote:
> >
> > On Tue, Jan 27, 2026 at 11:47:31AM -0800, Joanne Koong wrote:
> > > On Mon, Jan 26, 2026 at 6:22 PM Darrick J. Wong <djwong@...nel.org> wrote:
> > > >
> > > > On Mon, Jan 26, 2026 at 04:59:16PM -0800, Joanne Koong wrote:
> > > > > On Tue, Oct 28, 2025 at 5:38 PM Darrick J. Wong <djwong@...nel.org> wrote:
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > This series connects fuse (the userspace filesystem layer) to fs-iomap
> > > > > > to get fuse servers out of the business of handling file I/O themselves.
> > > > > > By keeping the IO path mostly within the kernel, we can dramatically
> > > > > > improve the speed of disk-based filesystems. This enables us to move
> > > > > > all the filesystem metadata parsing code out of the kernel and into
> > > > > > userspace, which means that we can containerize them for security
> > > > > > without losing a lot of performance.
> > > > >
> > > > > I haven't looked through how the fuse2fs or fuse4fs servers are
> > > > > implemented yet (also, could you explain the difference between the
> > > > > two? Which one should we look at to see how it all ties together?),
> > > >
> > > > fuse4fs is a lowlevel fuse server; fuse2fs is a high(?) level fuse
> > > > server. fuse4fs is the successor to fuse2fs, at least on Linux and BSD.
> > >
> > > Ah I see, thanks for the explanation. In that case, I'll just look at
> > > fuse4fs then.
> > >
> > > >
> > > > > but I wonder if having bpf infrastructure hooked up to fuse would be
> > > > > especially helpful for what you're doing here with fuse iomap. afaict,
> > > > > every read/write whether it's buffered or direct will incur at least 1
> > > > > call to ->iomap_begin() to get the mapping metadata, which will be 2
> > > > > context-switches (and if the server has ->iomap_end() implemented,
> > > > > then 2 more context-switches).
> > > >
> > > > Yes, I agree that's a lot of context switching for file IO...
> > > >
> > > > > But it seems like the logic for retrieving mapping
> > > > > offsets/lengths/metadata should be pretty straightforward?
> > > >
> > > > ...but it gets very cheap if the fuse server can cache mappings in the
> > > > kernel to avoid all that. That is, incidentally, what patchset #7
> > > > implements.
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-cache_2026-01-22
> > > >
> > > > > If the extent lookups are table lookups or tree
> > > > > traversals without complex side effects, then having
> > > > > ->iomap_begin()/->iomap_end() be executed as a bpf program would avoid
> > > > > the context switches and allow all the caching logic to be moved from
> > > > > the kernel to the server-side (eg using bpf maps).
> > > >
> > > > Hrmm. Now that /is/ an interesting proposal. Does BPF have a data
> > > > structure that supports interval mappings? I think the existing bpf map
> > >
> > > Not yet but I don't see why a B+-tree-like data structure couldn't be added.
> > > Maybe one workaround in the meantime that could work is using a sorted
> > > array map and doing binary search on that, until interval mappings can
> > > be natively supported?
> >
> > I guess, though I already had a C structure to borrow from xfs ;)
> >
> > > > only does key -> value. Also, is there an upper limit on the size of a
> > > > map? You could have hundreds of millions of mappings for a very
> > > > fragmented regular file.
> > >
> > > If I'm remembering correctly, there's an upper limit on the number of
> > > map entries, which is bounded by u32
> >
> > That's problematic, since files can have 64-bit logical block numbers.
>
> The key size supports 64-bits. The u32 bound would be the limit on the
> number of extents for the file.

Oh, ok. If one treats the incore map as a cache and evicts things when
they get too old, then that would be fine. I misread that as an upper
limit on the *range* of the map entry keys. :/

As it stands, I need to figure out a way to trim the iomap btree when
memory gets tight. Right now it'll drop the cache whenever someone
closes the file, but that won't help for long-lived processes that open a
heavily fragmented file and never close it.

A coding-intensive way to do that would be to register a shrinker and
deal with that, but ugh. A really stupid way would be to drop the whole
cache once you get beyond (say) 64k of memory usage (~2000 mappings).

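A minimal userspace sketch of the "really stupid" variant, to show how
little code it needs. Everything here is hypothetical: the demo_* names,
the flat array standing in for the btree, and the 32-byte record, which
is an assumption chosen so that 64k holds 2048 records, close to the
~2000 figure above. It is not the patchset's cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cached mapping record (32 bytes); not the patchset's layout. */
struct demo_mapping {
	uint64_t offset;	/* file offset, in bytes */
	uint64_t length;	/* mapping length, in bytes */
	uint64_t addr;		/* device address, in bytes */
	uint32_t type;
	uint32_t flags;
};

#define DEMO_CACHE_BUDGET	(64 * 1024)	/* the "(say) 64k" budget */

/* A flat array stands in for the btree to keep the sketch short. */
struct demo_cache {
	struct demo_mapping *recs;
	size_t nr;
	size_t cap;
};

static size_t demo_cache_bytes(const struct demo_cache *c)
{
	return c->nr * sizeof(struct demo_mapping);
}

static void demo_cache_drop_all(struct demo_cache *c)
{
	free(c->recs);
	c->recs = NULL;
	c->nr = 0;
	c->cap = 0;
}

/* Add one record, dropping the whole cache first if it would exceed the
 * budget. Returns 0 on success, -1 on allocation failure. */
static int demo_cache_add(struct demo_cache *c, const struct demo_mapping *m)
{
	assert(c && m);

	if (demo_cache_bytes(c) + sizeof(*m) > DEMO_CACHE_BUDGET)
		demo_cache_drop_all(c);

	if (c->nr == c->cap) {
		size_t ncap = c->cap ? c->cap * 2 : 64;
		struct demo_mapping *n = realloc(c->recs, ncap * sizeof(*n));

		if (!n)
			return -1;
		c->recs = n;
		c->cap = ncap;
	}
	c->recs[c->nr++] = *m;
	return 0;
}
```

The obvious cost is that a sequential reader of a heavily fragmented
file refills the cache from scratch every time the budget is hit, which
is the tradeoff against writing a shrinker.
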
> > > > At one point I suggested to the famfs maintainer that it might be
> > > > easier/better to implement the interleaved mapping lookups as bpf
> > > > programs instead of being stuck with a fixed format in the fuse
> > > > userspace abi, but I don't know if he ever implemented that.
> > >
> > > This seems like a good use case for it too
> > > >
> > > > > Is this your
> > > > > assessment of it as well or do you think the server-side logic for
> > > > > iomap_begin()/iomap_end() is too complicated to make this realistic?
> > > > > Asking because I'm curious whether this direction makes sense, not
> > > > > because I think it would be a blocker for your series.
> > > >
> > > > For disk-based filesystems I think it would be difficult to model a bpf
> > > > program to do mappings, since they can basically point anywhere and be
> > > > of any size.
> > >
> > > Hmm I'm not familiar enough with disk-based filesystems to know what
> > > the "point anywhere and be of any size" means. For the mapping stuff,
> > > doesn't it just point to a block number? Or are you saying the problem
> > > would be there's too many mappings since a mapping could be any size?
> >
> > The second -- mappings can be any size, and unprivileged userspace can
> > control the mappings.
>
> If I'm understanding what you're saying here, this is the same
> discussion as the one above about the u32 bound, correct?

A different thing -- file data mappings are irregularly sized, can
contain sparse holes, etc. Userspace controls the size and offset of
each mapping record (thanks to magic things like fallocate) so it'd be
very difficult to create a bpf program to generate mappings on the fly.
Also you could have 2^33 mapping records for a file, so I think you
can't even write a bpf program that large.

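To make "irregularly sized, can contain sparse holes" concrete, here is
a sketch of the lookup that any table-driven approach has to perform,
using the sorted-array-plus-binary-search idea suggested earlier in the
thread. The demo_extent layout and demo_lookup() are hypothetical names
for illustration; this is plain userspace C, not a bpf program and not
code from fuse4fs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record: maps a file range to a device address. */
struct demo_extent {
	uint64_t offset;	/* file offset, in bytes */
	uint64_t length;	/* extent length, in bytes */
	uint64_t addr;		/* device address, in bytes */
};

/*
 * Return the extent containing pos, or NULL if pos falls in a hole.
 * exts[] must be sorted by offset and must not overlap.
 */
static const struct demo_extent *
demo_lookup(const struct demo_extent *exts, size_t nr, uint64_t pos)
{
	size_t lo = 0, hi = nr;
	const struct demo_extent *e;

	assert(exts || nr == 0);

	/* Binary search for the first extent that starts after pos. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (exts[mid].offset <= pos)
			lo = mid + 1;
		else
			hi = mid;
	}
	if (lo == 0)
		return NULL;	/* pos precedes every extent */

	/* The candidate is the last extent starting at or before pos. */
	e = &exts[lo - 1];
	return (pos - e->offset < e->length) ? e : NULL;
}
```

The lookup itself is small; the hard part is that the table contents
are arbitrary data controlled by userspace, so they have to be stored
somewhere rather than computed.
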
> > > I was thinking the issue would be more that there might be other logic
> > > inside ->iomap_begin()/->iomap_end() besides the mapping stuff that
> > > would need to be done that would be too out-of-scope for bpf. But I
> > > think I need to read through the fuse4fs stuff to understand more what
> > > it's doing in those functions.
>
> Looking at fuse4fs logic cursorily, it seems doable? What I like about
> offloading this to bpf too is it would also then allow John's famfs to
> just go through your iomap plumbing as a use case of it instead of
> being an entirely separate thing. Though maybe there's some other
> reason for that that you guys have discussed prior. In any case, I'll
> ask this on John's main famfs patchset. It kind of seems to me that
> you guys are pretty much doing the exact same thing conceptually.

Yes, though John's famfs has the nice property that memory controller
interleaving is mathematically regular and likely makes for a compact
bpf program.

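For contrast, a sketch of what "mathematically regular" buys you: with
plain round-robin striping the mapping is pure arithmetic and needs no
table at all. This assumes a generic fixed-chunk interleave across
equal devices; the demo_* names and the layout are hypothetical and are
not famfs's actual on-media or ABI format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical striping geometry: fixed chunk size, round-robin devices. */
struct demo_stripe {
	uint64_t chunk_size;	/* bytes per chunk */
	uint32_t nr_devs;	/* number of interleaved devices */
};

struct demo_target {
	uint32_t dev;		/* index of the device holding pos */
	uint64_t dev_offset;	/* byte offset within that device */
	uint64_t length;	/* contiguous bytes left in this chunk */
};

/* Map a file offset to (device, device offset) with arithmetic alone. */
static struct demo_target
demo_interleave_map(const struct demo_stripe *s, uint64_t pos)
{
	uint64_t chunk, within;
	struct demo_target t;

	assert(s && s->chunk_size > 0 && s->nr_devs > 0);

	chunk = pos / s->chunk_size;	/* which chunk of the file */
	within = pos % s->chunk_size;	/* offset inside that chunk */

	t.dev = (uint32_t)(chunk % s->nr_devs);
	t.dev_offset = (chunk / s->nr_devs) * s->chunk_size + within;
	t.length = s->chunk_size - within;
	return t;
}
```
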
--D

> Thanks,
> Joanne
>
> >
> > <nod>
> >
> > --D
> >
> > >
> > > Thanks,
> > > Joanne
> > >
> > > >
> > > > OTOH it would be enormously hilarious to me if one could load a file
> > > > mapping predictive model into the kernel as a bpf program and use that
> > > > as a first tier before checking the in-memory btree mapping cache from
> > > > patchset 7. Quite a few years ago now there was a FAST paper
> > > > establishing that even a stupid linear regression model could in theory
> > > > beat a disk btree lookup.
> > > >
> > > > --D
> > > >
> > > > > Thanks,
> > > > > Joanne
> > > > >
> > > > > >
> > > > > > If you're going to start using this code, I strongly recommend pulling
> > > > > > from my git trees, which are linked below.
> > > > > >
> > > > > > This has been running on the djcloud for months with no problems. Enjoy!
> > > > > > Comments and questions are, as always, welcome.
> > > > > >
> > > > > > --D
> > > > > >
> > > > > > kernel git tree:
> > > > > > https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-fileio
> > > > > > ---
> > > > > > Commits in this patchset:
> > > > > > * fuse: implement the basic iomap mechanisms
> > > > > > * fuse_trace: implement the basic iomap mechanisms
> > > > > > * fuse: make debugging configurable at runtime
> > > > > > * fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
> > > > > > * fuse_trace: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
> > > > > > * fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount
> > > > > > * fuse: create a per-inode flag for toggling iomap
> > > > > > * fuse_trace: create a per-inode flag for toggling iomap
> > > > > > * fuse: isolate the other regular file IO paths from iomap
> > > > > > * fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
> > > > > > * fuse_trace: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
> > > > > > * fuse: implement direct IO with iomap
> > > > > > * fuse_trace: implement direct IO with iomap
> > > > > > * fuse: implement buffered IO with iomap
> > > > > > * fuse_trace: implement buffered IO with iomap
> > > > > > * fuse: implement large folios for iomap pagecache files
> > > > > > * fuse: use an unrestricted backing device with iomap pagecache io
> > > > > > * fuse: advertise support for iomap
> > > > > > * fuse: query filesystem geometry when using iomap
> > > > > > * fuse_trace: query filesystem geometry when using iomap
> > > > > > * fuse: implement fadvise for iomap files
> > > > > > * fuse: invalidate ranges of block devices being used for iomap
> > > > > > * fuse_trace: invalidate ranges of block devices being used for iomap
> > > > > > * fuse: implement inline data file IO via iomap
> > > > > > * fuse_trace: implement inline data file IO via iomap
> > > > > > * fuse: allow more statx fields
> > > > > > * fuse: support atomic writes with iomap
> > > > > > * fuse_trace: support atomic writes with iomap
> > > > > > * fuse: disable direct reclaim for any fuse server that uses iomap
> > > > > > * fuse: enable swapfile activation on iomap
> > > > > > * fuse: implement freeze and shutdowns for iomap filesystems
> > > > > > ---
> > > > > > fs/fuse/fuse_i.h | 161 +++
> > > > > > fs/fuse/fuse_trace.h | 939 +++++++++++++++++++
> > > > > > fs/fuse/iomap_i.h | 52 +
> > > > > > include/uapi/linux/fuse.h | 219 ++++
> > > > > > fs/fuse/Kconfig | 48 +
> > > > > > fs/fuse/Makefile | 1
> > > > > > fs/fuse/backing.c | 12
> > > > > > fs/fuse/dev.c | 30 +
> > > > > > fs/fuse/dir.c | 120 ++
> > > > > > fs/fuse/file.c | 133 ++-
> > > > > > fs/fuse/file_iomap.c | 2230 +++++++++++++++++++++++++++++++++++++++++++++
> > > > > > fs/fuse/inode.c | 162 +++
> > > > > > fs/fuse/iomode.c | 2
> > > > > > fs/fuse/trace.c | 2
> > > > > > 14 files changed, 4056 insertions(+), 55 deletions(-)
> > > > > > create mode 100644 fs/fuse/iomap_i.h
> > > > > > create mode 100644 fs/fuse/file_iomap.c
> > > > > >
> > > > >
>