Message-ID: <CAJnrk1aBGx_FQ=_F-PaPshVKvyecdZZt4C0+z+XvNm6=tL0Y_Q@mail.gmail.com>
Date: Wed, 28 Jan 2026 17:12:54 -0800
From: Joanne Koong <joannelkoong@...il.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: miklos@...redi.hu, bernd@...ernd.com, neal@...pa.dev,
linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCHSET v6 4/8] fuse: allow servers to use iomap for better
file IO performance
On Tue, Jan 27, 2026 at 4:34 PM Darrick J. Wong <djwong@...nel.org> wrote:
>
> On Tue, Jan 27, 2026 at 04:10:43PM -0800, Joanne Koong wrote:
> > On Tue, Jan 27, 2026 at 3:21 PM Darrick J. Wong <djwong@...nel.org> wrote:
> > >
> > > On Tue, Jan 27, 2026 at 11:47:31AM -0800, Joanne Koong wrote:
> > > > On Mon, Jan 26, 2026 at 6:22 PM Darrick J. Wong <djwong@...nel.org> wrote:
> > > > >
> > > > > On Mon, Jan 26, 2026 at 04:59:16PM -0800, Joanne Koong wrote:
> > > > > > On Tue, Oct 28, 2025 at 5:38 PM Darrick J. Wong <djwong@...nel.org> wrote:
> > > > > > >
> > > > > > > Hi all,
> > > > > > >
> > > > > > > This series connects fuse (the userspace filesystem layer) to fs-iomap
> > > > > > > to get fuse servers out of the business of handling file I/O themselves.
> > > > > > > By keeping the IO path mostly within the kernel, we can dramatically
> > > > > > > improve the speed of disk-based filesystems. This enables us to move
> > > > > > > all the filesystem metadata parsing code out of the kernel and into
> > > > > > > userspace, which means that we can containerize them for security
> > > > > > > without losing a lot of performance.
> > > > > >
> > > > > > I haven't looked through how the fuse2fs or fuse4fs servers are
> > > > > > implemented yet (also, could you explain the difference between the
> > > > > > two? Which one should we look at to see how it all ties together?),
> > > > >
> > > > > fuse4fs is a lowlevel fuse server; fuse2fs is a high(?) level fuse
> > > > > server. fuse4fs is the successor to fuse2fs, at least on Linux and BSD.
> > > >
> > > > Ah, I see, thanks for the explanation. In that case, I'll just look
> > > > at fuse4fs then.
> > > >
> > > > >
> > > > > > but I wonder if having bpf infrastructure hooked up to fuse would be
> > > > > > especially helpful for what you're doing here with fuse iomap. afaict,
> > > > > > every read/write, whether it's buffered or direct, will incur at least
> > > > > > one call to ->iomap_begin() to get the mapping metadata, which means 2
> > > > > > context-switches (and if the server has ->iomap_end() implemented,
> > > > > > then 2 more context-switches).
> > > > >
> > > > > Yes, I agree that's a lot of context switching for file IO...
> > > > >
> > > > > > But it seems like the logic for retrieving mapping
> > > > > > offsets/lengths/metadata should be pretty straightforward?
> > > > >
> > > > > ...but it gets very cheap if the fuse server can cache mappings in the
> > > > > kernel to avoid all that. That is, incidentally, what patchset #7
> > > > > implements.
> > > > >
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-cache_2026-01-22
> > > > >
> > > > > > If the extent lookups are table lookups or tree
> > > > > > traversals without complex side effects, then having
> > > > > > ->iomap_begin()/->iomap_end() be executed as a bpf program would avoid
> > > > > > the context switches and allow all the caching logic to be moved from
> > > > > > the kernel to the server-side (eg using bpf maps).
> > > > >
> > > > > Hrmm. Now that /is/ an interesting proposal. Does BPF have a data
> > > > > structure that supports interval mappings? I think the existing bpf map
> > > >
> > > > Not yet, but I don't see why a b+-tree-like data structure couldn't be
> > > > added. Maybe one workaround in the meantime would be to use a sorted
> > > > array map and do binary search on that, until interval mappings can
> > > > be natively supported?
> > >
> > > I guess, though I already had a C structure to borrow from xfs ;)
> > >
> > > > > only does key -> value. Also, is there an upper limit on the size of a
> > > > > map? You could have hundreds of millions of mappings for a very
> > > > > fragmented regular file.
> > > >
> > > > If I'm remembering correctly, there's an upper limit on the number of
> > > > map entries, which is bounded by u32.
> > >
> > > That's problematic, since files can have 64-bit logical block numbers.
> >
> > The key size supports 64-bits. The u32 bound would be the limit on the
> > number of extents for the file.
>
> Oh, ok. If one treats the incore map as a cache and evicts things when
> they get too old, then that would be fine. I misread that as an upper
> limit on the *range* of the map entry keys. :/
I think for more complicated servers, the bpf prog handling
iomap_begin() would essentially just serve as a cache: if the mapping
isn't found in the cache, the kernel sends off the FUSE_IOMAP_BEGIN
request to the server as usual. For servers that don't need much
complicated logic (eg famfs), the iomap_begin() logic would just be
executed within the bpf prog itself.
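To make that concrete, here's a minimal sketch of what the prog side
could look like. Everything fuse-specific here is hypothetical (the
struct_ops hook signature, the extent_key/cached_mapping layouts),
and a hash map keyed by a quantized logical block is just a stand-in
until something interval-shaped exists in bpf:

#include "vmlinux.h"
#include <errno.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* hypothetical: one entry per fixed-granularity logical block */
struct extent_key {
        __u64 nodeid;
        __u64 lblk;
};

/* hypothetical: whatever subset of the mapping the kernel needs */
struct cached_mapping {
        __u64 addr;
        __u64 length;
        __u32 type;
        __u32 flags;
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1 << 20);
        __type(key, struct extent_key);
        __type(value, struct cached_mapping);
} extent_cache SEC(".maps");

SEC("struct_ops/iomap_begin")
int BPF_PROG(fuse_iomap_begin_bpf, __u64 nodeid, __u64 pos, __u64 len,
             __u32 flags, struct cached_mapping *outarg)
{
        /* exact-match lookup on a 4k-quantized key; a real version
         * wants a proper interval lookup */
        struct extent_key key = {
                .nodeid = nodeid,
                .lblk = pos >> 12,
        };
        struct cached_mapping *m;

        m = bpf_map_lookup_elem(&extent_cache, &key);
        if (!m)
                return -EOPNOTSUPP; /* miss: fall back to the server */

        *outarg = *m; /* hit: no round trip to userspace */
        return 0;
}

char LICENSE[] SEC("license") = "GPL";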
>
> As it stands, I need to figure out a way to trim the iomap btree when
> memory gets tight. Right now it'll drop the cache whenever someone
> closes the file, but that won't help for long-life processes that open a
> heavily fragmented file and never close it.
>
> A coding-intensive way to do that would be to register a shrinker and
> deal with that, but ugh. A really stupid way would be to drop the whole
> cache once you get beyond (say) 64k of memory usage (~2000 mappings).
This kind of seems like another point in favor of giving userspace
control of the caching layer. They could then implement whatever
eviction policies they want.
It also allows them to prepopulate the cache upfront (eg when
servicing a file open request, if the file is below a certain size or
if the server knows what'll be hot, it could put those extents into
the map from the get-go).
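Something like this on the server side at open time (a sketch, using
the same hypothetical extent_key/cached_mapping layouts as above;
struct extent is a made-up stand-in for however the server tracks its
extents, and map_fd would come from the loaded bpf skeleton):

#include <stddef.h>
#include <stdint.h>
#include <bpf/bpf.h>

/* hypothetical extent record as the server tracks it */
struct extent {
        uint64_t lblk;
        uint64_t pblk;
        uint64_t len;
};

/* userspace server, while servicing FUSE_OPEN -- sketch only */
static void prepopulate_extent_cache(int map_fd, uint64_t nodeid,
                                     const struct extent *ext,
                                     size_t nr)
{
        for (size_t i = 0; i < nr; i++) {
                struct extent_key key = {
                        .nodeid = nodeid,
                        .lblk = ext[i].lblk,
                };
                struct cached_mapping val = {
                        .addr = ext[i].pblk,
                        .length = ext[i].len,
                };

                /* best effort: a full map just means more misses */
                bpf_map_update_elem(map_fd, &key, &val, BPF_ANY);
        }
}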
In my opinion, the fuse-iomap layer should try to be as simple/minimal
and as generic as possible. I haven't read through iomap_cache.c yet
but the header comment suggests it's adapted from the xfs extent tree
cache. As I understand it, different filesystem implementations have
different caching architectures that are better suited for their use
cases (I'm guessing that's the case, otherwise there would just be one
general cache inside iomap that all the filesystems would use?). It
seems a lot better to me to just let the userspace server define that
itself. And selfishly, from the fuse perspective, it would be less
code we would have to maintain. And if some servers don't need caching
at all (like famfs?), they could avoid that overhead entirely.
>
> > > > > At one point I suggested to the famfs maintainer that it might be
> > > > > easier/better to implement the interleaved mapping lookups as bpf
> > > > > programs instead of being stuck with a fixed format in the fuse
> > > > > userspace abi, but I don't know if he ever implemented that.
> > > >
> > > > This seems like a good use case for it too
> > > > >
> > > > > > Is this your
> > > > > > assessment of it as well or do you think the server-side logic for
> > > > > > iomap_begin()/iomap_end() is too complicated to make this realistic?
> > > > > > Asking because I'm curious whether this direction makes sense, not
> > > > > > because I think it would be a blocker for your series.
> > > > >
> > > > > For disk-based filesystems I think it would be difficult to model a bpf
> > > > > program to do mappings, since they can basically point anywhere and be
> > > > > of any size.
> > > >
> > > > Hmm I'm not familiar enough with disk-based filesystems to know what
> > > > the "point anywhere and be of any size" means. For the mapping stuff,
> > > > doesn't it just point to a block number? Or are you saying the problem
> > > > would be there's too many mappings since a mapping could be any size?
> > >
> > > The second -- mappings can be any size, and unprivileged userspace can
> > > control the mappings.
> >
> > If I'm understanding what you're saying here, this is the same
> > discussion as the one above about the u32 bound, correct?
>
> A different thing -- file data mappings are irregularly sized, can
> contain sparse holes, etc. Userspace controls the size and offset of
> each mapping record (thanks to magic things like fallocate) so it'd be
> very difficult to create a bpf program to generate mappings on the fly.
Would the bpf prog have to generate mappings on the fly though? If
userspace does things like fallocate, those operations would still go
through to the server as a regular request (eg FUSE_FALLOCATE), and on
the server side, it'd update the map dynamically from userspace.
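Roughly (same hypothetical map and key layout as the sketches above):

/* in the server's FUSE_FALLOCATE handler, after the on-disk
 * metadata has been updated -- sketch only */
for (uint64_t lblk = start_lblk; lblk < end_lblk; lblk++) {
        struct extent_key key = { .nodeid = nodeid, .lblk = lblk };

        /* drop any stale cached mapping; the next miss repopulates */
        bpf_map_delete_elem(map_fd, &key);
}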
>
> Also you could have 2^33 mapping records for a file, so I think you
> can't even write a bpf program that large.
I think this depends on what map structure gets used. If native
support is added for b+-tree-like data structures, I don't see why
that many records couldn't be handled.
>
> > > > I was thinking the issue would be more that there might be other logic
> > > > inside ->iomap_begin()/->iomap_end() besides the mapping stuff that
> > > > would be too out-of-scope for bpf. But I think I need to read through
> > > > the fuse4fs stuff to understand more of what it's doing in those
> > > > functions.
> >
> > Looking at the fuse4fs logic cursorily, it seems doable? What I like
> > about offloading this to bpf is that it would also allow John's famfs
> > to just go through your iomap plumbing as a use case of it, instead of
> > being an entirely separate thing. Though maybe there's some other
> > reason for that which you two have discussed previously. In any case,
> > I'll ask this on John's main famfs patchset. It kind of seems to me
> > that you guys are pretty much doing the exact same thing conceptually.
>
> Yes, though John's famfs has the nice property that memory controller
> interleaving is mathematically regular and likely makes for a compact
> bpf program.
I tried out integrating the bpf hooks into fuse for iomap_begin() just
to see if it was realistic, and it seems relatively straightforward so
far (though maybe the devil is in the details...). I used the
drivers/hid/bpf/hid_bpf_struct_ops.c code as a model for how to set up
the fuse bpf struct ops on the kernel side. Calling it from
file_iomap.c looks something like:
static int fuse_iomap_begin(...)
{
        ...
        struct fuse_bpf_ops *bpf_ops = fuse_get_bpf_ops();
        ...
        err = -EOPNOTSUPP;
        if (bpf_ops && bpf_ops->iomap_begin)
                err = bpf_ops->iomap_begin(inode, pos, len, flags, &outarg);
        /* no bpf prog loaded, or it couldn't answer: ask the server */
        if (err)
                err = fuse_simple_request(fm, &args);
        ...
}
I was able to verify that iomap_begin() returns populated outarg
fields from the bpf prog. If we were to actually implement it, I'm
sure it'd be more complicated (eg we'd need to make the fuse_bpf_ops
registered per-connection, etc), but on the whole it seems doable. My
worry is that if we land the iomap cache patchset now, then we can't
remove it in the future without breaking backwards compatibility,
since removing it would be a performance regression (though maybe we
can, since the fuse-iomap stuff is experimental?), so imo it'd be
great if we figured out what direction we want to go before landing
the cache stuff. And I think we need to have this conversation on the
main famfs patchset too (eg whether it should go through your general
iomap plumbing with bpf helpers vs. being a separate implementation),
since once that lands, it'd be irrevocable.
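For the per-connection registration, one plausible route is the
hid-bpf pattern, where the program is scoped to a specific object by
writing an id into the struct_ops map before loading it. A rough
sketch (fuse_bpf_ops, the connection_id field, and the skeleton name
are all hypothetical here):

/* bpf side: tie the prog into the hypothetical fuse_bpf_ops */
SEC(".struct_ops.link")
struct fuse_bpf_ops fuse_iomap_ops = {
        .iomap_begin = (void *)fuse_iomap_begin_bpf,
};

/* userspace side, once per fuse connection */
struct fuse_iomap_bpf *skel = fuse_iomap_bpf__open();

/* hypothetical: scope the ops to this connection, the way hid-bpf
 * scopes its ops to a hid device via hid_id */
skel->struct_ops.fuse_iomap_ops->connection_id = conn_id;

fuse_iomap_bpf__load(skel);
struct bpf_link *link =
        bpf_map__attach_struct_ops(skel->maps.fuse_iomap_ops);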
Thanks,
Joanne
>
> --D
>
> > Thanks,
> > Joanne
> >
> > >
> > > <nod>
> > >
> > > --D
> > >
> > > >
> > > > Thanks,
> > > > Joanne
> > > >
> > > > >
> > > > > OTOH it would be enormously hilarious to me if one could load a file
> > > > > mapping predictive model into the kernel as a bpf program and use that
> > > > > as a first tier before checking the in-memory btree mapping cache from
> > > > > patchset 7. Quite a few years ago now there was a FAST paper
> > > > > establishing that even a stupid linear regression model could in theory
> > > > > beat a disk btree lookup.
> > > > >
> > > > > --D
> > > > >
> > > > > > Thanks,
> > > > > > Joanne
> > > > > >
> > > > > > >
> > > > > > > If you're going to start using this code, I strongly recommend pulling
> > > > > > > from my git trees, which are linked below.
> > > > > > >
> > > > > > > This has been running on the djcloud for months with no problems. Enjoy!
> > > > > > > Comments and questions are, as always, welcome.
> > > > > > >
> > > > > > > --D
> > > > > > >
> > > > > > > kernel git tree:
> > > > > > > https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-fileio
> > > > > > > ---
> > > > > > > Commits in this patchset:
> > > > > > > * fuse: implement the basic iomap mechanisms
> > > > > > > * fuse_trace: implement the basic iomap mechanisms
> > > > > > > * fuse: make debugging configurable at runtime
> > > > > > > * fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
> > > > > > > * fuse_trace: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
> > > > > > > * fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount
> > > > > > > * fuse: create a per-inode flag for toggling iomap
> > > > > > > * fuse_trace: create a per-inode flag for toggling iomap
> > > > > > > * fuse: isolate the other regular file IO paths from iomap
> > > > > > > * fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
> > > > > > > * fuse_trace: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
> > > > > > > * fuse: implement direct IO with iomap
> > > > > > > * fuse_trace: implement direct IO with iomap
> > > > > > > * fuse: implement buffered IO with iomap
> > > > > > > * fuse_trace: implement buffered IO with iomap
> > > > > > > * fuse: implement large folios for iomap pagecache files
> > > > > > > * fuse: use an unrestricted backing device with iomap pagecache io
> > > > > > > * fuse: advertise support for iomap
> > > > > > > * fuse: query filesystem geometry when using iomap
> > > > > > > * fuse_trace: query filesystem geometry when using iomap
> > > > > > > * fuse: implement fadvise for iomap files
> > > > > > > * fuse: invalidate ranges of block devices being used for iomap
> > > > > > > * fuse_trace: invalidate ranges of block devices being used for iomap
> > > > > > > * fuse: implement inline data file IO via iomap
> > > > > > > * fuse_trace: implement inline data file IO via iomap
> > > > > > > * fuse: allow more statx fields
> > > > > > > * fuse: support atomic writes with iomap
> > > > > > > * fuse_trace: support atomic writes with iomap
> > > > > > > * fuse: disable direct reclaim for any fuse server that uses iomap
> > > > > > > * fuse: enable swapfile activation on iomap
> > > > > > > * fuse: implement freeze and shutdowns for iomap filesystems
> > > > > > > ---
> > > > > > > fs/fuse/fuse_i.h | 161 +++
> > > > > > > fs/fuse/fuse_trace.h | 939 +++++++++++++++++++
> > > > > > > fs/fuse/iomap_i.h | 52 +
> > > > > > > include/uapi/linux/fuse.h | 219 ++++
> > > > > > > fs/fuse/Kconfig | 48 +
> > > > > > > fs/fuse/Makefile | 1
> > > > > > > fs/fuse/backing.c | 12
> > > > > > > fs/fuse/dev.c | 30 +
> > > > > > > fs/fuse/dir.c | 120 ++
> > > > > > > fs/fuse/file.c | 133 ++-
> > > > > > > fs/fuse/file_iomap.c | 2230 +++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > fs/fuse/inode.c | 162 +++
> > > > > > > fs/fuse/iomode.c | 2
> > > > > > > fs/fuse/trace.c | 2
> > > > > > > 14 files changed, 4056 insertions(+), 55 deletions(-)
> > > > > > > create mode 100644 fs/fuse/iomap_i.h
> > > > > > > create mode 100644 fs/fuse/file_iomap.c
> > > > > > >
> > > > > >
> >