Message-ID: <20260129224122.GB7686@frogsfrogsfrogs>
Date: Thu, 29 Jan 2026 14:41:22 -0800
From: "Darrick J. Wong" <djwong@...nel.org>
To: Joanne Koong <joannelkoong@...il.com>
Cc: miklos@...redi.hu, bernd@...ernd.com, neal@...pa.dev,
linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCHSET v6 4/8] fuse: allow servers to use iomap for better
file IO performance
On Thu, Jan 29, 2026 at 12:02:54PM -0800, Darrick J. Wong wrote:
> On Wed, Jan 28, 2026 at 05:12:54PM -0800, Joanne Koong wrote:
>
> <snip>
>
> > > > > > > Hrmm. Now that /is/ an interesting proposal. Does BPF have a data
> > > > > > > structure that supports interval mappings? I think the existing bpf map
> > > > > >
> > > > > > Not yet, but I don't see why a B+ tree-like data structure couldn't
> > > > > > be added. Maybe a workaround in the meantime would be to use a sorted
> > > > > > array map and do binary search on that, until interval mappings can
> > > > > > be natively supported?
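> > > > > >
> > > > > > (Purely a sketch of the lookup side, with made-up types, and a fixed
> > > > > > iteration bound so the verifier could prove termination:)
> > > > > >
> > > > > > struct ext_rec { __u64 lblk, len, pblk; };    /* hypothetical */
> > > > > >
> > > > > > /* binary search a sorted, non-overlapping extent array for lblk */
> > > > > > static struct ext_rec *ext_lookup(struct ext_rec *recs, __u32 nr,
> > > > > >                                   __u64 lblk)
> > > > > > {
> > > > > >         __u32 lo = 0, hi = nr;
> > > > > >
> > > > > >         /* 32 iterations covers any u32-indexed array */
> > > > > >         for (int i = 0; i < 32 && lo < hi; i++) {
> > > > > >                 __u32 mid = lo + (hi - lo) / 2;
> > > > > >
> > > > > >                 if (lblk < recs[mid].lblk)
> > > > > >                         hi = mid;
> > > > > >                 else if (lblk >= recs[mid].lblk + recs[mid].len)
> > > > > >                         lo = mid + 1;
> > > > > >                 else
> > > > > >                         return &recs[mid]; /* extent covers lblk */
> > > > > >         }
> > > > > >         return NULL;    /* hole */
> > > > > > }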
> > > > >
> > > > > I guess, though I already had a C structure to borrow from xfs ;)
> > > > >
> > > > > > > only does key -> value. Also, is there an upper limit on the size of a
> > > > > > > map? You could have hundreds of millions of maps for a very fragmented
> > > > > > > regular file.
> > > > > >
> > > > > > If I'm remembering correctly, there's an upper limit on the number of
> > > > > > map entries, which is bounded by a u32.
> > > > >
> > > > > That's problematic, since files can have 64-bit logical block numbers.
> > > >
> > > > > The key size supports 64 bits. The u32 bound would be the limit on the
> > > > number of extents for the file.
> > >
> > > Oh, ok. If one treats the incore map as a cache and evicts things when
> > > they get too old, then that would be fine. I misread that as an upper
> > > limit on the *range* of the map entry keys. :/
> >
> > I think for more complicated servers, the bpf prog handling for
> > iomap_begin() would essentially just serve as a cache: if the mapping
> > isn't found in the cache, it sends the FUSE_IOMAP_BEGIN request off
> > to the server. For servers that don't need such complicated logic
> > (eg famfs), the iomap_begin() logic would just be executed within the
> > bpf prog itself.
>
> Yes, I like the fuse_iomap_begin logic flow of:
>
> 1. Try to use a mapping in the iext tree
> 2. Call a BPF program to try to generate a mapping
> 3. Issue a fuse command to userspace
>
> wherein #2 and #3 can signal that #1 should be retried. (This is
> already provided by FUSE_IOMAP_TYPE_RETRY_CACHE, FWIW)
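>
> (As a sketch of that flow -- fuse_iomap_bpf_begin is a made-up name
> here; the cache lookup, fuse_simple_request, and RETRY_CACHE are from
> the patchset:)
>
> retry:
>         /* 1. cheap lookup in the incore iext mapping cache */
>         if (fuse_iomap_cache_lookup(inode, pos, len, &outarg))
>                 return 0;
>
>         /* 2. let an attached bpf program try to generate a mapping */
>         err = fuse_iomap_bpf_begin(inode, pos, len, flags, &outarg);
>         if (err == -EOPNOTSUPP)
>                 /* 3. last resort: round trip to the fuse server */
>                 err = fuse_simple_request(fm, &args);
>         if (err)
>                 return err;
>
>         /* #2 and #3 can push a mapping and ask that #1 be retried */
>         if (outarg.type == FUSE_IOMAP_TYPE_RETRY_CACHE)
>                 goto retry;
>         return 0;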
>
> That said, BPF doesn't expose an interval btree data structure. I think
> it would be better to add the iext mapping cache and make it so that bpf
> programs could call fuse_iomap_cache_{upsert,remove,lookup}. You could
> use the interval tree too, but the iext tree has the advantage of higher
> fanout factor.
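>
> (Sketching what the bpf side of that might look like -- only the
> function names come from the paragraph above; the signatures and
> struct fuse_iomap are my guesses:)
>
> /* kfuncs exported by fuse for mapping-cache maintenance */
> extern int fuse_iomap_cache_lookup(__u64 nodeid, __u64 pos, __u64 len,
>                                    struct fuse_iomap *out) __ksym;
> extern int fuse_iomap_cache_upsert(__u64 nodeid,
>                                    const struct fuse_iomap *map) __ksym;
> extern int fuse_iomap_cache_remove(__u64 nodeid, __u64 pos,
>                                    __u64 len) __ksym;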
>
> > > As it stands, I need to figure out a way to trim the iomap btree when
> > > memory gets tight. Right now it'll drop the cache whenever someone
> > > closes the file, but that won't help for long-lived processes that open a
> > > heavily fragmented file and never close it.
> > >
> > > A coding-intensive way to do that would be to register a shrinker and
> > > deal with that, but ugh. A really stupid way would be to drop the whole
> > > cache once you get beyond (say) 64k of memory usage (~2000 mappings).
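> > >
> > > (64k divided by ~2000 mappings comes to ~32 bytes per cached mapping,
> > > which I assume includes the btree node overhead.)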
> >
> > This kind of seems like another point in favor of giving userspace
> > control of the caching layer. They could then implement whatever
> > eviction policies they want.
>
> Note that userspace already can control the cached iomappings --
> FUSE_NOTIFY_IOMAP_UPSERT pushes a mapping into the iext tree, and
> FUSE_NOTIFY_IOMAP_INVAL removes them. The fuse server can decide to
> evict whenever it pleases, though admittedly the iext tree doesn't track
> usage information of any kind, so how would the fuse server know?
>
> The static limit is merely the kernel's means to establish a hard limit
> on the memory consumption of the iext tree, since it can't trust
> userspace completely.
>
> > It also allows them to prepopulate the cache upfront (eg when
> > servicing a file open request, if the file is below a certain size or
> > if the server knows what'll be hot, it could put those extents into
> > the map from the get-go).
>
> Hrm. I haven't tried issuing FUSE_NOTIFY_IOMAP_UPSERT during an open
> call, but I suppose it's possible.
>
> > In my opinion, the fuse-iomap layer should try to be as simple/minimal
> > and as generic as possible. I haven't read through iomap_cache.c yet
> > but the header comment suggests it's adapted from the xfs extent tree
>
> Rudely copied, not adapted ;)
>
> I actually wonder if I should make a horrible macro to generate the
> fuse_iext_* structures and functions, and then xfs_iext_tree.c and
> fuse_iomap_cache.c can "share" that hairba^Wcode.
I tried templatizing this with cpp macros and very rapidly lost all
sanity. :(
--D
> > cache. As I understand it, different filesystem implementations have
> > different caching architectures that are better suited for their use
> > cases
>
> Err. The way this evolved is ... way too long to go into in this email.
> Here's a truncated version; I can tell you the full story next week.
>
> Most filesystems store their file mapping data on disk in whatever
> format the designers specified. When the pagecache asks them to read
> or write cached file data, they attach buffer heads to the folio and
> fill out the buffer heads with the minimum mapping information needed
> to map the folios to disk addresses. bios are then constructed for each
> folio based on what's in the bufferheads.
>
> This was fine for filesystems that map each block individually, such as
> FFS/ext2/ext3/fat...
>
> > (I'm guessing that's the case, otherwise there would just be one
> > general cache inside iomap that all the filesystems would use?). It seems a
>
> ...but newer filesystems such as xfs/ext4/btrfs map a bunch of blocks at
> a time. Each of them invented their own private incore mapping
> structures to mirror the ondisk structure. xfs kept using the old
> bufferheads into the early 2010s, ext4 is still using them, and btrfs
> went its own way from the start.
>
> Eventually XFS grew its own internal extent-to-bio mapping code that
> flipped the model -- rather than get a pagecache folio, map the folio to
> blocks, and issue IOs based on the blocks, it would get the file
> mapping, grab folios for the whole mapping, and issue bios for the batch
> of folios. That's more efficient, but at this point we have a legacy
> codebase problem for everything else in fs/.
>
> In 2019, hch and I decided to export the extent-to-bio mapping code from
> xfs so that new filesystems could start with something cleaner than
> bufferheads. In the past 7 years, nobody's added a new filesystem with
> complex mapping requirements; they've only ported existing filesystems
> to it, without further refactoring of their incore data structures.
> That's why there's no generic iomap cache.
>
> > lot better to me to just let the userspace server define that
> > themselves. And selfishly from the fuse perspective, would be less
>
> Well if I turned the iext code into a template then fuse would only need
> enough glue code to declare a template class and use it. The glue part
> is only ... 230 LOC.
>
> > code we would have to maintain. And I guess too if some servers don't
> > need caching (like famfs?), they could avoid that overhead.
>
> Hrm. Right now the struct fuse_iomap_cache is embedded in struct
> fuse_inode, but that could be turned into a dynamic allocation.
>
> > > > > > > At one point I suggested to the famfs maintainer that it might be
> > > > > > > easier/better to implement the interleaved mapping lookups as bpf
> > > > > > > programs instead of being stuck with a fixed format in the fuse
> > > > > > > userspace abi, but I don't know if he ever implemented that.
> > > > > >
> > > > > > This seems like a good use case for it too
> > > > > > >
> > > > > > > > Is this your
> > > > > > > > assessment of it as well or do you think the server-side logic for
> > > > > > > > iomap_begin()/iomap_end() is too complicated to make this realistic?
> > > > > > > > Asking because I'm curious whether this direction makes sense, not
> > > > > > > > because I think it would be a blocker for your series.
> > > > > > >
> > > > > > > For disk-based filesystems I think it would be difficult to model a bpf
> > > > > > > program to do mappings, since they can basically point anywhere and be
> > > > > > > of any size.
> > > > > >
> > > > > > Hmm, I'm not familiar enough with disk-based filesystems to know what
> > > > > > "point anywhere and be of any size" means. For the mapping stuff,
> > > > > > doesn't it just point to a block number? Or are you saying the problem
> > > > > > would be that there are too many mappings, since a mapping could be any
> > > > > > size?
> > > > >
> > > > > The second -- mappings can be any size, and unprivileged userspace can
> > > > > control the mappings.
> > > >
> > > > If I'm understanding what you're saying here, this is the same
> > > > discussion as the one above about the u32 bound, correct?
> > >
> > > A different thing -- file data mappings are irregularly sized, can
> > > contain sparse holes, etc. Userspace controls the size and offset of
> > > each mapping record (thanks to magic things like fallocate) so it'd be
> > > very difficult to create a bpf program to generate mappings on the fly.
> >
> > Would the bpf prog have to generate mappings on the fly though? If the
> > userspace does things like fallocate, those operations would still go
> > through to the server as a regular request (eg FUSE_FALLOCATE) and on
> > the server side, it'd add that to the map dynamically from userspace.
>
> That depends on the fuse server design. For simple things like famfs
> where the layout is bog simple and there are no fancy features like
> delayed allocation or unwritten extents, you could probably get away
> with a BPF program that generates the entire mapping set. I suspect an
> object-store type filesystem (aka write a file once, close it, snapshot
> it, and never change it again) might be good at landing all the file
> data in relatively few extent mappings, and it could actually compile a
> custom bpf program for that file and push it to the kernel.
>
> > > Also you could have 2^33 mapping records for a file, so I think you
> > > can't even write a bpf program that large.
> >
> > I think this depends on what map structure gets used. If native
> > support is added for B+ tree-like data structures, I don't see why it
> > wouldn't be able to.
>
> <nod>
>
> > > > > > I was thinking the issue would be more that there might be other logic
> > > > > > inside ->iomap_begin()/->iomap_end(), besides the mapping stuff, that
> > > > > > would need to be done but would be too out-of-scope for bpf. But I
> > > > > > think I need to read through the fuse4fs stuff to understand more of
> > > > > > what it's doing in those functions.
> > > >
> > > > Looking at the fuse4fs logic cursorily, it seems doable? What I like about
> > > > offloading this to bpf too is it would also then allow John's famfs to
> > > > just go through your iomap plumbing as a use case of it instead of
> > > > being an entirely separate thing. Though maybe there's some other
> > > > reason for that which you guys have discussed previously. In any case, I'll
> > > > ask this on John's main famfs patchset. It kind of seems to me that
> > > > you guys are pretty much doing the exact same thing conceptually.
> > >
> > > Yes, though John's famfs has the nice property that memory controller
> > > interleaving is mathematically regular and likely makes for a compact
> > > bpf program.
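> > >
> > > (To illustrate -- assuming a simple nways-way interleave with a fixed
> > > chunk size, the whole mapping function is a few lines of arithmetic
> > > and needs no lookup structure at all; everything below is made up:)
> > >
> > > /* map a file offset to (device, offset-within-device) */
> > > static void famfs_interleave(__u64 pos, __u32 nways, __u64 chunk,
> > >                              __u32 *dev, __u64 *devoff)
> > > {
> > >         __u64 stripe = pos / chunk;     /* which chunk, globally */
> > >
> > >         *dev = stripe % nways;          /* device in the rotation */
> > >         *devoff = (stripe / nways) * chunk + pos % chunk;
> > > }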
> >
> > I tried out integrating the bpf hooks into fuse for iomap_begin() just
> > to see if it was realistic and it seems relatively straightforward so
> > far (though maybe the devil is in the details...). I used the
>
> Ok, now *that's* interesting! I guess I had better push the latest
> fuse-iomap code ... but I cannot share a link, because I cannot get
> through the @!#%%!!! kernel.org anubis bullcrap.
>
> So I generated a pull request and I *think* this munged URL will work
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-service-container_2026-01-29
>
> Or I guess you could just git-pull this:
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git tags/fuse-service-container_2026-01-29
>
> > drivers/hid/bpf/hid_bpf_struct_ops.c program as a model for how to set
> > up the fuse bpf struct ops on the kernel side. Calling it from
> > file_iomap.c looks something like
> >
> > static int fuse_iomap_begin(...)
> > {
> >         ...
> >         struct fuse_bpf_ops *bpf_ops = fuse_get_bpf_ops();
> >         ...
> >         /* let an attached bpf prog try to fill out outarg first */
> >         err = -EOPNOTSUPP;
> >         if (bpf_ops && bpf_ops->iomap_begin)
> >                 err = bpf_ops->iomap_begin(inode, pos, len, flags, &outarg);
> >         /* on miss or error, fall back to a server round trip */
> >         if (err)
> >                 err = fuse_simple_request(fm, &args);
> >         ...
> > }
>
> I'm curious what the rest of the bpf integration code looks like.
>
> > and I was able to verify that iomap_begin() is able to return
> > populated outarg fields from the bpf prog. If we were to actually
> > implement it I'm sure it'd be more complicated (eg we'd need to make
> > the fuse_bpf_ops registered per-connection, etc) but on the whole it
>
> What is a fuse_bpf_ops? I'm assuming that's the attachment point for a
> bpf program that the fuse server would compile? In which case, yes, I
> think that ought to be per-connection.
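>
> (Guessing at what that might look like, modeled on the hid struct_ops
> -- every name in here is hypothetical:)
>
> /* ops table that a fuse server's bpf program would implement */
> struct fuse_bpf_ops {
>         int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t len,
>                            unsigned int opflags,
>                            struct fuse_iomap_out *outarg);
>         int (*iomap_end)(struct inode *inode, loff_t pos, loff_t len,
>                          ssize_t written, unsigned int opflags);
> };
> /* registered kernel-side via register_bpf_struct_ops(), as in
>  * drivers/hid/bpf/hid_bpf_struct_ops.c */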
>
> So the bpf program can examine the struct inode and the pos/len/opflags
> fields; and from that information it has to write the appropriate fields
> in &outarg? That's new, I didn't think bpf was allowed to write to
> kernel memory. But it's been a few years since I last touched the bpf
> internals.
>
> Some bpf programs might just know how to fill out outarg on their own
> (e.g. famfs memory interleaving) but other bpf programs might perform a
> range query on some imaginary bpf-interval-tree wherein you can do a
> fast lookup based on (inumber, pos, len)?
>
> I guess that's an interesting question -- would each fuse connection
> have one big bpf-interval-tree? Or would you shard things by inode to
> reduce contention? And if you sharded like that, then would you need a
> fuse_bpf_ops per inode?
>
> (I'm imagining that the fuse_bpf_ops might be where you'd stash the root
> of the bpf data structure, but I know nothing of bpf internals ;))
>
> Rolling on: how easy is it for a userspace program to compile and upload
> bpf programs into the kernel? I've played around with bcc enough to
> write some fairly stupid latency tracing tools for XFS, but at the end
> of the day it's still python scripts feeding a string full of maybe-C
> into whatever the BPF machinery does under the hood.
>
> I /think/ it calls clang on the provided text, links that against the
> current kernel's header files, and pushes the compiled bpf binary into
> the kernel, right? So fuse4fs would have to learn how to do that; and
> now fuse4fs has a runtime dependency on libllvm.
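>
> (Though as I understand it, libbpf lets you compile the bpf program
> ahead of time with clang and just load the object file at runtime, so
> no libllvm in the server. A sketch, assuming a prebuilt
> "fuse_iomap.bpf.o" containing a struct_ops map named "fuse_ops":)
>
> #include <err.h>
> #include <bpf/libbpf.h>
>
> static struct bpf_link *load_fuse_bpf(void)
> {
>         struct bpf_object *obj;
>         struct bpf_link *link;
>
>         obj = bpf_object__open_file("fuse_iomap.bpf.o", NULL);
>         if (!obj)
>                 err(1, "open bpf object");
>         if (bpf_object__load(obj))
>                 err(1, "load bpf object");
>
>         /* attach the struct_ops map declared in the bpf program */
>         link = bpf_map__attach_struct_ops(
>                         bpf_object__find_map_by_name(obj, "fuse_ops"));
>         if (!link)
>                 err(1, "attach struct_ops");
>         return link;
> }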
>
> And while I'm on the topic of fuse-bpf uapi: It's ok for us to expose
> primitive-typed variables (pos/len/opflags) and existing fuse uapi
> directly to a bpf program, but I don't think we should expose struct
> inode/fuse_inode. Maybe just fuse_inode::nodeid? If we're careful not
> to allow #include'ing structured types in the fuse bpf code, then
> perhaps the bpf programs could be compiled at the same time as the fuse
> server.
>
> > seems doable. My worry is that if we land the iomap cache patchset now,
> > then we can't remove it in the future without breaking backwards
> > compatibility in the form of a performance regression (though maybe we
> > can, since the fuse-iomap stuff is experimental?), so imo it'd be great if
>
> I don't think it's a huge problem to remove functionality while the
> EXPERIMENTAL warnings are in place. We'd forever lose the command codes
> for FUSE_NOTIFY_IOMAP_UPSERT and FUSE_NOTIFY_IOMAP_INVAL, but we've only
> used 12 out of INT_MAX so that's not likely to be a concern.
>
> > we figured out what direction we want to go before landing the cache
> > stuff. And I think we need to have this conversation too on the main
> > famfs patchset (eg whether it should go through your general iomap
> > plumbing with bpf helpers vs. being a separate implementation) since
> > once that lands, it'd be irrevocable.
>
> I'm of two minds on that -- John got here first, so I don't want to
> delay his patchset whilst I slowly work on this thing. OTOH from an
> architecture standpoint we probably ought to push for three ways for a
> fuse server to upload mappings:
>
> 1. Upserting mappings with arbitrary offset and size into a cache
> 2. Self-contained bpf program that can generate any mapping
> 3. Sprawling bpf program that can read any other artifacts that another
> bpf program might have set up for it
>
> But yeah, let's involve John.
>
> --D
>
> >
> > Thanks,
> > Joanne
> > >
> > > --D
> > >
> > > > Thanks,
> > > > Joanne
> > > >
> > > > >
> > > > > <nod>
> > > > >
> > > > > --D
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Joanne
> > > > > >
> > > > > > >
> > > > > > > OTOH it would be enormously hilarious to me if one could load a file
> > > > > > > mapping predictive model into the kernel as a bpf program and use that
> > > > > > > as a first tier before checking the in-memory btree mapping cache from
> > > > > > > patchset 7. Quite a few years ago now there was a FAST paper
> > > > > > > establishing that even a stupid linear regression model could in theory
> > > > > > > beat a disk btree lookup.
> > > > > > >
> > > > > > > --D
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Joanne
> > > > > > > >
> > > > > > > > >
> > > > > > > > > If you're going to start using this code, I strongly recommend pulling
> > > > > > > > > from my git trees, which are linked below.
> > > > > > > > >
> > > > > > > > > This has been running on the djcloud for months with no problems. Enjoy!
> > > > > > > > > Comments and questions are, as always, welcome.
> > > > > > > > >
> > > > > > > > > --D
> > > > > > > > >
> > > > > > > > > kernel git tree:
> > > > > > > > > https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-fileio
> > > > > > > > > ---
> > > > > > > > > Commits in this patchset:
> > > > > > > > > * fuse: implement the basic iomap mechanisms
> > > > > > > > > * fuse_trace: implement the basic iomap mechanisms
> > > > > > > > > * fuse: make debugging configurable at runtime
> > > > > > > > > * fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
> > > > > > > > > * fuse_trace: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
> > > > > > > > > * fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount
> > > > > > > > > * fuse: create a per-inode flag for toggling iomap
> > > > > > > > > * fuse_trace: create a per-inode flag for toggling iomap
> > > > > > > > > * fuse: isolate the other regular file IO paths from iomap
> > > > > > > > > * fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
> > > > > > > > > * fuse_trace: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
> > > > > > > > > * fuse: implement direct IO with iomap
> > > > > > > > > * fuse_trace: implement direct IO with iomap
> > > > > > > > > * fuse: implement buffered IO with iomap
> > > > > > > > > * fuse_trace: implement buffered IO with iomap
> > > > > > > > > * fuse: implement large folios for iomap pagecache files
> > > > > > > > > * fuse: use an unrestricted backing device with iomap pagecache io
> > > > > > > > > * fuse: advertise support for iomap
> > > > > > > > > * fuse: query filesystem geometry when using iomap
> > > > > > > > > * fuse_trace: query filesystem geometry when using iomap
> > > > > > > > > * fuse: implement fadvise for iomap files
> > > > > > > > > * fuse: invalidate ranges of block devices being used for iomap
> > > > > > > > > * fuse_trace: invalidate ranges of block devices being used for iomap
> > > > > > > > > * fuse: implement inline data file IO via iomap
> > > > > > > > > * fuse_trace: implement inline data file IO via iomap
> > > > > > > > > * fuse: allow more statx fields
> > > > > > > > > * fuse: support atomic writes with iomap
> > > > > > > > > * fuse_trace: support atomic writes with iomap
> > > > > > > > > * fuse: disable direct reclaim for any fuse server that uses iomap
> > > > > > > > > * fuse: enable swapfile activation on iomap
> > > > > > > > > * fuse: implement freeze and shutdowns for iomap filesystems
> > > > > > > > > ---
> > > > > > > > > fs/fuse/fuse_i.h | 161 +++
> > > > > > > > > fs/fuse/fuse_trace.h | 939 +++++++++++++++++++
> > > > > > > > > fs/fuse/iomap_i.h | 52 +
> > > > > > > > > include/uapi/linux/fuse.h | 219 ++++
> > > > > > > > > fs/fuse/Kconfig | 48 +
> > > > > > > > > fs/fuse/Makefile | 1
> > > > > > > > > fs/fuse/backing.c | 12
> > > > > > > > > fs/fuse/dev.c | 30 +
> > > > > > > > > fs/fuse/dir.c | 120 ++
> > > > > > > > > fs/fuse/file.c | 133 ++-
> > > > > > > > > fs/fuse/file_iomap.c | 2230 +++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > > fs/fuse/inode.c | 162 +++
> > > > > > > > > fs/fuse/iomode.c | 2
> > > > > > > > > fs/fuse/trace.c | 2
> > > > > > > > > 14 files changed, 4056 insertions(+), 55 deletions(-)
> > > > > > > > > create mode 100644 fs/fuse/iomap_i.h
> > > > > > > > > create mode 100644 fs/fuse/file_iomap.c
> > > > > > > > >
> > > > > > > >
> > > >
> >
>