Message-ID: <20260129200254.GA7686@frogsfrogsfrogs>
Date: Thu, 29 Jan 2026 12:02:54 -0800
From: "Darrick J. Wong" <djwong@...nel.org>
To: Joanne Koong <joannelkoong@...il.com>
Cc: miklos@...redi.hu, bernd@...ernd.com, neal@...pa.dev,
	linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCHSET v6 4/8] fuse: allow servers to use iomap for better
 file IO performance

On Wed, Jan 28, 2026 at 05:12:54PM -0800, Joanne Koong wrote:

<snip>

> > > > > > Hrmm.  Now that /is/ an interesting proposal.  Does BPF have a data
> > > > > > structure that supports interval mappings?  I think the existing bpf map
> > > > >
> > > > > Not yet, but I don't see why a b+tree-like data structure couldn't be added.
> > > > > Maybe one workaround in the meantime that could work is using a sorted
> > > > > array map and doing binary search on that, until interval mappings can
> > > > > be natively supported?
> > > >
> > > > I guess, though I already had a C structure to borrow from xfs ;)
> > > >
> > > > > > only does key -> value.  Also, is there an upper limit on the size of a
> > > > > > map?  You could have hundreds of millions of maps for a very fragmented
> > > > > > regular file.
> > > > >
> > > > > If I'm remembering correctly, there's an upper limit on the number of
> > > > > map entries, which is bounded by u32
> > > >
> > > > That's problematic, since files can have 64-bit logical block numbers.
> > >
> > > The key size supports 64 bits. The u32 bound would be the limit on the
> > > number of extents for the file.
> >
> > Oh, ok.  If one treats the incore map as a cache and evicts things when
> > they get too old, then that would be fine.  I misread that as an upper
> > limit on the *range* of the map entry keys. :/
> 
> I think for more complicated servers, the bpf prog handling for
> iomap_begin() would essentially just serve as a cache where if it's
> not found in the cache, then it sends off the FUSE_IOMAP_BEGIN request
> to the server. For servers that don't need as much complicated logic
> (eg famfs), the iomap_begin() logic would just be executed within the
> bpf prog itself.

Yes, I like the fuse_iomap_begin logic flow of:

1. Try to use a mapping in the iext tree
2. Call a BPF program to try to generate a mapping
3. Issue a fuse command to userspace

wherein #2 and #3 can signal that #1 should be retried.  (This is
already provided by FUSE_IOMAP_TYPE_RETRY_CACHE, FWIW)
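
Concretely, I imagine the control flow looking something like this
sketch; everything here other than fuse_simple_request() and
FUSE_IOMAP_TYPE_RETRY_CACHE is a made-up name:

static int fuse_iomap_begin_flow(struct inode *inode, loff_t pos,
                                 loff_t len, unsigned int flags,
                                 struct iomap *iomap)
{
        int err;

retry:
        /* 1. fast path: consult the incore iext mapping cache */
        if (fuse_iomap_cache_lookup(inode, pos, len, iomap))
                return 0;

        /* 2. give a loaded bpf program a chance to produce a mapping */
        err = fuse_iomap_bpf_begin(inode, pos, len, flags, iomap);
        if (err != -EOPNOTSUPP)
                goto out;

        /* 3. slow path: round trip to the fuse server */
        err = fuse_iomap_send_begin(inode, pos, len, flags, iomap);
out:
        /* #2 and #3 can hand back FUSE_IOMAP_TYPE_RETRY_CACHE to redo #1 */
        if (!err && fuse_iomap_wants_retry(iomap))
                goto retry;
        return err;
}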

That said, BPF doesn't expose an interval btree data structure.  I think
it would be better to add the iext mapping cache and make it so that bpf
programs could call fuse_iomap_cache_{upsert,remove,lookup}.  You could
use the interval tree too, but the iext tree has the advantage of a
higher fanout factor.
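
If we did export those hooks to bpf, I'd guess the kfunc surface looks
roughly like this; all of the signatures (and the fuse_iomap_out type
name) are my guesses, not settled uapi:

/* hypothetical kfunc surface; signatures are guesses */
struct fuse_iomap_cache;
struct fuse_iomap_out;

int bpf_fuse_iomap_cache_lookup(struct fuse_iomap_cache *cache,
                                __u64 pos, __u64 len,
                                struct fuse_iomap_out *mapping);
int bpf_fuse_iomap_cache_upsert(struct fuse_iomap_cache *cache,
                                const struct fuse_iomap_out *mapping);
int bpf_fuse_iomap_cache_remove(struct fuse_iomap_cache *cache,
                                __u64 pos, __u64 len);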

> > As it stands, I need to figure out a way to trim the iomap btree when
> > memory gets tight.  Right now it'll drop the cache whenever someone
> > closes the file, but that won't help for long-life processes that open a
> > heavily fragmented file and never close it.
> >
> > A coding-intensive way to do that would be to register a shrinker and
> > deal with that, but ugh.  A really stupid way would be to drop the whole
> > cache once you get beyond (say) 64k of memory usage (~2000 mappings).
> 
> This kind of seems like another point in favor of giving userspace
> control of the caching layer. They could then implement whatever
> eviction policies they want.

Note that userspace already can control the cached iomappings --
FUSE_NOTIFY_IOMAP_UPSERT pushes a mapping into the iext tree, and
FUSE_NOTIFY_IOMAP_INVAL removes them.  The fuse server can decide to
evict whenever it pleases, though admittedly the iext tree doesn't track
usage information of any kind, so how would the fuse server know?

The static limit is merely the kernel's means to establish a hard limit
on the memory consumption of the iext tree, since it can't trust
userspace completely.
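
For reference, eviction from userspace is just the usual notify write
to /dev/fuse -- a fuse_out_header with unique == 0 and the notify code
in the error field, followed by the payload.  A minimal sketch; the
payload struct here is invented, the real layout lives in the
patchset's uapi header:

#include <errno.h>
#include <sys/uio.h>
#include <linux/fuse.h>         /* fuse_out_header */

/* invented payload layout, for illustration only */
struct fuse_notify_iomap_inval_out {
        __u64   nodeid;
        __u64   offset;
        __u64   length;
};

static int evict_mapping(int fuse_fd, __u64 nodeid, __u64 off, __u64 len)
{
        struct fuse_notify_iomap_inval_out arg = {
                .nodeid = nodeid, .offset = off, .length = len,
        };
        struct fuse_out_header hdr = {
                .len = sizeof(hdr) + sizeof(arg),
                .error = FUSE_NOTIFY_IOMAP_INVAL,   /* from the patchset */
                .unique = 0,    /* notifications carry no request id */
        };
        struct iovec iov[2] = {
                { .iov_base = &hdr, .iov_len = sizeof(hdr) },
                { .iov_base = &arg, .iov_len = sizeof(arg) },
        };

        return writev(fuse_fd, iov, 2) < 0 ? -errno : 0;
}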

> It also allows them to prepopulate the cache upfront (eg when
> servicing a file open request, if the file is below a certain size or
> if the server knows what'll be hot, it could put those extents into
> the map from the get-go).

Hrm.  I haven't tried issuing FUSE_NOTIFY_IOMAP_UPSERT during an open
call, but I suppose it's possible.

> in my opinion, the fuse-iomap layer should try to be as simple/minimal
> and as generic as possible. I haven't read through iomap_cache.c yet
> but the header comment suggests it's adapted from the xfs extent tree

Rudely copied, not adapted ;)

I actually wonder if I should make a horrible macro to generate the
fuse_iext_* structures and functions, and then xfs_iext_tree.c and
fuse_iomap_cache.c can "share" that hairba^Wcode.
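
There's precedent for that sort of thing in interval_tree_generic.h's
INTERVAL_TREE_DEFINE().  A completely hypothetical flavor of it:

/* hypothetical template macro; the real thing would emit the whole tree */
#define DEFINE_IEXT_TREE(name, rec_type)                                \
struct name##_iext_tree {                                               \
        struct name##_iext_node *root;                                  \
        unsigned int            height;                                 \
        unsigned int            nr_recs;                                \
};                                                                      \
                                                                        \
int name##_iext_insert(struct name##_iext_tree *tree,                   \
                       const rec_type *rec);                            \
rec_type *name##_iext_lookup(struct name##_iext_tree *tree, __u64 off);

/* each filesystem then instantiates its own flavor: */
DEFINE_IEXT_TREE(xfs, struct xfs_bmbt_irec);
DEFINE_IEXT_TREE(fuse, struct fuse_iomap_out);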

> cache. As I understand it, different filesystem implementations have
> different caching architectures that are better suited for their use
> cases

Err.  The way this evolved is ... way too long to go into in this email.
Here's a truncated version; I can tell you the full story next week.

Most filesystems store their file mapping data on disk in whatever
format the designers specified.  When the pagecache asks them to read
or write a folio, they attach buffer heads to the folio and fill out
the buffer heads with the minimum mapping information needed to map the
folio to disk addresses.  bios are then constructed for each folio
based on what's in the bufferheads.

This was fine for filesystems that map each block individually, such as
FFS/ext2/ext3/fat...

> (I'm guessing that's the case, otherwise there would just be one
> general cache inside iomap all the filesystems would use?). It seems a

...but newer filesystems such as xfs/ext4/btrfs map a bunch of blocks at
a time.  Each of them invented their own private incore mapping
structures to mirror the ondisk structure.  xfs kept using the old
bufferheads into the early 2010s, ext4 is still using them, and btrfs
went its own way from the start.

Eventually XFS grew its own internal extent-to-bio mapping code that
flipped the model -- rather than get a pagecache folio, map the folio to
blocks, and issue IOs based on the blocks, it would get the file
mapping, grab folios for the whole mapping, and issue bios for the batch
of folios.  That's more efficient, but at this point we have a legacy
codebase problem for everything else in fs/.

In 2019, hch and I decided to export the extent-to-bio mapping code from
xfs so that new filesystems could start with something cleaner than
bufferheads.  In the past 7 years, nobody's added a new filesystem with
complex mapping requirements; they've only ported existing filesystems
to it, without further refactoring of their incore data structures.
That's why there's no generic iomap cache.

> lot better to me to just let the userspace server define that
> themselves. And selfishly from the fuse perspective, would be less

Well, if I turned the iext code into a template, then fuse would only
need enough glue code to declare a template class and use it.  The glue
part is only ... 230 LOC.

> code we would have to maintain. And I guess too if some servers don't
> need caching (like famfs?), they could avoid that overhead.

Hrm.  Right now the struct fuse_iomap_cache is embedded in struct
fuse_inode, but that could be turned into a dynamic allocation.
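
Something like this, maybe (fi->iomap_cache being the hypothetical
pointer field):

/* hypothetical: allocate the cache lazily on first use so that
 * servers which never cache mappings pay only one pointer per inode */
static struct fuse_iomap_cache *fuse_iomap_cache_get(struct fuse_inode *fi)
{
        struct fuse_iomap_cache *cache = READ_ONCE(fi->iomap_cache);

        if (cache)
                return cache;

        cache = kzalloc(sizeof(*cache), GFP_NOFS);
        if (!cache)
                return NULL;

        /* publish the new cache; lose the race gracefully */
        if (cmpxchg(&fi->iomap_cache, NULL, cache) != NULL) {
                kfree(cache);
                cache = READ_ONCE(fi->iomap_cache);
        }
        return cache;
}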

> > > > > > At one point I suggested to the famfs maintainer that it might be
> > > > > > easier/better to implement the interleaved mapping lookups as bpf
> > > > > > programs instead of being stuck with a fixed format in the fuse
> > > > > > userspace abi, but I don't know if he ever implemented that.
> > > > >
> > > > > This seems like a good use case for it too
> > > > > >
> > > > > > > Is this your
> > > > > > > assessment of it as well or do you think the server-side logic for
> > > > > > > iomap_begin()/iomap_end() is too complicated to make this realistic?
> > > > > > > Asking because I'm curious whether this direction makes sense, not
> > > > > > > because I think it would be a blocker for your series.
> > > > > >
> > > > > > For disk-based filesystems I think it would be difficult to model a bpf
> > > > > > program to do mappings, since they can basically point anywhere and be
> > > > > > of any size.
> > > > >
> > > > > Hmm I'm not familiar enough with disk-based filesystems to know what
> > > > > the "point anywhere and be of any size" means. For the mapping stuff,
> > > > > doesn't it just point to a block number? Or are you saying the problem
> > > > > would be there's too many mappings since a mapping could be any size?
> > > >
> > > > The second -- mappings can be any size, and unprivileged userspace can
> > > > control the mappings.
> > >
> > > If I'm understanding what you're saying here, this is the same
> > > discussion as the one above about the u32 bound, correct?
> >
> > A different thing -- file data mappings are irregularly sized, can
> > contain sparse holes, etc.  Userspace controls the size and offset of
> > each mapping record (thanks to magic things like fallocate) so it'd be
> > very difficult to create a bpf program to generate mappings on the fly.
> 
> Would the bpf prog have to generate mappings on the fly though? If the
> userspace does things like fallocate, those operations would still go
> through to the server as a regular request (eg FUSE_FALLOCATE) and on
> the server side, it'd add that to the map dynamically from userspace.

That depends on the fuse server design.  For simple things like famfs,
where the layout is bog simple and there are no fancy features like
delayed allocation or unwritten extents, you could probably get away
with a BPF program that generates the entire mapping set.  I suspect an
object-store type filesystem (aka write a file once, close it, snapshot
it, and never change it again) might be good at landing all the file
data in relatively few extent mappings, and it could actually compile a
custom bpf program for that file and push it to the kernel.

> > Also you could have 2^33 mapping records for a file, so I think you
> > can't even write a bpf program that large.
> 
> I think this depends on what map structure gets used. If there is
> native support added for b+ tree like data structures, I don't see why
> it wouldn't be able to.

<nod>

> > > > > I was thinking the issue would be more that there might be other logic
> > > > > inside ->iomap_begin()/->iomap_end() besides the mapping stuff that
> > > > > would need to be done that would be too out-of-scope for bpf. But I
> > > > > think I need to read through the fuse4fs stuff to understand more what
> > > > > it's doing in those functions.
> > >
> > > Looking at fuse4fs logic cursorily, it seems doable? What I like about
> > > offloading this to bpf too is it would also then allow John's famfs to
> > > just go through your iomap plumbing as a use case of it instead of
> > > being an entirely separate thing. Though maybe there's some other
> > > reason for that that you guys have discussed prior. In any case, I'll
> > > ask this on John's main famfs patchset. It kind of seems to me that
> > > you guys are pretty much doing the exact same thing conceptually.
> >
> > Yes, though John's famfs has the nice property that memory controller
> > interleaving is mathematically regular and likely makes for a compact
> > bpf program.
> 
> I tried out integrating the bpf hooks into fuse for iomap_begin() just
> to see if it was realistic and it seems relatively straightforward so
> far (though maybe the devil is in the details...). I used the

Ok, now *that's* interesting!  I guess I had better push the latest
fuse-iomap code ... but I cannot share a link, because I cannot get
through the @!#%%!!! kernel.org anubis bullcrap.

So I generated a pull request and I *think* this munged URL will work
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-service-container_2026-01-29

Or I guess you could just git-pull this:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git tags/fuse-service-container_2026-01-29

> drivers/hid/bpf/hid_bpf_struct_ops.c program as a model for how to set
> up the fuse bpf struct ops on the kernel side. calling it from
> file_iomap.c looks something like
> 
> static int fuse_iomap_begin(...)
> {
>         ...
>         struct fuse_bpf_ops *bpf_ops = fuse_get_bpf_ops();
>         ...
>         err = -EOPNOTSUPP;
>         if (bpf_ops && bpf_ops->iomap_begin)
>                 err = bpf_ops->iomap_begin(inode, pos, len, flags, &outarg);
>         if (err)
>                 err = fuse_simple_request(fm, &args);
>         ...
> }

I'm curious what the rest of the bpf integration code looks like.

> and I was able to verify that iomap_begin() is able to return back
> populated outarg fields from the bpf prog. If we were to actually
> implement it I'm sure it'd be more complicated (eg we'd need to make
> the fuse_bpf_ops registered per-connection, etc) but on the whole it

What is a fuse_bpf_ops?  I'm assuming that's the attachment point for a
bpf program that the fuse server would compile?  In which case, yes, I
think that ought to be per-connection.

So the bpf program can examine the struct inode and the pos/len/opflags
fields; and from that information it has to write the appropriate fields
in &outarg?  That's new, I didn't think bpf was allowed to write to
kernel memory.  But it's been a few years since I last touched the bpf
internals.

Some bpf programs might just know how to fill out outarg on their own
(e.g. famfs memory interleaving) but other bpf programs might perform a
range query on some imaginary bpf-interval-tree wherein you can do a
fast lookup based on (inumber, pos, len)?

I guess that's an interesting question -- would each fuse connection
have one big bpf-interval-tree?  Or would you shard things by inode to
reduce contention?  And if you sharded like that, then would you need a
fuse_bpf_ops per inode?

(I'm imagining that the fuse_bpf_ops might be where you'd stash the root
of the bpf data structure, but I know nothing of bpf internals ;))
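
Going by the hid_bpf_struct_ops model you mentioned, I'd guess the two
halves look something like this; the ops struct layout, the outarg
field names, and the famfs arithmetic are all invented:

/* kernel side: the per-connection attachment point */
struct fuse_bpf_ops {
        int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t len,
                           unsigned int flags,
                           struct fuse_iomap_out *outarg);
};

/* ---- bpf side, compiled into the object the server loads ---- */
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("struct_ops/iomap_begin")
int BPF_PROG(famfs_iomap_begin, struct inode *inode, loff_t pos,
             loff_t len, unsigned int flags,
             struct fuse_iomap_out *outarg)
{
        /* famfs-style arithmetic mapping; field names invented */
        outarg->offset = pos;
        outarg->length = len;
        return 0;
}

SEC(".struct_ops.link")
struct fuse_bpf_ops famfs_ops = {
        .iomap_begin = (void *)famfs_iomap_begin,
};

char LICENSE[] SEC("license") = "GPL";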

Rolling on: how easy is it for a userspace program to compile and upload
bpf programs into the kernel?  I've played around with bcc enough to
write some fairly stupid latency tracing tools for XFS, but at the end
of the day it's still python scripts feeding a string full of maybe-C
into whatever bcc does under the hood.

I /think/ it calls clang on the provided text, links that against the
current kernel's header files, and pushes the compiled bpf binary into
the kernel, right?  So fuse4fs would have to learn how to do that; and
now fuse4fs has a runtime dependency on libllvm.
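
(Though if libbpf + CO-RE works here, the server could ship a
precompiled .o and skip the runtime compile entirely.  A sketch,
assuming the famfs_ops struct_ops map from the earlier fragment:)

/* build once: clang -O2 -g -target bpf -c fuse_iomap.bpf.c -o fuse_iomap.bpf.o */
#include <errno.h>
#include <bpf/libbpf.h>

static int load_iomap_prog(const char *path)
{
        struct bpf_object *obj;
        struct bpf_map *ops_map;

        obj = bpf_object__open_file(path, NULL);
        if (!obj)
                return -errno;
        if (bpf_object__load(obj))
                return -errno;

        /* "famfs_ops" is the struct_ops variable from the earlier sketch */
        ops_map = bpf_object__find_map_by_name(obj, "famfs_ops");
        if (!ops_map)
                return -ENOENT;

        /* the returned bpf_link is leaked here for brevity */
        return bpf_map__attach_struct_ops(ops_map) ? 0 : -errno;
}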

And while I'm on the topic of fuse-bpf uapi: It's ok for us to expose
primitive-typed variables (pos/len/opflags) and existing fuse uapi
directly to a bpf program, but I don't think we should expose struct
inode/fuse_inode.  Maybe just fuse_inode::nodeid?  If we're careful not
to allow #include'ing structured types in the fuse bpf code, then
perhaps the bpf programs could be compiled at the same time as the fuse
server.
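
In other words, maybe the bpf program only ever sees a flat context
like this (layout invented):

/* hypothetical fuse-bpf uapi context: primitives and existing fuse
 * uapi only, no struct inode/fuse_inode pointers leak through */
struct fuse_bpf_iomap_ctx {
        __u64   nodeid;         /* fuse_inode::nodeid, not a pointer */
        __u64   pos;
        __u64   len;
        __u32   opflags;
        __u32   padding;
        struct fuse_iomap_out   outarg; /* filled out by the program */
};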

> seems doable. My worry is that if we land the iomap cache patchset now
> then we can't remove it in the future without breaking backwards
> compatibility for being a performance regression (though maybe we can
> since the fuse-iomap stuff is experimental?), so imo it'd be great if

I don't think it's a huge problem to remove functionality while the
EXPERIMENTAL warnings are in place.  We'd forever lose the command codes
for FUSE_NOTIFY_IOMAP_UPSERT and FUSE_NOTIFY_IOMAP_INVAL, but we've only
used 12 out of INT_MAX so that's not likely to be a concern.

> we figured out what direction we want to go before landing the cache
> stuff. And I think we need to have this conversation too on the main
> famfs patchset (eg whether it should go through your general iomap
> plumbing with bpf helpers vs. being a separate implementation) since
> once that lands, it'd be irrevocable.

I'm of two minds on that -- John got here first, so I don't want to
delay his patchset whilst I slowly work on this thing.  OTOH from an
architecture standpoint we probably ought to push for three ways for a
fuse server to upload mappings:

1. Upserting mappings with arbitrary offset and size into a cache
2. Self-contained bpf program that can generate any mapping
3. Sprawling bpf program that can read any other artifacts that another
   bpf program might have set up for it

But yeah, let's involve John.

--D

> 
> Thanks,
> Joanne
> >
> > --D
> >
> > > Thanks,
> > > Joanne
> > >
> > > >
> > > > <nod>
> > > >
> > > > --D
> > > >
> > > > >
> > > > > Thanks,
> > > > > Joanne
> > > > >
> > > > > >
> > > > > > OTOH it would be enormously hilarious to me if one could load a file
> > > > > > mapping predictive model into the kernel as a bpf program and use that
> > > > > > as a first tier before checking the in-memory btree mapping cache from
> > > > > > patchset 7.  Quite a few years ago now there was a FAST paper
> > > > > > establishing that even a stupid linear regression model could in theory
> > > > > > beat a disk btree lookup.
> > > > > >
> > > > > > --D
> > > > > >
> > > > > > > Thanks,
> > > > > > > Joanne
> > > > > > >
> > > > > > > >
> > > > > > > > If you're going to start using this code, I strongly recommend pulling
> > > > > > > > from my git trees, which are linked below.
> > > > > > > >
> > > > > > > > This has been running on the djcloud for months with no problems.  Enjoy!
> > > > > > > > Comments and questions are, as always, welcome.
> > > > > > > >
> > > > > > > > --D
> > > > > > > >
> > > > > > > > kernel git tree:
> > > > > > > > https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-fileio
> > > > > > > > ---
> > > > > > > > Commits in this patchset:
> > > > > > > >  * fuse: implement the basic iomap mechanisms
> > > > > > > >  * fuse_trace: implement the basic iomap mechanisms
> > > > > > > >  * fuse: make debugging configurable at runtime
> > > > > > > >  * fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
> > > > > > > >  * fuse_trace: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
> > > > > > > >  * fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount
> > > > > > > >  * fuse: create a per-inode flag for toggling iomap
> > > > > > > >  * fuse_trace: create a per-inode flag for toggling iomap
> > > > > > > >  * fuse: isolate the other regular file IO paths from iomap
> > > > > > > >  * fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
> > > > > > > >  * fuse_trace: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
> > > > > > > >  * fuse: implement direct IO with iomap
> > > > > > > >  * fuse_trace: implement direct IO with iomap
> > > > > > > >  * fuse: implement buffered IO with iomap
> > > > > > > >  * fuse_trace: implement buffered IO with iomap
> > > > > > > >  * fuse: implement large folios for iomap pagecache files
> > > > > > > >  * fuse: use an unrestricted backing device with iomap pagecache io
> > > > > > > >  * fuse: advertise support for iomap
> > > > > > > >  * fuse: query filesystem geometry when using iomap
> > > > > > > >  * fuse_trace: query filesystem geometry when using iomap
> > > > > > > >  * fuse: implement fadvise for iomap files
> > > > > > > >  * fuse: invalidate ranges of block devices being used for iomap
> > > > > > > >  * fuse_trace: invalidate ranges of block devices being used for iomap
> > > > > > > >  * fuse: implement inline data file IO via iomap
> > > > > > > >  * fuse_trace: implement inline data file IO via iomap
> > > > > > > >  * fuse: allow more statx fields
> > > > > > > >  * fuse: support atomic writes with iomap
> > > > > > > >  * fuse_trace: support atomic writes with iomap
> > > > > > > >  * fuse: disable direct reclaim for any fuse server that uses iomap
> > > > > > > >  * fuse: enable swapfile activation on iomap
> > > > > > > >  * fuse: implement freeze and shutdowns for iomap filesystems
> > > > > > > > ---
> > > > > > > >  fs/fuse/fuse_i.h          |  161 +++
> > > > > > > >  fs/fuse/fuse_trace.h      |  939 +++++++++++++++++++
> > > > > > > >  fs/fuse/iomap_i.h         |   52 +
> > > > > > > >  include/uapi/linux/fuse.h |  219 ++++
> > > > > > > >  fs/fuse/Kconfig           |   48 +
> > > > > > > >  fs/fuse/Makefile          |    1
> > > > > > > >  fs/fuse/backing.c         |   12
> > > > > > > >  fs/fuse/dev.c             |   30 +
> > > > > > > >  fs/fuse/dir.c             |  120 ++
> > > > > > > >  fs/fuse/file.c            |  133 ++-
> > > > > > > >  fs/fuse/file_iomap.c      | 2230 +++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > >  fs/fuse/inode.c           |  162 +++
> > > > > > > >  fs/fuse/iomode.c          |    2
> > > > > > > >  fs/fuse/trace.c           |    2
> > > > > > > >  14 files changed, 4056 insertions(+), 55 deletions(-)
> > > > > > > >  create mode 100644 fs/fuse/iomap_i.h
> > > > > > > >  create mode 100644 fs/fuse/file_iomap.c
> > > > > > > >
> > > > > > >
> > >
> 
