Message-ID: <CAJnrk1ag3ffQC=U1ZXVLTipDyo1VBQBM3MYNB6=6d4ywLOEieA@mail.gmail.com>
Date: Thu, 29 Jan 2026 14:50:23 -0800
From: Joanne Koong <joannelkoong@...il.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: miklos@...redi.hu, bernd@...ernd.com, neal@...pa.dev, 
	linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCHSET v6 4/8] fuse: allow servers to use iomap for better
 file IO performance

On Thu, Jan 29, 2026 at 12:02 PM Darrick J. Wong <djwong@...nel.org> wrote:
>
> On Wed, Jan 28, 2026 at 05:12:54PM -0800, Joanne Koong wrote:
>
> <snip>
>
> > > > > > > Hrmm.  Now that /is/ an interesting proposal.  Does BPF have a data
> > > > > > > structure that supports interval mappings?  I think the existing bpf map
> > > > > >
> > > > > > Not yet, but I don't see why a b+tree-like data structure couldn't be added.
> > > > > > Maybe one workaround in the meantime that could work is using a sorted
> > > > > > array map and doing binary search on that, until interval mappings can
> > > > > > be natively supported?
> > > > >
> > > > > I guess, though I already had a C structure to borrow from xfs ;)
> > > > >
> > > > > > > only does key -> value.  Also, is there an upper limit on the size of a
> > > > > > > map?  You could have hundreds of millions of maps for a very fragmented
> > > > > > > regular file.
> > > > > >
> > > > > > If I'm remembering correctly, there's an upper limit on the number of
> > > > > > map entries, which is bounded by u32
> > > > >
> > > > > That's problematic, since files can have 64-bit logical block numbers.
> > > >
> > > > The key size supports 64 bits. The u32 bound would be the limit on the
> > > > number of extents for the file.
> > >
> > > Oh, ok.  If one treats the incore map as a cache and evicts things when
> > > they get too old, then that would be fine.  I misread that as an upper
> > > limit on the *range* of the map entry keys. :/
> >
> > I think for more complicated servers, the bpf prog handling for
> > iomap_begin() would essentially just serve as a cache: if the mapping
> > isn't found there, it sends off the FUSE_IOMAP_BEGIN request
> > to the server. For servers that don't need as much complicated logic
> > (eg famfs), the iomap_begin() logic would just be executed within the
> > bpf prog itself.
>
> Yes, I like the fuse_iomap_begin logic flow of:
>
> 1. Try to use a mapping in the iext tree
> 2. Call a BPF program to try to generate a mapping
> 3. Issue a fuse command to userspace
>
> wherein #2 and #3 can signal that #1 should be retried.  (This is
> already provided by FUSE_IOMAP_TYPE_RETRY_CACHE, FWIW)
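
As an aside, here's how I'm picturing that flow on the kernel side (a
rough sketch only -- every helper name below is made up; the only real
bit is FUSE_IOMAP_TYPE_RETRY_CACHE from your series):

static int fuse_iomap_begin_tiered(struct inode *inode, loff_t pos,
				   loff_t length, unsigned int flags,
				   struct fuse_iomap_io *io)
{
	int err;

retry:
	/* 1. Try a mapping already cached in the iext tree. */
	if (fuse_iomap_cache_lookup(inode, pos, length, io))
		return 0;

	/* 2. Ask an attached bpf program to generate a mapping. */
	err = fuse_bpf_iomap_begin(inode, pos, length, flags, io);
	if (err == -EOPNOTSUPP)
		/* 3. Fall back to a FUSE_IOMAP_BEGIN round trip. */
		err = fuse_send_iomap_begin(inode, pos, length, flags, io);

	/* #2 and #3 may have upserted into the cache; retry tier 1. */
	if (!err && io->type == FUSE_IOMAP_TYPE_RETRY_CACHE)
		goto retry;
	return err;
}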
>
> That said, BPF doesn't expose an interval btree data structure.  I think
> it would be better to add the iext mapping cache and make it so that bpf
> programs could call fuse_iomap_cache_{upsert,remove,lookup}.  You could
> use the interval tree too, but the iext tree has the advantage of a
> higher fanout factor.
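
Exposing those to bpf progs seems doable via kfuncs, I think. On the
prog side that would look roughly like this (sketch; I'm guessing at
the signatures, and this assumes the usual vmlinux.h + bpf_tracing.h
setup):

/* Hypothetical kfunc declarations for the iext cache helpers. */
extern int fuse_iomap_cache_lookup(struct inode *inode, __u64 pos, __u64 len,
				   struct fuse_iomap_io *io) __ksym;
extern int fuse_iomap_cache_upsert(struct inode *inode,
				   const struct fuse_iomap_io *io) __ksym;
extern int fuse_iomap_cache_remove(struct inode *inode, __u64 pos,
				   __u64 len) __ksym;

SEC("struct_ops/iomap_begin")
int BPF_PROG(cache_first_iomap_begin, struct inode *inode, loff_t pos,
	     loff_t length, unsigned int flags, struct fuse_iomap_io *out_io)
{
	/* Cache hit: the lookup kfunc filled out *out_io for us. */
	if (!fuse_iomap_cache_lookup(inode, pos, length, out_io))
		return 0;
	/* Miss: punt to a FUSE_IOMAP_BEGIN round trip to the server. */
	return -EOPNOTSUPP;
}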
>
> > > As it stands, I need to figure out a way to trim the iomap btree when
> > > memory gets tight.  Right now it'll drop the cache whenever someone
> > > closes the file, but that won't help for long-life processes that open a
> > > heavily fragmented file and never close it.
> > >
> > > A coding-intensive way to do that would be to register a shrinker and
> > > deal with that, but ugh.  A really stupid way would be to drop the whole
> > > cache once you get beyond (say) 64k of memory usage (~2000 mappings).
> >
> > This kind of seems like another point in favor of giving userspace
> > control of the caching layer. They could then implement whatever
> > eviction policies they want.
>
> Note that userspace already can control the cached iomappings --
> FUSE_NOTIFY_IOMAP_UPSERT pushes a mapping into the iext tree, and
> FUSE_NOTIFY_IOMAP_INVAL removes them.  The fuse server can decide to

This incurs round-trip context-switch costs though, which the bpf prog
approach avoids.

> evict whenever it pleases, though admittedly the iext tree doesn't track
> usage information of any kind, so how would the fuse server know?
>
> The static limit is merely the kernel's means to establish a hard limit
> on the memory consumption of the iext tree, since it can't trust
> userspace completely.
>
> > It also allows them to prepopulate the cache upfront (eg when
> > servicing a file open request, if the file is below a certain size or
> > if the server knows what'll be hot, it could put those extents into
> > the map from the get-go).
>
> Hrm.  I haven't tried issuing FUSE_NOTIFY_IOMAP_UPSERT during an open
> call, but I suppose it's possible.
>
> > In my opinion, the fuse-iomap layer should try to be as simple/minimal
> > and as generic as possible. I haven't read through iomap_cache.c yet
> > but the header comment suggests it's adapted from the xfs extent tree
>
> Rudely copied, not adapted ;)
>
> I actually wonder if I should make a horrible macro to generate the
> fuse_iext_* structures and functions, and then xfs_iext_tree.c and
> fuse_iomap_cache.c can "share" that hairba^Wcode.
>
> > cache. As I understand it, different filesystem implementations have
> > different caching architectures that are better suited for their use
> > cases
>
> Err.  The way this evolved is ... way too long to go into in this email.
> Here's a truncated version; I can tell you the full story next week.
>
> Most filesystems store their file mapping data on disk in whatever
> format the designers specified.  When the pagecache asks them to read
> or write the cache, they attach buffer heads to the folio and fill out
> the buffer heads with the minimum mapping information needed to map the
> folios to disk addresses.  bios are constructed for each folio based on
> what's in the bufferhead.
>
> This was fine for filesystems that map each block individually, such as
> FFS/ext2/ext3/fat...
>
> > (I'm guessing that's the case, otherwise there would just be one
> > general cache inside iomap that all the filesystems would use?). It seems a
>
> ...but newer filesystems such as xfs/ext4/btrfs map a bunch of blocks at
> a time.  Each of them invented their own private incore mapping
> structures to mirror the ondisk structure.  xfs kept using the old
> bufferheads into the early 2010s, ext4 is still using them, and btrfs
> went its own way from the start.
>
> Eventually XFS grew its own internal extent-to-bio mapping code that
> flipped the model -- rather than get a pagecache folio, map the folio to
> blocks, and issue IOs based on the blocks, it would get the file
> mapping, grab folios for the whole mapping, and issue bios for the batch
> of folios.  That's more efficient, but at this point we have a legacy
> codebase problem for everything else in fs/.
>
> In 2019, hch and I decided to export the extent-to-bio mapping code from
> xfs so that new filesystems could start with something cleaner than
> bufferheads.  In the past 7 years, nobody's added a new filesystem with
> complex mapping requirements; they've only ported existing filesystems
> to it, without further refactoring of their incore data structures.
> That's why there's no generic iomap cache.

Oh I see, so it actually *is* a generic cache? It's just that the other
filesystems haven't ported over to it yet? That changes my opinion a
lot on this then. If it's a generic cache that pretty much any modern
filesystem should use, then it seems reasonable to me to have it on
the fuse iomap kernel side. Though in that case, it seems a lot
cleaner imo if the cache could be ported to the iomap layer as a
generic cache (which seems like it'd be useful anyway for other
filesystems to use if/when they port over to it, if I'm understanding
what you're saying correctly), and then fuse would just call into that
API.
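
Concretely, I'm imagining the iomap layer exporting something along
these lines (names entirely made up, just to show the shape of the API
I mean), with fuse's ->iomap_begin calling the lookup before falling
back to a FUSE_IOMAP_BEGIN request:

/* Hypothetical generic extent cache in fs/iomap/; sketch only. */
struct iomap_cache;	/* opaque; would wrap the iext btree */

struct iomap_cache *iomap_cache_alloc(gfp_t gfp);
void iomap_cache_free(struct iomap_cache *cache);

/* Insert or overwrite the cached mapping for [offset, offset+length). */
int iomap_cache_upsert(struct iomap_cache *cache, const struct iomap *map);

/* Drop any cached mappings overlapping [offset, offset+length). */
void iomap_cache_remove(struct iomap_cache *cache, loff_t offset, u64 length);

/* Returns true and fills *map if a cached mapping covers offset. */
bool iomap_cache_lookup(struct iomap_cache *cache, loff_t offset, u64 length,
			struct iomap *map);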

>
> > lot better to me to just let the userspace server define that
> > itself. And selfishly from the fuse perspective, it would be less
>
> Well if I turned the iext code into a template then fuse would only need
> enough glue code to declare a template class and use it.  The glue part
> is only ... 230LOC.

Nice, I think this would be a lot cleaner / less of a headache in the
future for fuse to maintain.
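
For what it's worth, I'm picturing the usual name-pasting template
header trick, something like this (sketch only, obviously not your
actual tree):

/* iext_template.h -- the includer defines IEXT_PREFIX before including. */
#define IEXT_PASTE_(p, n)	p##_##n
#define IEXT_PASTE(p, n)	IEXT_PASTE_(p, n)
#define IEXT(n)			IEXT_PASTE(IEXT_PREFIX, n)

struct IEXT(iext_tree) {
	void *root;		/* stand-in for the real btree innards */
	unsigned int nr_extents;
};

static inline bool IEXT(iext_empty)(const struct IEXT(iext_tree) *tree)
{
	return tree->nr_extents == 0;
}

#undef IEXT
#undef IEXT_PASTE
#undef IEXT_PASTE_

so that '#define IEXT_PREFIX fuse' plus '#include "iext_template.h"'
generates struct fuse_iext_tree and fuse_iext_empty(), and xfs
instantiates the same header with its own prefix.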

>
> > code we would have to maintain. And I guess too if some servers don't
> > need caching (like famfs?), they could avoid that overhead.
>
> Hrm.  Right now the struct fuse_iomap_cache is embedded in struct
> fuse_inode, but that could be turned into a dynamic allocation.
>
> > > > > > > At one point I suggested to the famfs maintainer that it might be
> > > > > > > easier/better to implement the interleaved mapping lookups as bpf
> > > > > > > programs instead of being stuck with a fixed format in the fuse
> > > > > > > userspace abi, but I don't know if he ever implemented that.
> > > > > >
> > > > > > This seems like a good use case for it too
> > > > > > >
> > > > > > > > Is this your
> > > > > > > > assessment of it as well or do you think the server-side logic for
> > > > > > > > iomap_begin()/iomap_end() is too complicated to make this realistic?
> > > > > > > > Asking because I'm curious whether this direction makes sense, not
> > > > > > > > because I think it would be a blocker for your series.
> > > > > > >
> > > > > > > For disk-based filesystems I think it would be difficult to model a bpf
> > > > > > > program to do mappings, since they can basically point anywhere and be
> > > > > > > of any size.
> > > > > >
> > > > > > Hmm I'm not familiar enough with disk-based filesystems to know what
> > > > > > the "point anywhere and be of any size" means. For the mapping stuff,
> > > > > > doesn't it just point to a block number? Or are you saying the problem
> > > > > > would be there's too many mappings since a mapping could be any size?
> > > > >
> > > > > The second -- mappings can be any size, and unprivileged userspace can
> > > > > control the mappings.
> > > >
> > > > If I'm understanding what you're saying here, this is the same
> > > > discussion as the one above about the u32 bound, correct?
> > >
> > > A different thing -- file data mappings are irregularly sized, can
> > > contain sparse holes, etc.  Userspace controls the size and offset of
> > > each mapping record (thanks to magic things like fallocate) so it'd be
> > > very difficult to create a bpf program to generate mappings on the fly.
> >
> > Would the bpf prog have to generate mappings on the fly though? If the
> > userspace does things like fallocate, those operations would still go
> > through to the server as a regular request (eg FUSE_FALLOCATE) and on
> > the server side, it'd add that to the map dynamically from userspace.
>
> That depends on the fuse server design.  For simple things like famfs
> where the layout is bog simple and there are no fancy features like
> delayed allocation or unwritten extents, then you could probably get
> away with a BPF program to generate the entire mapping set.  I suspect an
> object-store type filesystem (aka write a file once, close it, snapshot
> it, and never change it again) might be good at landing all the file
> data in relatively few extent mappings, and it could actually compile a
> custom bpf program for that file and push it to the kernel.
>
> > > Also you could have 2^33 mapping records for a file, so I think you
> > > can't even write a bpf program that large.
> >
> > I think this depends on what map structure gets used. If there is
> > native support added for b+tree-like data structures, I don't see why
> > it wouldn't be able to.
>
> <nod>
>
> > > > > > I was thinking the issue would be more that there might be other logic
> > > > > > inside ->iomap_begin()/->iomap_end() besides the mapping stuff that
> > > > > > would need to be done but would be too out-of-scope for bpf. But I
> > > > > > think I need to read through the fuse4fs stuff to understand better
> > > > > > what it's doing in those functions.
> > > >
> > > > Looking at the fuse4fs logic cursorily, it seems doable? What I like
> > > > about offloading this to bpf too is that it would then allow John's
> > > > famfs to just go through your iomap plumbing as a use case of it
> > > > instead of being an entirely separate thing. Though maybe there's
> > > > some other reason for that which you guys have discussed prior. In
> > > > any case, I'll ask this on John's main famfs patchset. It kind of
> > > > seems to me that you guys are pretty much doing the exact same thing
> > > > conceptually.
> > >
> > > Yes, though John's famfs has the nice property that memory controller
> > > interleaving is mathematically regular and likely makes for a compact
> > > bpf program.
> >
> > I tried out integrating the bpf hooks into fuse for iomap_begin() just
> > to see if it was realistic and it seems relatively straightforward so
> > far (though maybe the devil is in the details...). I used the
>
> Ok, now *that's* interesting!  I guess I had better push the latest
> fuse-iomap code ... but I cannot share a link, because I cannot get
> through the @!#%%!!! kernel.org anubis bullcrap.
>
> So I generated a pull request and I *think* this munged URL will work
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-service-container_2026-01-29
>
> Or I guess you could just git-pull this:
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git tags/fuse-service-container_2026-01-29
>
> > drivers/hid/bpf/hid_bpf_struct_ops.c program as a model for how to set
> > up the fuse bpf struct ops on the kernel side. calling it from
> > file_iomap.c looks something like
> >
> > static int fuse_iomap_begin(...)
> > {
> >         ...
> >         struct fuse_bpf_ops *bpf_ops = fuse_get_bpf_ops();
> >         ...
> >         err = -EOPNOTSUPP;
> >         if (bpf_ops && bpf_ops->iomap_begin)
> >                 err = bpf_ops->iomap_begin(inode, pos, len, flags, &outarg);
> >         if (err)
> >                 err = fuse_simple_request(fm, &args);
> >         ...
> > }
>
> I'm curious what the rest of the bpf integration code looks like.

This is the code I had yesterday (I didn't know how to run the fuse4fs
stuff, so I used passthrough_hp as the server and had the trigger go
through statfs):


diff --git a/fs/fuse/Makefile b/fs/fuse/Makefile
index 22ad9538dfc4..10c3939f4cf3 100644
--- a/fs/fuse/Makefile
+++ b/fs/fuse/Makefile
@@ -13,6 +13,7 @@ obj-$(CONFIG_VIRTIO_FS) += virtiofs.o
 fuse-y := trace.o      # put trace.o first so we see ftrace errors sooner
 fuse-y += dev.o dir.o file.o inode.o control.o xattr.o acl.o readdir.o ioctl.o
 fuse-y += iomode.o
+fuse-y += fuse_bpf.o
 fuse-$(CONFIG_FUSE_DAX) += dax.o
 fuse-$(CONFIG_FUSE_PASSTHROUGH) += passthrough.o backing.o
 fuse-$(CONFIG_SYSCTL) += sysctl.o
diff --git a/fs/fuse/fuse_bpf.c b/fs/fuse/fuse_bpf.c
new file mode 100644
index 000000000000..637cf152e997
--- /dev/null
+++ b/fs/fuse/fuse_bpf.c
@@ -0,0 +1,123 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bpf.h>
+
+#include "fuse_i.h"
+#include "fuse_dev_i.h"
+#include "fuse_bpf.h"
+
+static struct btf *fuse_bpf_ops_btf;
+static struct fuse_bpf_ops *active_bpf_ops;
+
+static int fuse_bpf_ops_init(struct btf *btf)
+{
+       fuse_bpf_ops_btf = btf;
+       return 0;
+}
+
+static bool fuse_bpf_ops_is_valid_access(int off, int size,
+                                         enum bpf_access_type type,
+                                         const struct bpf_prog *prog,
+                                         struct bpf_insn_access_aux *info)
+{
+       return bpf_tracing_btf_ctx_access(off, size, type, prog, info);
+}
+
+static int fuse_bpf_ops_check_member(const struct btf_type *t,
+                                     const struct btf_member *member,
+                                     const struct bpf_prog *prog)
+{
+       return 0;
+}
+
+static int fuse_bpf_ops_btf_struct_access(struct bpf_verifier_log *log,
+                                          const struct bpf_reg_state *reg,
+                                          int off, int size)
+{
+       return 0;
+}
+
+static const struct bpf_verifier_ops fuse_bpf_verifier_ops = {
+       .get_func_proto = bpf_base_func_proto,
+       .is_valid_access = fuse_bpf_ops_is_valid_access,
+       .btf_struct_access = fuse_bpf_ops_btf_struct_access,
+};
+
+static int fuse_bpf_ops_init_member(const struct btf_type *t,
+                                   const struct btf_member *member,
+                                   void *kdata, const void *udata)
+{
+       const struct fuse_bpf_ops *u_ops = udata;
+       struct fuse_bpf_ops *ops = kdata;
+       u32 moff;
+
+       moff = __btf_member_bit_offset(t, member) / 8;
+       switch (moff) {
+       case offsetof(struct fuse_bpf_ops, name):
+               if (bpf_obj_name_cpy(ops->name, u_ops->name,
+                                    sizeof(ops->name)) <= 0)
+                       return -EINVAL;
+               return 1;       /* Handled */
+       }
+
+       /* Not handled, use default */
+       return 0;
+}
+
+static int fuse_bpf_reg(void *kdata, struct bpf_link *link)
+{
+       struct fuse_bpf_ops *ops = kdata;
+
+       active_bpf_ops = ops;
+
+       printk("fuse_bpf: registered ops '%s'\n", ops->name);
+
+       return 0;
+}
+
+static void fuse_bpf_unreg(void *kdata, struct bpf_link *link)
+{
+       struct fuse_bpf_ops *ops = kdata;
+
+       if (active_bpf_ops == ops)
+               active_bpf_ops = NULL;
+
+       printk("fuse_bpf: unregistered ops '%s'\n", ops->name);
+}
+
+static int __iomap_begin(struct inode *inode, loff_t pos,
+                        loff_t length, unsigned int flags,
+                        struct fuse_iomap_io *out_io)
+{
+       printk("stub __iomap_begin(). should never get called\n");
+       return 0;
+}
+
+static struct fuse_bpf_ops __fuse_bpf_ops = {
+       .iomap_begin = __iomap_begin,
+};
+
+static struct bpf_struct_ops fuse_bpf_struct_ops = {
+       .verifier_ops = &fuse_bpf_verifier_ops,
+       .init = fuse_bpf_ops_init,
+       .check_member = fuse_bpf_ops_check_member,
+       .init_member = fuse_bpf_ops_init_member,
+       .reg = fuse_bpf_reg,
+       .unreg = fuse_bpf_unreg,
+       .name = "fuse_bpf_ops",
+       .cfi_stubs = &__fuse_bpf_ops,
+       .owner = THIS_MODULE,
+};
+
+struct fuse_bpf_ops *fuse_get_bpf_ops(void)
+{
+       return active_bpf_ops;
+}
+
+int fuse_bpf_init(void)
+{
+       return register_bpf_struct_ops(&fuse_bpf_struct_ops, fuse_bpf_ops);
+}
+
+BTF_ID_LIST_GLOBAL_SINGLE(btf_fuse_bpf_ops_id, struct, fuse_bpf_ops)
+BTF_ID_LIST_GLOBAL_SINGLE(btf_fuse_iomap_io_id, struct, fuse_iomap_io)
diff --git a/fs/fuse/fuse_bpf.h b/fs/fuse/fuse_bpf.h
new file mode 100644
index 000000000000..d9482b64642b
--- /dev/null
+++ b/fs/fuse/fuse_bpf.h
@@ -0,0 +1,29 @@
+#ifndef _FS_FUSE_BPF_H
+#define _FS_FUSE_BPF_H
+
+#include "fuse_i.h"
+#include <linux/iomap.h>
+
+/* copied from darrick's iomap patchset */
+struct fuse_iomap_io {
+        uint64_t offset;        /* file offset of mapping, bytes */
+        uint64_t length;        /* length of mapping, bytes */
+        uint64_t addr;          /* disk offset of mapping, bytes */
+        uint16_t type;          /* FUSE_IOMAP_TYPE_* */
+        uint16_t flags;         /* FUSE_IOMAP_F_* */
+        uint32_t dev;           /* device cookie */
+};
+
+struct fuse_bpf_ops {
+       int (*iomap_begin)(struct inode *inode, loff_t pos,
+                          loff_t length, unsigned int flags,
+                          struct fuse_iomap_io *out_io__nullable);
+
+       /* Required for bpf struct_ops */
+       char name[16];
+};
+
+struct fuse_bpf_ops *fuse_get_bpf_ops(void);
+int fuse_bpf_init(void);
+
+#endif /* _FS_FUSE_BPF_H */
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 819e50d66622..78ae4425e863 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -7,6 +7,7 @@
 */

 #include "fuse_i.h"
+#include "fuse_bpf.h"
 #include "fuse_dev_i.h"
 #include "dev_uring_i.h"

@@ -662,6 +663,21 @@ static int fuse_statfs(struct dentry *dentry, struct kstatfs *buf)
        struct fuse_statfs_out outarg;
        int err;

+       printk("fuse_statfs() called!\n");
+
+       struct fuse_bpf_ops *bpf_ops = fuse_get_bpf_ops();
+       struct fuse_iomap_io out_io = {};
+
+       /* call BPF prog if attached */
+       if (bpf_ops && bpf_ops->iomap_begin) {
+               err = bpf_ops->iomap_begin(d_inode(dentry), 111, 222,
+                                          333, &out_io);
+               printk("bpf prog returned: err=%d, out_io->offset=%llu\n",
+                      err, out_io.offset);
+       } else {
+               printk("did not find a bpf prog\n");
+       }
+
        if (!fuse_allow_current_process(fm->fc)) {
                buf->f_type = FUSE_SUPER_MAGIC;
                return 0;
@@ -2194,6 +2210,12 @@ static int __init fuse_fs_init(void)
        if (!fuse_inode_cachep)
                goto out;

+       err = fuse_bpf_init();
+       if (err) {
+               printk("fuse_bpf_init() failed %d\n", err);
+               goto out2;
+       }
+
        err = register_fuseblk();
        if (err)
                goto out2;


These are the changes for the libfuse side:
https://github.com/joannekoong/libfuse/commit/1a6198f17dd215c93fd82ec020641c079aae1241

To run it, run "make clean; make" from libfuse/example, and then sudo
ninja from libfuse/build, and then
 sudo ~/libfuse/build/example/passthrough_hp ~/liburing ~/mounts/tmp2
--nopassthrough --foreground

Then from ~/mounts/tmp2, run "stat -f filename", which will show a few
things:

- on the kernel side, it'll print a statement like "bpf prog returned:
  err=0, out_io->offset=999", which shows that the prog can return back
  a "struct fuse_iomap_io" with all the requisite metadata filled out

- on the server side, if you run "sudo cat
  /sys/kernel/debug/tracing/trace_pipe", that should print out
  "bpf_trace_printk: fuse iomap_begin: inode=ffff8a75cbe63800 pos=111
  len=222 flags=333", which shows the prog can take in whatever
  pos/len/flags values you pass it from the fuse kernel side

>
> > and I was able to verify that iomap_begin() is able to return back
> > populated outarg fields from the bpf prog. If we were to actually
> > implement it I'm sure it'd be more complicated (eg we'd need to make
> > the fuse_bpf_ops registered per-connection, etc) but on the whole it
>
> What is a fuse_bpf_ops?  I'm assuming that's the attachment point for a
> bpf program that the fuse server would compile?  In which case, yes, I
> think that ought to be per-connection.
>
> So the bpf program can examine the struct inode, and the pos/len/opflags
> field; and from that information it has to write the appropriate fields
> in &outarg?  That's new, I didn't think bpf was allowed to write to
> kernel memory.  But it's been a few years since I last touched the bpf
> internals.

It's been a few years since I looked at bpf as well, but yes,
fuse_bpf_ops is basically the kernel-side struct_ops interface for
getting fuse to trigger the attached bpf program's callback
implementations. When the bpf program is loaded, its callback
functions get swapped in so that fuse_bpf_ops's function pointers
point to the bpf program's callbacks; invoking a fuse_bpf_ops callback
then calls into the bpf prog's implementation.

>
> Some bpf programs might just know how to fill out outarg on their own
> (e.g. famfs memory interleaving) but other bpf programs might perform a
> range query on some imaginary bpf-interval-tree wherein you can do a
> fast lookup based on (inumber, pos, len)?
>
> I guess that's an interesting question -- would each fuse connection
> have one big bpf-interval-tree?  Or would you shard things by inode to
> reduce contention?  And if you sharded like that, then would you need a
> fuse_bpf_ops per inode?

Hmm, the cache's tree is per-inode as I understand it, so probably each
inode would have its own tree / map?
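
If it stays bpf-map-based, one way to shard per inode without needing a
fuse_bpf_ops per inode might be a hash-of-maps keyed by nodeid. A rough
sketch (the sizes and the record layout are made up):

/* Per-inode sorted extent arrays via map-in-map, keyed by fuse nodeid. */
struct extent_rec {
	__u64 offset;	/* file offset, bytes */
	__u64 length;	/* mapping length, bytes */
	__u64 addr;	/* disk address, bytes */
};

struct inode_extents {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, 4096);	/* per-inode extent budget */
	__type(key, __u32);
	__type(value, struct extent_rec);
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
	__uint(max_entries, 65536);	/* open-inode budget */
	__type(key, __u64);		/* fuse nodeid */
	__array(values, struct inode_extents);
} per_inode_extents SEC(".maps");

The prog would bpf_map_lookup_elem() the outer map by nodeid to get the
inner array and then binary-search it, which would also cover the
sorted-array interim idea from earlier in the thread.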

>
> (I'm imagining that the fuse_bpf_ops might be where you'd stash the root
> of the bpf data structure, but I know nothing of bpf internals ;))
>
> Rolling on: how easy is it for a userspace program to compile and upload
> bpf programs into the kernel?  I've played around with bcc enough to
> write some fairly stupid latency tracing tools for XFS, but at the end
> of the day it's still python scripts feeding a string full of maybe-C into
> whatever the BPF does under the hood.

I found it pretty easy with the libbpf library, which generates the
skeletons and provides the API helpers for loading the program in (the
libfuse link I pasted above for the userspace side has the code for
compiling and loading it).
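
The loading boilerplate is basically just this (sketch; it assumes a
skeleton generated with "bpftool gen skeleton" from a fuse_bpf.bpf.c,
so the fuse_bpf__* names depend on the .bpf.c filename):

/* loader.c -- attach the struct_ops prog from the fuse server. */
#include "fuse_bpf.skel.h"

static int attach_fuse_bpf(void)
{
	struct fuse_bpf *skel;
	struct bpf_link *link;

	skel = fuse_bpf__open_and_load();
	if (!skel)
		return -1;

	/* Registers the ops map; this is what triggers fuse_bpf_reg(). */
	link = bpf_map__attach_struct_ops(skel->maps.demo_ops);
	if (!link) {
		fuse_bpf__destroy(skel);
		return -1;
	}

	/* Keep skel and link alive for as long as the ops should stay
	 * attached; destroying the link detaches the prog. */
	return 0;
}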

>
> I /think/ it calls clang on the provided text, links that against the
> current kernel's header files, and pushes the compiled bpf binary into
> the kernel, right?  So fuse4fs would have to learn how to do that; and
> now fuse4fs has a runtime dependency on libllvm.

I think the libbpf library takes care of a lot of that for you;
fuse4fs would just need to do the same thing as in that libfuse link
above.

>
> And while I'm on the topic of fuse-bpf uapi: It's ok for us to expose
> primitive-typed variables (pos/len/opflags) and existing fuse uapi
> directly to a bpf program, but I don't think we should expose struct
> inode/fuse_inode.  Maybe just fuse_inode::nodeid?  If we're careful not
> to allow #include'ing structured types in the fuse bpf code, then
> perhaps the bpf programs could be compiled at the same time as the fuse
> server.

I agree, I think if we do decide to go further with this approach,
we'll need to define exactly what the interfaces should be that would
be safe to expose.
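
Maybe something as narrow as this would do (made-up sketch):

/* A primitives-only context handed to fuse iomap bpf progs, instead of
 * exposing struct inode / struct fuse_inode directly. */
struct fuse_bpf_iomap_ctx {
	__u64 nodeid;	/* fuse_inode::nodeid, the only inode identity */
	__u64 pos;	/* request offset, bytes */
	__u64 length;	/* request length, bytes */
	__u32 opflags;	/* iomap op flags */
	__u32 padding;
};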

>
> > seems doable. My worry is that if we land the iomap cache patchset
> > now, then we can't remove it in the future without it being a
> > backwards-compatibility break / performance regression (though maybe
> > we can, since the fuse-iomap stuff is experimental?), so imo it'd be
> > great if
>
> I don't think it's a huge problem to remove functionality while the
> EXPERIMENTAL warnings are in place.  We'd forever lose the command codes
> for FUSE_NOTIFY_IOMAP_UPSERT and FUSE_NOTIFY_IOMAP_INVAL, but we've only
> used 12 out of INT_MAX so that's not likely to be a concern.
>
> > we figured out what direction we want to go before landing the cache
> > stuff. And I think we need to have this conversation too on the main
> > famfs patchset (eg whether it should go through your general iomap
> > plumbing with bpf helpers vs. being a separate implementation) since
> > once that lands, it'd be irrevocable.
>
> I'm of two minds on that -- John got here first, so I don't want to
> delay his patchset whilst I slowly work on this thing.  OTOH from an
> architecture standpoint we probably ought to push for three ways for a
> fuse server to upload mappings:

I think if John/Miklos wanted to go in this direction, all that would
be needed from your series is the first one or two patches that define
the basic fuse_iomap_io / fuse_iomap_begin / fuse_iomap_end structs
and init config plumbing.

Thanks,
Joanne

>
> 1. Upserting mappings with arbitrary offset and size into a cache
> 2. Self contained bpf program that can generate any mapping
> 3. Sprawling bpf program that can read any other artifacts that another
>    bpf program might have set up for it
>
> But yeah, let's involve John.
>
> --D
>
> >
> > Thanks,
> > Joanne
> > >
> > > --D
> > >
> > > > Thanks,
> > > > Joanne
> > > >
> > > > >
> > > > > <nod>
> > > > >
> > > > > --D
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Joanne
> > > > > >
> > > > > > >
> > > > > > > OTOH it would be enormously hilarious to me if one could load a file
> > > > > > > mapping predictive model into the kernel as a bpf program and use that
> > > > > > > as a first tier before checking the in-memory btree mapping cache from
> > > > > > > patchset 7.  Quite a few years ago now there was a FAST paper
> > > > > > > establishing that even a stupid linear regression model could in theory
> > > > > > > beat a disk btree lookup.
> > > > > > >
> > > > > > > --D
> > > > > > >
> > > > > > > > Thanks,
> > > > > > > > Joanne
> > > > > > > >
> > > > > > > > >
> > > > > > > > > If you're going to start using this code, I strongly recommend pulling
> > > > > > > > > from my git trees, which are linked below.
> > > > > > > > >
> > > > > > > > > This has been running on the djcloud for months with no problems.  Enjoy!
> > > > > > > > > Comments and questions are, as always, welcome.
> > > > > > > > >
> > > > > > > > > --D
> > > > > > > > >
> > > > > > > > > kernel git tree:
> > > > > > > > > https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-fileio
> > > > > > > > > ---
> > > > > > > > > Commits in this patchset:
> > > > > > > > >  * fuse: implement the basic iomap mechanisms
> > > > > > > > >  * fuse_trace: implement the basic iomap mechanisms
> > > > > > > > >  * fuse: make debugging configurable at runtime
> > > > > > > > >  * fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
> > > > > > > > >  * fuse_trace: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
> > > > > > > > >  * fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount
> > > > > > > > >  * fuse: create a per-inode flag for toggling iomap
> > > > > > > > >  * fuse_trace: create a per-inode flag for toggling iomap
> > > > > > > > >  * fuse: isolate the other regular file IO paths from iomap
> > > > > > > > >  * fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
> > > > > > > > >  * fuse_trace: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
> > > > > > > > >  * fuse: implement direct IO with iomap
> > > > > > > > >  * fuse_trace: implement direct IO with iomap
> > > > > > > > >  * fuse: implement buffered IO with iomap
> > > > > > > > >  * fuse_trace: implement buffered IO with iomap
> > > > > > > > >  * fuse: implement large folios for iomap pagecache files
> > > > > > > > >  * fuse: use an unrestricted backing device with iomap pagecache io
> > > > > > > > >  * fuse: advertise support for iomap
> > > > > > > > >  * fuse: query filesystem geometry when using iomap
> > > > > > > > >  * fuse_trace: query filesystem geometry when using iomap
> > > > > > > > >  * fuse: implement fadvise for iomap files
> > > > > > > > >  * fuse: invalidate ranges of block devices being used for iomap
> > > > > > > > >  * fuse_trace: invalidate ranges of block devices being used for iomap
> > > > > > > > >  * fuse: implement inline data file IO via iomap
> > > > > > > > >  * fuse_trace: implement inline data file IO via iomap
> > > > > > > > >  * fuse: allow more statx fields
> > > > > > > > >  * fuse: support atomic writes with iomap
> > > > > > > > >  * fuse_trace: support atomic writes with iomap
> > > > > > > > >  * fuse: disable direct reclaim for any fuse server that uses iomap
> > > > > > > > >  * fuse: enable swapfile activation on iomap
> > > > > > > > >  * fuse: implement freeze and shutdowns for iomap filesystems
> > > > > > > > > ---
> > > > > > > > >  fs/fuse/fuse_i.h          |  161 +++
> > > > > > > > >  fs/fuse/fuse_trace.h      |  939 +++++++++++++++++++
> > > > > > > > >  fs/fuse/iomap_i.h         |   52 +
> > > > > > > > >  include/uapi/linux/fuse.h |  219 ++++
> > > > > > > > >  fs/fuse/Kconfig           |   48 +
> > > > > > > > >  fs/fuse/Makefile          |    1
> > > > > > > > >  fs/fuse/backing.c         |   12
> > > > > > > > >  fs/fuse/dev.c             |   30 +
> > > > > > > > >  fs/fuse/dir.c             |  120 ++
> > > > > > > > >  fs/fuse/file.c            |  133 ++-
> > > > > > > > >  fs/fuse/file_iomap.c      | 2230 +++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > > >  fs/fuse/inode.c           |  162 +++
> > > > > > > > >  fs/fuse/iomode.c          |    2
> > > > > > > > >  fs/fuse/trace.c           |    2
> > > > > > > > >  14 files changed, 4056 insertions(+), 55 deletions(-)
> > > > > > > > >  create mode 100644 fs/fuse/iomap_i.h
> > > > > > > > >  create mode 100644 fs/fuse/file_iomap.c
> > > > > > > > >
> > > > > > > >
> > > >
> >
