Message-ID: <20260127232125.GA5966@frogsfrogsfrogs>
Date: Tue, 27 Jan 2026 15:21:25 -0800
From: "Darrick J. Wong" <djwong@...nel.org>
To: Joanne Koong <joannelkoong@...il.com>
Cc: miklos@...redi.hu, bernd@...ernd.com, neal@...pa.dev,
	linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCHSET v6 4/8] fuse: allow servers to use iomap for better
 file IO performance

On Tue, Jan 27, 2026 at 11:47:31AM -0800, Joanne Koong wrote:
> On Mon, Jan 26, 2026 at 6:22 PM Darrick J. Wong <djwong@...nel.org> wrote:
> >
> > On Mon, Jan 26, 2026 at 04:59:16PM -0800, Joanne Koong wrote:
> > > On Tue, Oct 28, 2025 at 5:38 PM Darrick J. Wong <djwong@...nel.org> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > This series connects fuse (the userspace filesystem layer) to fs-iomap
> > > > to get fuse servers out of the business of handling file I/O themselves.
> > > > By keeping the IO path mostly within the kernel, we can dramatically
> > > > improve the speed of disk-based filesystems.  This enables us to move
> > > > all the filesystem metadata parsing code out of the kernel and into
> > > > userspace, which means that we can containerize them for security
> > > > without losing a lot of performance.
> > >
> > > I haven't looked through how the fuse2fs or fuse4fs servers are
> > > implemented yet (also, could you explain the difference between the
> > > two? Which one should we look at to see how it all ties together?),
> >
> > fuse4fs is a lowlevel fuse server; fuse2fs is a high(?) level fuse
> > server.  fuse4fs is the successor to fuse2fs, at least on Linux and BSD.
> 
> Ah I see, thanks for the explanation. In that case, I'll just look at
> fuse4fs then.
> 
> >
> > > but I wonder if having bpf infrastructure hooked up to fuse would be
> > > especially helpful for what you're doing here with fuse iomap. afaict,
> > > every read/write whether it's buffered or direct will incur at least 1
> > > call to ->iomap_begin() to get the mapping metadata, which will be 2
> > > context-switches (and if the server has ->iomap_end() implemented,
> > > then 2 more context-switches).
> >
> > Yes, I agree that's a lot of context switching for file IO...
> >
> > > But it seems like the logic for retrieving mapping
> > > offsets/lengths/metadata should be pretty straightforward?
> >
> > ...but it gets very cheap if the fuse server can cache mappings in the
> > kernel to avoid all that.  That is, incidentally, what patchset #7
> > implements.
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-cache_2026-01-22
> >
> > > If the extent lookups are table lookups or tree
> > > traversals without complex side effects, then having
> > > ->iomap_begin()/->iomap_end() be executed as a bpf program would avoid
> > > the context switches and allow all the caching logic to be moved from
> > > the kernel to the server-side (eg using bpf maps).
> >
> > Hrmm.  Now that /is/ an interesting proposal.  Does BPF have a data
> > structure that supports interval mappings?  I think the existing bpf map
> 
> Not yet, but I don't see why a B+-tree-like data structure couldn't be
> added. Maybe one workaround in the meantime that could work is using a
> sorted array map and doing binary search on that, until interval
> mappings can be natively supported?

I guess, though I already had a C structure to borrow from xfs ;)
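
(Illustration only, not code from the patchset or from any bpf program: a
minimal sketch of the "sorted array plus binary search" lookup suggested
above, written in plain Python rather than bpf.  The Extent type, its field
names, and lookup() are hypothetical names made up for this sketch.  It
assumes extents are sorted by file offset and do not overlap.)

```python
import bisect
from typing import List, NamedTuple, Optional


class Extent(NamedTuple):
    """One file mapping: [file_off, file_off + length) -> dev_addr."""
    file_off: int   # logical start offset in the file, in bytes
    length: int     # length of the mapping, in bytes
    dev_addr: int   # start address on the backing device, in bytes


def lookup(extents: List[Extent], starts: List[int], pos: int) -> Optional[Extent]:
    """Return the extent covering byte `pos`, or None if `pos` is in a hole.

    `starts` must be the sorted list of extents[i].file_off, kept in sync
    with `extents`; it is the key array that gets binary searched.
    """
    # Index of the last extent whose start is <= pos.
    i = bisect.bisect_right(starts, pos) - 1
    if i < 0:
        return None             # pos is before the first mapping
    ext = extents[i]
    if pos >= ext.file_off + ext.length:
        return None             # pos falls in the gap after this extent
    return ext
```

For example, with extents covering [0, 4096) and [8192, 16384), a lookup at
offset 12000 returns the second extent and a lookup at offset 4096 returns
None (a hole).  Each lookup costs O(log n) comparisons, but inserting a
mapping into the middle of a sorted array is O(n), which is one reason a
tree structure is usually preferred for mappings that change often.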

> > only does key -> value.  Also, is there an upper limit on the size of a
> > map?  You could have hundreds of millions of mappings for a very
> > fragmented regular file.
> 
> If I'm remembering correctly, there's an upper limit on the number of
> map entries, which is bounded by u32.

That's problematic, since files can have 64-bit logical block numbers.
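
(Illustration only: a back-of-the-envelope check of how a u32 entry limit
compares with a badly fragmented file.  It assumes the cap is 2^32 - 1
entries, as recalled above, and that the worst case is one mapping per
filesystem block.  The function names and the 4 KiB block size are
assumptions for this sketch, not values from the patchset.)

```python
U32_MAX = 2**32 - 1   # assumed cap on the number of map entries


def worst_case_mappings(file_size: int, block_size: int) -> int:
    """Mappings needed if every block of the file is its own extent."""
    return -(-file_size // block_size)   # ceiling division


def fits_in_map(file_size: int, block_size: int, max_entries: int = U32_MAX) -> bool:
    """True if even the fully fragmented case fits within max_entries."""
    return worst_case_mappings(file_size, block_size) <= max_entries


TIB = 2**40
# A 16 TiB file with 4 KiB blocks can need 2^44 / 2^12 = 2^32 mappings,
# which is one more than a u32 entry count can express.
print(worst_case_mappings(16 * TIB, 4096), fits_in_map(16 * TIB, 4096))
```

Under these assumptions a fully fragmented 1 TiB file (2^28 mappings) still
fits, while a 16 TiB one does not.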

> > At one point I suggested to the famfs maintainer that it might be
> > easier/better to implement the interleaved mapping lookups as bpf
> > programs instead of being stuck with a fixed format in the fuse
> > userspace abi, but I don't know if he ever implemented that.
> 
> This seems like a good use case for it too
> >
> > > Is this your
> > > assessment of it as well or do you think the server-side logic for
> > > iomap_begin()/iomap_end() is too complicated to make this realistic?
> > > Asking because I'm curious whether this direction makes sense, not
> > > because I think it would be a blocker for your series.
> >
> > For disk-based filesystems I think it would be difficult to model a bpf
> > program to do mappings, since they can basically point anywhere and be
> > of any size.
> 
> Hmm I'm not familiar enough with disk-based filesystems to know what
> the "point anywhere and be of any size" means. For the mapping stuff,
> doesn't it just point to a block number? Or are you saying the problem
> would be there's too many mappings since a mapping could be any size?

The second -- mappings can be any size, and unprivileged userspace can
control the mappings.

> I was thinking the issue would be more that there might be other logic
> inside ->iomap_begin()/->iomap_end() besides the mapping stuff that
> would need to be done that would be too out-of-scope for bpf. But I
> think I need to read through the fuse4fs stuff to understand more what
> it's doing in those functions.

<nod>

--D

> 
> Thanks,
> Joanne
> 
> >
> > OTOH it would be enormously hilarious to me if one could load a file
> > mapping predictive model into the kernel as a bpf program and use that
> > as a first tier before checking the in-memory btree mapping cache from
> > patchset 7.  Quite a few years ago now there was a FAST paper
> > establishing that even a stupid linear regression model could in theory
> > beat a disk btree lookup.
> >
> > --D
> >
> > > Thanks,
> > > Joanne
> > >
> > > >
> > > > If you're going to start using this code, I strongly recommend pulling
> > > > from my git trees, which are linked below.
> > > >
> > > > This has been running on the djcloud for months with no problems.  Enjoy!
> > > > Comments and questions are, as always, welcome.
> > > >
> > > > --D
> > > >
> > > > kernel git tree:
> > > > https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=fuse-iomap-fileio
> > > > ---
> > > > Commits in this patchset:
> > > >  * fuse: implement the basic iomap mechanisms
> > > >  * fuse_trace: implement the basic iomap mechanisms
> > > >  * fuse: make debugging configurable at runtime
> > > >  * fuse: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
> > > >  * fuse_trace: adapt FUSE_DEV_IOC_BACKING_{OPEN,CLOSE} to add new iomap devices
> > > >  * fuse: flush events and send FUSE_SYNCFS and FUSE_DESTROY on unmount
> > > >  * fuse: create a per-inode flag for toggling iomap
> > > >  * fuse_trace: create a per-inode flag for toggling iomap
> > > >  * fuse: isolate the other regular file IO paths from iomap
> > > >  * fuse: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
> > > >  * fuse_trace: implement basic iomap reporting such as FIEMAP and SEEK_{DATA,HOLE}
> > > >  * fuse: implement direct IO with iomap
> > > >  * fuse_trace: implement direct IO with iomap
> > > >  * fuse: implement buffered IO with iomap
> > > >  * fuse_trace: implement buffered IO with iomap
> > > >  * fuse: implement large folios for iomap pagecache files
> > > >  * fuse: use an unrestricted backing device with iomap pagecache io
> > > >  * fuse: advertise support for iomap
> > > >  * fuse: query filesystem geometry when using iomap
> > > >  * fuse_trace: query filesystem geometry when using iomap
> > > >  * fuse: implement fadvise for iomap files
> > > >  * fuse: invalidate ranges of block devices being used for iomap
> > > >  * fuse_trace: invalidate ranges of block devices being used for iomap
> > > >  * fuse: implement inline data file IO via iomap
> > > >  * fuse_trace: implement inline data file IO via iomap
> > > >  * fuse: allow more statx fields
> > > >  * fuse: support atomic writes with iomap
> > > >  * fuse_trace: support atomic writes with iomap
> > > >  * fuse: disable direct reclaim for any fuse server that uses iomap
> > > >  * fuse: enable swapfile activation on iomap
> > > >  * fuse: implement freeze and shutdowns for iomap filesystems
> > > > ---
> > > >  fs/fuse/fuse_i.h          |  161 +++
> > > >  fs/fuse/fuse_trace.h      |  939 +++++++++++++++++++
> > > >  fs/fuse/iomap_i.h         |   52 +
> > > >  include/uapi/linux/fuse.h |  219 ++++
> > > >  fs/fuse/Kconfig           |   48 +
> > > >  fs/fuse/Makefile          |    1
> > > >  fs/fuse/backing.c         |   12
> > > >  fs/fuse/dev.c             |   30 +
> > > >  fs/fuse/dir.c             |  120 ++
> > > >  fs/fuse/file.c            |  133 ++-
> > > >  fs/fuse/file_iomap.c      | 2230 +++++++++++++++++++++++++++++++++++++++++++++
> > > >  fs/fuse/inode.c           |  162 +++
> > > >  fs/fuse/iomode.c          |    2
> > > >  fs/fuse/trace.c           |    2
> > > >  14 files changed, 4056 insertions(+), 55 deletions(-)
> > > >  create mode 100644 fs/fuse/iomap_i.h
> > > >  create mode 100644 fs/fuse/file_iomap.c
> > > >
> > >
