linux-ext4 - Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4

Message-ID: <CAOQ4uxgUVOLs070MyBpfodt12E0zjUn_SvyaCSJcm_M3SW36Ug@mail.gmail.com>
Date: Tue, 10 Jun 2025 12:59:36 +0200
From: Amir Goldstein <amir73il@...il.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: linux-fsdevel <linux-fsdevel@...r.kernel.org>, John@...ves.net, bernd@...ernd.com, 
	miklos@...redi.hu, joannelkoong@...il.com, Josef Bacik <josef@...icpanda.com>, 
	linux-ext4 <linux-ext4@...r.kernel.org>, "Theodore Ts'o" <tytso@....edu>
Subject: Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can
 containerize ext4

On Tue, Jun 10, 2025 at 12:32 AM Darrick J. Wong <djwong@...nel.org> wrote:
>
> On Thu, May 29, 2025 at 09:41:23PM +0200, Amir Goldstein wrote:
> >  or
> >
> > On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@...nel.org> wrote:
> > >
> > > On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> > > > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@...nel.org> wrote:
> > > > >
> > > > > Hi everyone,
> > > > >
> > > > > DO NOT MERGE THIS.
> > > > >
> > > > > This is the very first request for comments of a prototype to connect
> > > > > the Linux fuse driver to fs-iomap for regular file IO operations to and
> > > > > from files whose contents persist to locally attached storage devices.
> > > > >
> > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > server process.
> > > > >
> > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > because you have to understand every filesystem's bespoke use of that
> > > > > core code.  Eeeugh.
> > > > >
> > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> > > > > to the fuse server via a trio of new fuse commands.  This is suitable
> > > > > for very simple filesystems that don't do tricky things with mappings
> > > > > (e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
> > > > > but solving that is for the next sprint.
> > > > >
> > > > > > With this overly simplistic RFC, I aim to show that it's possible to
> > > > > build a fuse server for a real filesystem (ext4) that runs entirely in
> > > > > userspace yet maintains most of its performance.  At this early stage I
> > > > > get about 95% of the kernel ext4 driver's streaming directio performance
> > > > > on streaming IO, and 110% of its streaming buffered IO performance.
> > > > > Random buffered IO suffers a 90% hit on writes due to unwritten extent
> > > > > conversions.  Random direct IO is about 60% as fast as the kernel; see
> > > > > the cover letter for the fuse2fs iomap changes for more details.
> > > > >
> > > >
> > > > Very cool!
> > > >
> > > > > There are some major warts remaining:
> > > > >
> > > > > 1. The iomap cookie validation is not present, which can lead to subtle
> > > > > races between pagecache zeroing and writeback on filesystems that
> > > > > support unwritten and delalloc mappings.
> > > > >
> > > > > 2. Mappings ought to be cached in the kernel for more speed.
> > > > >
> > > > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> > > > > yet figured out how inline data is supposed to work.
> > > > >
> > > > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> > > > > which currently isn't possible because the kernel fuse driver will iget
> > > > > inodes prior to calling FUSE_GETATTR to discover the properties of the
> > > > > inode it just read.
> > > >
> > > > Can you make the decision about enabling iomap on lookup?
> > > > The plan for passthrough for inode operations was to allow
> > > > setting up passthrough config of inode on lookup.
> > >
> > > The main requirement (especially for buffered IO) is that we've set the
> > > address space operations structure either to the regular fuse one or to
> > > the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
> > > code assumes that cannot change on a live inode.
> > >
> > > So I /think/ we could ask the fuse server at inode instantiation time
> > > (which, if I'm reading the code correctly, is when iget5_locked gives
> > > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> > > to userspace at that time.  Alternately I guess we could extend struct
> > > fuse_attr with another FUSE_ATTR_ flag, I think?
> > >
> >
> > The latter. Either extend fuse_attr or struct fuse_entry_out,
> > which is in the responses of FUSE_LOOKUP,
> > FUSE_READDIRPLUS, FUSE_CREATE and FUSE_TMPFILE,
> > which instantiate fuse inodes.
> >
> > There is a very hand wavy discussion about this at:
> > https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/
> >
> > In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE
> > command that uses the variable length file handle instead of nodeid
> > as a key for the inode.
> >
> > So we will have to extend fuse_entry_out anyway, but TBH I never got to
> > look at the gritty details of how best to extend all the relevant commands,
> > so I hope I am not sending you down the wrong path.
>
> I found another twist to this story: the upper level libfuse3 library
> assigns distinct nodeids for each directory entry.  These nodeids are
> passed into the kernel and appear to be the basis for an iget5_locked call.
> IOWs, each nodeid causes a struct fuse_inode to be created in the
> kernel.
>
> For a single-linked file this is no big deal, but for a hardlink this
> makes iomap a mess because this means that in fuse2fs, an ext2 inode can
> map to multiple kernel fuse_inode objects.  This /really/ breaks the
> locking model of iomap, which assumes that there's one in-kernel inode
> and that it can use i_rwsem to synchronize updates.
>
> So I'm going to have to find a way to deal with this.  I tried trivially
> messing with libfuse nodeid assignment but that blew some assertion.
> Maybe your LOOKUP_HANDLE thing would work.
>

Pull the emergency brake!

In an amateur move, I did not look at fuse2fs.c before commenting on your
work.

The high-level fuse interface is not the right tool for the job.
It's not even the easiest way to have written fuse2fs in the first place.

The high-level fuse API addresses filesystem objects by full path.
This is good for writing simple virtual filesystems, but it is neither the
correct nor the easiest choice for writing a userspace driver for ext4.

The low-level fuse interface addresses filesystem objects by nodeid
and requires the server to implement lookup(parent_nodeid, name),
where the server (not libfuse) gets to choose the nodeid.

The current fuse2fs code has to go to some effort to convert a full path
to inode + name using ext2fs_namei().

With the low-level fuse interface, op_lookup() could have used the native
ext2_lookup(), which would have been much more natural.
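
A minimal sketch of the difference, under loud assumptions: this is a toy
model, not code from fuse2fs or libfuse. The table toy_dir and the
functions toy_lookup_by_ino() and toy_lookup_per_entry() are hypothetical
names made up for illustration. It shows why a server that chooses
nodeid == inode number gives a hard-linked file a single nodeid, while
assigning one nodeid per directory entry gives it two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy directory entry: (parent inode, name) -> inode number. */
struct toy_dirent {
	uint32_t parent_ino;
	const char *name;
	uint32_t ino;
};

/* "b" is a hard link to "a": two names in directory 2, one inode (12). */
static const struct toy_dirent toy_dir[] = {
	{ 2, "a", 12 },
	{ 2, "b", 12 },
	{ 2, "c", 13 },
};

#define TOY_NR_DIRENTS (sizeof(toy_dir) / sizeof(toy_dir[0]))

/*
 * Low-level style lookup: the server picks the nodeid, and in this
 * sketch it simply reuses the inode number.  Returns 0 if not found.
 */
static uint64_t toy_lookup_by_ino(uint64_t parent_nodeid, const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < TOY_NR_DIRENTS; i++)
		if (toy_dir[i].parent_ino == parent_nodeid &&
		    strcmp(toy_dir[i].name, name) == 0)
			return toy_dir[i].ino;
	return 0;
}

/*
 * Per-directory-entry assignment: every name gets its own nodeid, so
 * one inode with two names ends up with two nodeids.  The parent is
 * still identified by inode number here to keep the model small.
 */
static uint64_t toy_lookup_per_entry(uint64_t parent_nodeid, const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < TOY_NR_DIRENTS; i++)
		if (toy_dir[i].parent_ino == parent_nodeid &&
		    strcmp(toy_dir[i].name, name) == 0)
			return 100 + i;
	return 0;
}
```

In this model, looking up "a" and "b" in directory 2 with
toy_lookup_by_ino() yields the same nodeid, so the kernel would see one
inode and one i_rwsem; toy_lookup_per_entry() yields two different
nodeids for the same file.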

You can find the most featureful low-level fuse example at:
https://github.com/libfuse/libfuse/blob/master/example/passthrough_hp.cc

Among other things, the server has an inode cache, where an inode
has in its state 'nopen' (was this inode opened for io) and 'backing_id'
(was this inode mapped for kernel passthrough).

Currently this backing_id mapping is only made on first open of inode,
but the plan is to do that also at lookup time, for example, if the
iomap mode for the inode can be determined at lookup time.
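
A minimal sketch of that per-inode state, assuming nothing about the real
passthrough_hp.cc beyond the two field names quoted above. The struct
toy_inode, the helpers toy_open() and toy_release(), and the counter that
stands in for registering a backing file with the kernel are all
hypothetical. Dropping the mapping on last release is an assumption of
this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cached inode state kept by the server. */
struct toy_inode {
	uint64_t ino;		/* filesystem inode number */
	unsigned int nopen;	/* how many times the inode is open for io */
	int backing_id;		/* 0 means "not mapped for passthrough" */
};

/* Stand-in for asking the kernel for a backing id; not a real API. */
static int toy_next_backing_id = 1;

static int toy_register_backing(const struct toy_inode *inode)
{
	assert(inode != NULL);
	return toy_next_backing_id++;
}

/* Set up the mapping only on the first open of the inode. */
static void toy_open(struct toy_inode *inode)
{
	assert(inode != NULL);
	if (inode->nopen++ == 0 && inode->backing_id == 0)
		inode->backing_id = toy_register_backing(inode);
}

/* Assumption of this sketch: drop the mapping on the last release. */
static void toy_release(struct toy_inode *inode)
{
	assert(inode != NULL && inode->nopen > 0);
	if (--inode->nopen == 0)
		inode->backing_id = 0;
}
```

Moving the toy_register_backing() call from toy_open() into the lookup
path would model the planned behaviour of deciding the mapping at lookup
time.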


> > > > > 5. ext4 doesn't support out of place writes so I don't know if that
> > > > > actually works correctly.
> > > > >
> > > > > 6. iomap is an inode-based service, not a file-based service.  This
> > > > > means that we /must/ push ext2's inode numbers into the kernel via
> > > > > FUSE_GETATTR so that it can report those same numbers back out through
> > > > > the FUSE_IOMAP_* calls.  However, the fuse kernel uses a separate nodeid
> > > > > to index its incore inode, so we have to pass those too so that
> > > > > notifications work properly.
> > > > >
> > > >
> > > > Again, I might be missing something, but as long as the fuse filesystem
> > > > is exposing a single backing filesystem, it should be possible to make
> > > > sure (via opt-in) that fuse nodeid's are equivalent to the backing fs
> > > > inode number.
> > > > See sketch in this WIP branch:
> > > > https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575
> > >
> > > I think this would work in many places, except for filesystems with
> > > 64-bit inumbers on 32-bit machines.  That might be a good argument for
> > > continuing to pass along the nodeid and fuse_inode::orig_ino like it
> > > does now.  Plus there are some filesystems that synthesize inode numbers
> > > so tying the two together might not be feasible/desirable anyway.
> > >
> > > Though one nice feature of letting fuse have its own nodeids might be
> > > that if the in-memory index switches to a tree structure, then it could
> > > be more compact if the filesystem's inumbers are fairly sparse like xfs.
> > > OTOH the current inode hashtable has been around for a very long time so
> > > that might not be a big concern.  For fuse2fs it doesn't matter since
> > > ext4 inumbers are u32.
> > >
> >
> > I wanted to see if declaring one-to-one 64bit ino can simplify things
> > for the first version of inode ops passthrough.
> > If this is not the case, or if this is too much of a limitation for
> > your use case
> > then nevermind.
> > But if it is a good enough shortcut for the demo and can be extended later,
> > then why not.
>
> It's very tempting, because it's very confusing to have nodeids and
> stat st_ino not be the same thing.
>

Now that I have explained that fuse2fs should be low-level, it follows
that it should have no problem declaring to the kernel, via the
FUSE_PASSTHROUGH_INO flag, that nodeid == st_ino, because I see no
reason to implement fuse2fs with a non one-to-one mapping of
ino <==> nodeid.

Thanks,
Amir.