Message-ID: <20250611060040.GC6138@frogsfrogsfrogs>
Date: Tue, 10 Jun 2025 23:00:40 -0700
From: "Darrick J. Wong" <djwong@...nel.org>
To: Amir Goldstein <amir73il@...il.com>
Cc: linux-fsdevel <linux-fsdevel@...r.kernel.org>, John@...ves.net,
	bernd@...ernd.com, miklos@...redi.hu, joannelkoong@...il.com,
	Josef Bacik <josef@...icpanda.com>,
	linux-ext4 <linux-ext4@...r.kernel.org>,
	Theodore Ts'o <tytso@....edu>
Subject: Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can
 containerize ext4

On Tue, Jun 10, 2025 at 09:51:55PM +0200, Amir Goldstein wrote:
> On Tue, Jun 10, 2025 at 9:00 PM Darrick J. Wong <djwong@...nel.org> wrote:
> >
> > On Tue, Jun 10, 2025 at 12:59:36PM +0200, Amir Goldstein wrote:
> > > On Tue, Jun 10, 2025 at 12:32 AM Darrick J. Wong <djwong@...nel.org> wrote:
> > > >
> > > > On Thu, May 29, 2025 at 09:41:23PM +0200, Amir Goldstein wrote:
> > > > >  or
> > > > >
> > > > > On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@...nel.org> wrote:
> > > > > >
> > > > > > On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote:
> > > > > > > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@...nel.org> wrote:
> > > > > > > >
> > > > > > > > Hi everyone,
> > > > > > > >
> > > > > > > > DO NOT MERGE THIS.
> > > > > > > >
> > > > > > > > This is the very first request for comments of a prototype to connect
> > > > > > > > the Linux fuse driver to fs-iomap for regular file IO operations to and
> > > > > > > > from files whose contents persist to locally attached storage devices.
> > > > > > > >
> > > > > > > > Why would you want to do that?  Most filesystem drivers are seriously
> > > > > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
> > > > > > > > over almost a decade of its existence.  Faulty code can lead to total
> > > > > > > > kernel compromise, and I think there's a very strong incentive to move
> > > > > > > > all that parsing out to userspace where we can containerize the fuse
> > > > > > > > server process.
> > > > > > > >
> > > > > > > > willy's folios conversion project (and to a certain degree RH's new
> > > > > > > > mount API) have also demonstrated that treewide changes to the core
> > > > > > > > mm/pagecache/fs code are very very difficult to pull off and take years
> > > > > > > > because you have to understand every filesystem's bespoke use of that
> > > > > > > > core code.  Eeeugh.
> > > > > > > >
> > > > > > > > The fuse command plumbing is very simple -- the ->iomap_begin,
> > > > > > > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls
> > > > > > > > to the fuse server via a trio of new fuse commands.  This is suitable
> > > > > > > > for very simple filesystems that don't do tricky things with mappings
> > > > > > > > (e.g. FAT/HFS) during writeback.  This isn't quite adequate for ext4,
> > > > > > > > but solving that is for the next sprint.
> > > > > > > >
> > > > > > > > > With this overly simplistic RFC, I aim to show that it's possible to
> > > > > > > > build a fuse server for a real filesystem (ext4) that runs entirely in
> > > > > > > > userspace yet maintains most of its performance.  At this early stage I
> > > > > > > > > get about 95% of the kernel ext4 driver's directio performance
> > > > > > > > > on streaming IO, and 110% of its streaming buffered IO performance.
> > > > > > > > Random buffered IO suffers a 90% hit on writes due to unwritten extent
> > > > > > > > conversions.  Random direct IO is about 60% as fast as the kernel; see
> > > > > > > > the cover letter for the fuse2fs iomap changes for more details.
> > > > > > > >
> > > > > > >
> > > > > > > Very cool!
> > > > > > >
> > > > > > > > There are some major warts remaining:
> > > > > > > >
> > > > > > > > 1. The iomap cookie validation is not present, which can lead to subtle
> > > > > > > > races between pagecache zeroing and writeback on filesystems that
> > > > > > > > support unwritten and delalloc mappings.
> > > > > > > >
> > > > > > > > 2. Mappings ought to be cached in the kernel for more speed.
> > > > > > > >
> > > > > > > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't
> > > > > > > > yet figured out how inline data is supposed to work.
> > > > > > > >
> > > > > > > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis,
> > > > > > > > which currently isn't possible because the kernel fuse driver will iget
> > > > > > > > inodes prior to calling FUSE_GETATTR to discover the properties of the
> > > > > > > > inode it just read.
> > > > > > >
> > > > > > > Can you make the decision about enabling iomap on lookup?
> > > > > > > The plan for passthrough for inode operations was to allow
> > > > > > > > setting up passthrough config of inode on lookup.
> > > > > >
> > > > > > The main requirement (especially for buffered IO) is that we've set the
> > > > > > address space operations structure either to the regular fuse one or to
> > > > > > the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c
> > > > > > code assumes that cannot change on a live inode.
> > > > > >
> > > > > > So I /think/ we could ask the fuse server at inode instantiation time
> > > > > > (which, if I'm reading the code correctly, is when iget5_locked gives
> > > > > > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall
> > > > > > to userspace at that time.  Alternately I guess we could extend struct
> > > > > > fuse_attr with another FUSE_ATTR_ flag, I think?
> > > > > >
> > > > >
> > > > > The latter. Either extend fuse_attr or struct fuse_entry_out,
> > > > > which is in the responses of FUSE_LOOKUP,
> > > > > > FUSE_READDIRPLUS, FUSE_CREATE, FUSE_TMPFILE,
> > > > > > which instantiate fuse inodes.
> > > > >
> > > > > There is a very hand wavy discussion about this at:
> > > > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/
> > > > >
> > > > > In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE
> > > > > command that uses the variable length file handle instead of nodeid
> > > > > as a key for the inode.
> > > > >
> > > > > So we will have to extend fuse_entry_out anyway, but TBH I never got to
> > > > > look at the gritty details of how best to extend all the relevant commands,
> > > > > so I hope I am not sending you down the wrong path.
> > > >
> > > > I found another twist to this story: the upper level libfuse3 library
> > > > assigns distinct nodeids for each directory entry.  These nodeids are
> > > > > passed into the kernel and appear to be the basis for an iget5_locked call.
> > > > IOWs, each nodeid causes a struct fuse_inode to be created in the
> > > > kernel.
> > > >
> > > > For a single-linked file this is no big deal, but for a hardlink this
> > > > makes iomap a mess because this means that in fuse2fs, an ext2 inode can
> > > > map to multiple kernel fuse_inode objects.  This /really/ breaks the
> > > > locking model of iomap, which assumes that there's one in-kernel inode
> > > > and that it can use i_rwsem to synchronize updates.
> > > >
> > > > So I'm going to have to find a way to deal with this.  I tried trivially
> > > > > messing with libfuse nodeid assignment but that blew some assertion.
> > > > Maybe your LOOKUP_HANDLE thing would work.
> > > >
> > >
> > > > Pull the emergency brake!
> > >
> > > > In an amateur move, I did not look at fuse2fs.c before commenting on your
> > > work.
> > >
> > > High level fuse interface is not the right tool for the job.
> > > It's not even the easiest way to have written fuse2fs in the first place.
> >
> > At the time I thought it would minimize friction across multiple
> > operating systems' fuse implementations.
> >
> > > High-level fuse API addresses file system objects with full paths.
> > > > This is good for writing simple virtual filesystems, but it is neither the
> > > > correct nor the easiest choice to write a userspace driver for ext4.
> >
> > Agreed, it's a *terrible* way to implement ext4.
> >
> > I think, however, that Ted would like to maintain compatibility with
> > macfuse and freebsd(?) so he's been resistant to rewriting the entire
> > program to work with the lowlevel library.
> >
> > That said, I decided just now to do some spelunking into those two fuse
> > ports and have discovered that freebsd[1] packages the same upstream
> > libfuse as linux, and macfuse[2] seems to vendor both libfuse 2 and 3.
> >
> > [1] https://wiki.freebsd.org/FUSEFS
> > [2] https://github.com/macfuse/macfuse
> >
> > Seeing as Debian 13 has killed off libfuse2 entirely, maybe I should
> > think about rewriting all of fuse2fs against the lowlevel library?  It's
> > really annoying to deal with all the problems of the current codebase.
> > I think I'll try to stabilize the current fuse+iomap code and then look
> > into a fuse2fs port.  What would we call it, fuse4fs? :D
> >
> > > Low-level fuse interface addresses filesystem objects by nodeid
> > > and requires the server to implement lookup(parent_nodeid, name)
> > > where the server gets to choose the nodeid (not libfuse).
> >
> > Does the nodeid for the root directory have to be FUSE_ROOT_ID?
> 
> Yeh, I think that's the case, otherwise FUSE_INIT would need to
> tell the kernel the root nodeid, because there is no lookup to
> return the root nodeid.
> 
> > I guess
> > for ext4 that's not a big deal since ext2 inode #1 is the badblocks file
> > which cannot be accessed from userspace anyway.
> >
> 
> As long as inode #1 is reserved it should be fine.
> just need to refine the rules of the one-to-one mapping with
> this exception.

Or just make it so that passthrough_ino filesystems can specify the
rootdir inumber?
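
For reference, the one-to-one mapping with a root exception that Amir
describes could look roughly like the sketch below.  This is only an
illustration under stated assumptions: the sketch_* names are
hypothetical and do not exist in fuse2fs or libfuse, and the constants
mirror FUSE_ROOT_ID (1), the ext2 root directory inode (2), and the
ext2 badblocks inode (1).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Local copies of well-known values so the sketch is self-contained. */
#define SKETCH_FUSE_ROOT_ID	1ULL	/* nodeid the kernel uses for the root */
#define SKETCH_EXT2_BAD_INO	1U	/* badblocks file, never exposed */
#define SKETCH_EXT2_ROOT_INO	2U	/* ext2 root directory */

/*
 * Hypothetical helper: map an ext2 inode number to a fuse nodeid.
 * Returns 0 (never a valid nodeid) for inodes that must not be exposed.
 */
static uint64_t sketch_ino_to_nodeid(uint32_t ino)
{
	if (ino == 0 || ino == SKETCH_EXT2_BAD_INO)
		return 0;
	if (ino == SKETCH_EXT2_ROOT_INO)
		return SKETCH_FUSE_ROOT_ID;	/* the one exception */
	return ino;				/* otherwise nodeid == ino */
}

/*
 * Hypothetical helper: map a fuse nodeid back to an ext2 inode number.
 * Returns false if the nodeid cannot name an ext2 inode under this
 * scheme.
 */
static bool sketch_nodeid_to_ino(uint64_t nodeid, uint32_t *ino)
{
	assert(ino != NULL);

	if (nodeid == SKETCH_FUSE_ROOT_ID) {
		*ino = SKETCH_EXT2_ROOT_INO;
		return true;
	}
	/* nodeid 2 is never handed out because ino 2 is reported as 1. */
	if (nodeid == 0 || nodeid == SKETCH_EXT2_ROOT_INO ||
	    nodeid > UINT32_MAX)
		return false;
	*ino = (uint32_t)nodeid;
	return true;
}
```

Because ext2 inode #1 is reserved, reusing nodeid 1 for the root does
not collide with any file that userspace can reach, so each ext2 inode
still gets exactly one nodeid regardless of how many hardlinks it has.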

--D

> Thanks,
> Amir.
> 
