Message-ID: <20250916000759.GA8080@frogsfrogsfrogs>
Date: Mon, 15 Sep 2025 17:07:59 -0700
From: "Darrick J. Wong" <djwong@...nel.org>
To: linux-fsdevel <linux-fsdevel@...r.kernel.org>
Cc: Miklos Szeredi <miklos@...redi.hu>, Bernd Schubert <bernd@...ernd.com>,
	Joanne Koong <joannelkoong@...il.com>,
	John Groves <John@...ves.net>, Josef Bacik <josef@...icpanda.com>,
	linux-ext4 <linux-ext4@...r.kernel.org>,
	Theodore Ts'o <tytso@....edu>, Neal Gompa <neal@...pa.dev>,
	Amir Goldstein <amir73il@...il.com>,
	Christian Brauner <brauner@...nel.org>,
	Jeff Layton <jlayton@...nel.org>
Subject: [RFC v5] fuse: containerize ext4 for safer operation

Hi everyone,

[Ok maybe it's time to merge some of this stuff.  I'm removing the RFC
tag, but most likely the only patches that should get merged at this
point are the bugfixes at the start.  Don't merge the rest until after
the 2025 LTS kernel merge window closes, please.]

This is the fifth public draft of a prototype to connect the Linux fuse
driver to fs-iomap for regular file IO operations to and from files
whose contents persist to locally attached storage devices.  With this
release, I show that it's possible to build a fuse server for a real
filesystem (ext4) that runs entirely in userspace yet maintains most of
its performance.  Furthermore, I also show that the userspace program
runs with minimal privilege, which means that we no longer need to have
filesystem metadata parsing be a privileged (== risky) operation.

Why would you want to do that?  Most filesystem drivers are seriously
vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
over the nearly ten years of its existence.  Faulty code can lead to
total kernel compromise, and I think there's a very strong incentive to
move all that parsing out to userspace where we can containerize the
fuse server process.

willy's folios conversion project (and to a certain degree RH's new
mount API) have also demonstrated that treewide changes to the core
mm/pagecache/fs code are very very difficult to pull off and take years
because you have to understand every filesystem's bespoke use of that
core code.  Eeeugh.

The fuse command plumbing is very simple -- the ->iomap_begin,
->iomap_end, and iomap ->ioend calls within iomap are turned into
upcalls to the fuse server via a trio of new fuse commands.  Pagecache
writeback is now a directio write.  The fuse server is now able to
upsert mappings into the kernel for cached access (== zero upcalls for
rereads and pure overwrites!) and the iomap cache revalidation code
works.
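
To make the "zero upcalls for rereads" point concrete, here is a toy
Python model.  It is only a sketch: ToyServer, ToyKernel, and Mapping
are hypothetical names, not anything from the patchset or libfuse, and
for brevity it caches the result of an iomap_begin-style upcall rather
than modelling server-initiated upserts.

```python
# Toy model only: none of these names come from the patchset or libfuse.
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """A hypothetical file-range-to-device mapping, loosely like struct iomap."""
    offset: int   # file offset in bytes
    length: int   # length in bytes
    addr: int     # device address in bytes


class ToyServer:
    """Stands in for the fuse server; counts how often it is asked for a mapping."""
    def __init__(self, extent_size=4096):
        self.extent_size = extent_size
        self.upcalls = 0

    def iomap_begin(self, pos):
        self.upcalls += 1
        start = pos - (pos % self.extent_size)
        # Assumption: the file is laid out contiguously at device offset 1 MiB.
        return Mapping(start, self.extent_size, (1 << 20) + start)


class ToyKernel:
    """Stands in for the kernel side: consult cached mappings before upcalling."""
    def __init__(self, server):
        self.server = server
        self.cache = []

    def lookup(self, pos):
        for m in self.cache:
            if m.offset <= pos < m.offset + m.length:
                return m                    # cache hit: no upcall needed
        m = self.server.iomap_begin(pos)    # cache miss: one upcall
        self.cache.append(m)
        return m


server = ToyServer()
kernel = ToyKernel(server)
first = kernel.lookup(100)     # miss: upcall to the server
again = kernel.lookup(200)     # hit: same 4 KiB extent, no upcall
other = kernel.lookup(5000)    # miss: a different extent
```

Three lookups cost two upcalls here, because the second one lands in an
extent that is already cached.  A real implementation also has to
invalidate and revalidate cached mappings, which this toy ignores.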

At this stage I still get about 95% of the kernel ext4 driver's
streaming directio performance, and 110% of its streaming buffered IO
performance.  Random buffered IO is about 85% as fast as the kernel.
Random direct IO is about 80% as fast as the kernel; see the cover
letter for the fuse2fs iomap changes for more details.  Unwritten extent
conversions on random direct writes are especially painful for
fuse+iomap (~90% more overhead) due to upcall overhead.  And that's with
(now dynamic) debugging turned on!

These items have been addressed since the fourth RFC:

1. After six months, I have achieved my primary goal: a containerized
   filesystem server!  We can now run fuse4fs as a completely
   unprivileged and namespace-restricted systemd service on behalf of
   anyone who can open a file and mount it.  Many thanks again to
   Christian (and Miklos and Bernd and Amir) for their help!

   Someone who knows how to design socket-based protocols ought to have
   a look at the libfuse changes.  The mount helper and the fuse server
   communicate via an AF_UNIX socket, which enables the mount helper to
   pass resources into the service container.

2. I took a stab at implementing fsdax.  I then encountered the horror
   that is dax_writeback_mapping_range and abandoned that work.
   Writeback needs to iterate the file mappings and not make assumptions
   about the backing device ... but that's not a problem that anyone
   here needs to solve.

3. struct fuse_inode shrank after I verified that the iomap fileio paths
   never have to venture into the regular or wb cache paths.

4. fstests passes 99% of the tests that run, when iomap is enabled!
   96% pass when iomap is disabled, and I think that's due to some
   bugs in fstests.

5. Some VFS iflags (sync/immutable/append) now work.

6. iomap and passthrough share the backing file management code.  They
   are not expected to share backing files.
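
Item 1 above says that the mount helper passes resources into the
service container over an AF_UNIX socket.  The usual mechanism for that
is SCM_RIGHTS file descriptor passing, and a minimal Python sketch of it
follows.  This is generic fd passing, not the libfuse protocol: the
message contents, the temporary file standing in for a filesystem image,
and the use of socketpair() in a single process are all assumptions made
for illustration.  It needs Python 3.9 or later on a Unix system.

```python
# Illustration only: this is not the libfuse protocol, just generic fd passing.
import os
import socket
import tempfile

# socketpair() stands in for the connection between helper and server.
helper_sock, server_sock = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# "Mount helper" side: open a resource the sandboxed server could not open itself.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"pretend filesystem image")
    image_path = tmp.name
image_fd = os.open(image_path, os.O_RDONLY)

# Send the descriptor as SCM_RIGHTS ancillary data alongside a small message.
socket.send_fds(helper_sock, [b"image"], [image_fd])
os.close(image_fd)          # the helper's own copy is no longer needed

# "Server" side: receive the message and a new descriptor for the same file.
msg, fds, _flags, _addr = socket.recv_fds(server_sock, 1024, 1)
received = os.read(fds[0], 64)

# Clean up everything the sketch created.
os.close(fds[0])
helper_sock.close()
server_sock.close()
os.unlink(image_path)
```

The receiving side never needs permission to open the path itself; it
only needs the socket, which is what makes this pattern attractive for a
server running with minimal privilege.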

There are some major warts remaining:

a. I would like to start a discussion about how the design review of
   this code should be structured, and about how I might go about
   creating new userspace filesystem servers.  Should they be
   lightweight new ones based off the existing userspace tools?  Or
   should we merge lklfuse?

b. No design review document yet.

c. Why aren't we at 100% fstests passing?  Even with the kernel ext4?

d. I'm not 100% certain that the code that handles EOF zeroing actually
   works correctly.  Does fuse+iomap need to track both the server's
   and the VFS' notion of EOF the same way that XFS does?

e. ext4 doesn't support out of place writes so I don't know if that
   path actually works correctly in fuse+iomap.

f. fuse2fs doesn't support the ext4 journal.  Urk.

g. There's a VERY large quantity of fuse2fs improvements that need to be
   applied before we get to the fuse-iomap parts.  I'm not sending these
   (or the fstests changes) to keep the size of the patchbomb at
   "unreasonably large". :P

I'll work on these in October, but now you all have an alpha-complete
demonstration to take a look at.

--Darrick



