Message-ID: <20250916000759.GA8080@frogsfrogsfrogs>
Date: Mon, 15 Sep 2025 17:07:59 -0700
From: "Darrick J. Wong" <djwong@...nel.org>
To: linux-fsdevel <linux-fsdevel@...r.kernel.org>
Cc: Miklos Szeredi <miklos@...redi.hu>, Bernd Schubert <bernd@...ernd.com>,
	Joanne Koong <joannelkoong@...il.com>,
	John Groves <John@...ves.net>, Josef Bacik <josef@...icpanda.com>,
	linux-ext4 <linux-ext4@...r.kernel.org>,
	Theodore Ts'o <tytso@....edu>, Neal Gompa <neal@...pa.dev>,
	Amir Goldstein <amir73il@...il.com>,
	Christian Brauner <brauner@...nel.org>,
	Jeff Layton <jlayton@...nel.org>
Subject: [RFC v5] fuse: containerize ext4 for safer operation
Hi everyone,
[Ok maybe it's time to merge some of this stuff. I'm removing the RFC
tag, but most likely the only patches that should get merged at this
point are the bugfixes at the start. Don't merge the rest until after
the 2025 LTS kernel merge window closes, please.]
This is the fifth public draft of a prototype to connect the Linux fuse
driver to fs-iomap for regular file IO operations to and from files
whose contents persist to locally attached storage devices. With this
release, I show that it's possible to build a fuse server for a real
filesystem (ext4) that runs entirely in userspace yet maintains most of
its performance. Furthermore, I also show that the userspace program
runs with minimal privilege, which means that we no longer need to have
filesystem metadata parsing be a privileged (== risky) operation.
Why would you want to do that? Most filesystem drivers are seriously
vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
over almost a decade of its existence. Faulty code can lead to total
kernel compromise, and I think there's a very strong incentive to move
all that parsing out to userspace where we can containerize the fuse
server process.
willy's folios conversion project (and to a certain degree RH's new
mount API) have also demonstrated that treewide changes to the core
mm/pagecache/fs code are very very difficult to pull off and take years
because you have to understand every filesystem's bespoke use of that
core code. Eeeugh.
The fuse command plumbing is very simple -- the ->iomap_begin,
->iomap_end, and iomap ->ioend calls within iomap are turned into
upcalls to the fuse server via a trio of new fuse commands. Pagecache
writeback is now a directio write. The fuse server is now able to
upsert mappings into the kernel for cached access (== zero upcalls for
rereads and pure overwrites!) and the iomap cache revalidation code
works.
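
To make the "zero upcalls for rereads" point concrete, here is a toy
userspace model in C. It is only a sketch under stated assumptions, not
the kernel or libfuse code: every toy_* name is hypothetical, the 64 KiB
extent size and 1 MiB device offset are invented, and the cache is filled
lazily on a miss rather than by the server upserting mappings as the
patchset does. It shows only that a lookup served from a cached mapping
never has to reach the server.

```c
/* Toy model of cached file mappings; NOT the kernel or libfuse API. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a file mapping, loosely like struct iomap. */
struct toy_mapping {
	uint64_t offset;	/* file offset where the mapping starts */
	uint64_t length;	/* number of bytes covered */
	uint64_t addr;		/* device address of the first byte */
	bool valid;
};

#define TOY_CACHE_SLOTS 8

struct toy_file {
	struct toy_mapping cache[TOY_CACHE_SLOTS];
	size_t next_slot;	/* round-robin replacement cursor */
	unsigned int upcalls;	/* times we had to ask the "server" */
};

/*
 * Hypothetical server: pretends the file is laid out in 64 KiB extents
 * starting 1 MiB into the device (both numbers are made up).
 */
static struct toy_mapping toy_server_begin(uint64_t pos)
{
	const uint64_t extent = 64 * 1024;
	const uint64_t start = pos - (pos % extent);
	struct toy_mapping m = {
		.offset = start,
		.length = extent,
		.addr = (1024 * 1024) + start,
		.valid = true,
	};
	return m;
}

/* Find a cached mapping that covers pos, or NULL on a miss. */
static struct toy_mapping *toy_cache_lookup(struct toy_file *f, uint64_t pos)
{
	for (size_t i = 0; i < TOY_CACHE_SLOTS; i++) {
		struct toy_mapping *m = &f->cache[i];

		if (m->valid && pos >= m->offset && pos - m->offset < m->length)
			return m;
	}
	return NULL;
}

/* Translate pos to a device address, upcalling only on a cache miss. */
static uint64_t toy_map_pos(struct toy_file *f, uint64_t pos)
{
	struct toy_mapping *m = toy_cache_lookup(f, pos);

	if (!m) {
		f->upcalls++;
		m = &f->cache[f->next_slot];
		*m = toy_server_begin(pos);
		f->next_slot = (f->next_slot + 1) % TOY_CACHE_SLOTS;
	}
	assert(pos >= m->offset && pos - m->offset < m->length);
	return m->addr + (pos - m->offset);
}
```

With a zero-initialized struct toy_file, two lookups inside the same
extent bump the upcalls counter once; only a lookup in a different
extent bumps it again.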
At this stage I still get about 95% of the kernel ext4 driver's
streaming directio performance, and 110% of its streaming buffered IO
performance. Random buffered IO is about 85% as
fast as the kernel. Random direct IO is about 80% as fast as the
kernel; see the cover letter for the fuse2fs iomap changes for more
details. Unwritten extent conversions on random direct writes are
especially painful for fuse+iomap (~90% more overhead) due to the cost
of the upcalls. And that's with (now dynamic) debugging turned on!
These items have been addressed since the fourth RFC:
1. After six months, I have achieved my primary goal: a containerized
filesystem server! We can now run fuse4fs as a completely
unprivileged and namespace-restricted systemd service on behalf of
anyone who can open a file and mount it. Many thanks again to
Christian (and Miklos and Bernd and Amir) for their help!
Someone who knows how to design socket-based protocols ought to have
a look at the libfuse changes. The mount helper and the fuse server
communicate via an AF_UNIX socket, which enables the mount helper to
pass resources into the service container.
2. I took a stab at implementing fsdax. I then encountered the horror
that is dax_writeback_mapping_range and abandoned that work.
Writeback needs to iterate the file mappings and not make assumptions
about the backing device ... but that's not a problem that anyone
here needs to solve.
3. struct fuse_inode shrank after I verified that the iomap fileio paths
never have to venture into the regular or wb cache paths.
4. fstests passes 99% of the tests that run, when iomap is enabled!
96% pass when iomap is disabled, and I think that's due to some
bugs in fstests.
5. Some VFS iflags (sync/immutable/append) now work.
6. iomap and passthrough share the backing file management code. They
are not expected to share backing files.
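
Item 1 above says the mount helper passes resources to the fuse server
over an AF_UNIX socket. The standard way to hand an open file descriptor
across such a socket is an SCM_RIGHTS control message, so here is a
minimal C sketch of that mechanism. This is an assumption about the
general technique, not a description of the actual libfuse wire protocol:
send_fd() and recv_fd() are hypothetical helper names, and the one-byte
'F' payload is invented for illustration.

```c
/* Sketch of fd passing over AF_UNIX; NOT the libfuse protocol. */
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send one fd over a connected AF_UNIX socket. */
static int send_fd(int sock, int fd)
{
	char byte = 'F';	/* dummy payload to carry the control message */
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;	/* forces correct alignment */
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	assert(cmsg != NULL);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receive one fd; returns the new fd or -1. */
static int recv_fd(int sock)
{
	char byte;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union {
		char buf[CMSG_SPACE(sizeof(int))];
		struct cmsghdr align;
	} u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets its own descriptor referring to the same open file, so
an unprivileged server never needs permission to open the device path
itself; a privileged helper can open it and pass the descriptor in.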
There are some major warts remaining:
a. I would like to start a discussion about how the design review of
this code should be structured, and about how I might go about creating
new userspace filesystem servers. Should they be lightweight new ones
based off the existing userspace tools? Or should we merge lklfuse?
b. No design review document yet.
c. Why aren't we at 100% fstests passing? Even with the kernel ext4?
d. I'm not 100% certain that the code that handles EOF zeroing actually
works correctly. Does fuse+iomap need to track both the server's
and the VFS' notion of EOF the same way that XFS does?
e. ext4 doesn't support out of place writes so I don't know if that
actually works correctly.
f. fuse2fs doesn't support the ext4 journal. Urk.
g. There's a VERY large quantity of fuse2fs improvements that need to be
applied before we get to the fuse-iomap parts. I'm not sending these
(or the fstests changes) to keep the size of the patchbomb at
"unreasonably large". :P
I'll work on these in October, but now you all have an alpha-complete
demonstration to take a look at.
--Darrick