Message-ID: <20250821003720.GA4194186@frogsfrogsfrogs>
Date: Wed, 20 Aug 2025 17:37:20 -0700
From: "Darrick J. Wong" <djwong@...nel.org>
To: linux-fsdevel <linux-fsdevel@...r.kernel.org>
Cc: Miklos Szeredi <miklos@...redi.hu>, Bernd Schubert <bernd@...ernd.com>,
Joanne Koong <joannelkoong@...il.com>,
John Groves <John@...ves.net>, Josef Bacik <josef@...icpanda.com>,
linux-ext4 <linux-ext4@...r.kernel.org>,
Theodore Ts'o <tytso@....edu>, Neal Gompa <neal@...pa.dev>,
Amir Goldstein <amir73il@...il.com>,
Christian Brauner <brauner@...nel.org>,
Jeff Layton <jlayton@...nel.org>
Subject: [RFC v4] fuse: use fs-iomap for better performance so we can
containerize ext4
Hi everyone,
Do not merge this, still!!
This is the fourth request for comments of a prototype to connect the
Linux fuse driver to fs-iomap for regular file IO operations to and from
files whose contents persist to locally attached storage devices.
Why would you want to do that? Most filesystem drivers are seriously
vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
over almost a decade of its existence. Faulty code can lead to total
kernel compromise, and I think there's a very strong incentive to move
all that parsing out to userspace where we can containerize the fuse
server process.
willy's folios conversion project (and to a certain degree RH's new
mount API) have also demonstrated that treewide changes to the core
mm/pagecache/fs code are very very difficult to pull off and take years
because you have to understand every filesystem's bespoke use of that
core code. Eeeugh.
The fuse command plumbing is very simple -- the ->iomap_begin,
->iomap_end, and iomap ->ioend calls within iomap are turned into
upcalls to the fuse server via a trio of new fuse commands. Pagecache
writeback is now a directio write. The fuse server is now able to
upsert mappings into the kernel for cached access (== zero upcalls for
rereads and pure overwrites!) and the iomap cache revalidation code
works.
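As a rough illustration of why upserted mappings mean zero upcalls for
rereads and pure overwrites, here is a toy model of a mapping cache with
a revalidation cookie. This is not code from the patchset: every name
below (demo_mapping, demo_cache, demo_iomap_begin, and so on) is
hypothetical, the "server" is a stub function rather than a real fuse
upcall, and the fixed-size slot array is only there to keep the sketch
short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All names here are hypothetical; this is NOT the fuse-iomap ABI. */
enum demo_map_type { DEMO_HOLE, DEMO_MAPPED, DEMO_UNWRITTEN };

struct demo_mapping {
	uint64_t offset;	/* file offset, bytes */
	uint64_t length;	/* mapping length, bytes */
	uint64_t addr;		/* device address, bytes */
	enum demo_map_type type;
	uint64_t cookie;	/* cache generation at upsert time */
	bool valid;
};

#define DEMO_CACHE_SLOTS 8

struct demo_cache {
	struct demo_mapping slots[DEMO_CACHE_SLOTS];
	uint64_t cookie;	/* bump to invalidate every cached mapping */
	unsigned int upcalls;	/* simulated round trips to the server */
};

/* Stub server: 1 MiB extents laid out starting at device offset 16 MiB. */
static struct demo_mapping demo_server_map(uint64_t pos)
{
	const uint64_t extent = 1ULL << 20;
	struct demo_mapping m = {
		.offset = pos - (pos % extent),
		.length = extent,
		.type = DEMO_MAPPED,
	};

	m.addr = (16ULL << 20) + m.offset;
	return m;
}

/* Find a mapping that covers pos and is still current. */
static bool demo_lookup(const struct demo_cache *c, uint64_t pos,
			struct demo_mapping *out)
{
	for (size_t i = 0; i < DEMO_CACHE_SLOTS; i++) {
		const struct demo_mapping *m = &c->slots[i];

		if (m->valid && m->cookie == c->cookie &&
		    pos >= m->offset && pos - m->offset < m->length) {
			*out = *m;
			return true;
		}
	}
	return false;
}

/* Update a live mapping at the same offset, else take a dead slot. */
static void demo_upsert(struct demo_cache *c, struct demo_mapping m)
{
	size_t victim = 0;	/* evict slot 0 if nothing better turns up */
	bool found_dead = false;

	for (size_t i = 0; i < DEMO_CACHE_SLOTS; i++) {
		const struct demo_mapping *s = &c->slots[i];
		bool live = s->valid && s->cookie == c->cookie;

		if (live && s->offset == m.offset) {
			victim = i;
			break;
		}
		if (!live && !found_dead) {
			victim = i;
			found_dead = true;
		}
	}
	m.cookie = c->cookie;
	m.valid = true;
	c->slots[victim] = m;
}

/* Everything cached before this call fails revalidation afterwards. */
static void demo_invalidate(struct demo_cache *c)
{
	c->cookie++;
}

/* Serve from the cache when possible; otherwise "upcall" and cache it. */
static bool demo_iomap_begin(struct demo_cache *c, uint64_t pos,
			     struct demo_mapping *out)
{
	assert(c && out);
	if (demo_lookup(c, pos, out))
		return true;	/* hit: no upcall */
	c->upcalls++;
	demo_upsert(c, demo_server_map(pos));
	return demo_lookup(c, pos, out);
}
```

In this model the first lookup of an extent costs one upcall, later
lookups inside the same extent cost none, and demo_invalidate() forces
the next lookup back out to the server.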
With this RFC, I am able to show that it's possible to build a fuse
server for a real filesystem (ext4) that runs entirely in userspace yet
maintains most of its performance. At this stage I still get about 95%
of the kernel ext4 driver's streaming directio performance, and 110% of
its streaming buffered IO performance. Random buffered
IO is about 85% as fast as the kernel. Random direct IO is about 80% as
fast as the kernel; see the cover letter for the fuse2fs iomap changes
for more details. Unwritten extent conversions on random direct writes
are especially painful for fuse+iomap (~90% more overhead) due to upcall
overhead. And that's with (now dynamic) debugging turned on!
These items have been addressed since the third RFC:
1. fuse2fs has been forked into fuse4fs, which now talks to the low
level fuse interface. This avoids all the path walking that the
high level fuse library provides, which dramatically improves the
performance of fuse4fs. fstests runs in half the time now. Many
thanks to Amir Goldstein for giving me a rough draft of the
conversion!
2. I simplified the configuration protocols -- now there's a per-fs
bit to enable any iomap, and a per-inode bit to enable iomap on a
specific file. Registration of iomap devices now uses the backing
fd registration interface.
3. You can now specify the root nodeid for any fuse mount.
4. Atomic writes are working, at least for single fsblocks.
5. I've ported the cache implementation from xfsprogs to e2fsprogs
libsupport, so the inode and buffer caches can now dynamically grow
to support larger working sets. No more fixed-size caches!
6. Cleaned up the kernel/libfuse ABI quite a bit.
7. fstests passes 97% of the tests that run, when iomap is enabled!
Only 93% pass when iomap is disabled, and I think that's due to some
bugs in the ACL and mode handling code.
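The two-level enablement in item 2 can be pictured roughly as follows.
This is only a sketch under the assumption that the per-fs bit gates the
per-inode bit; the flag names, values, and structs are invented for
illustration and are not the ABI from the patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag names and values; not the real kernel/libfuse ABI. */
#define DEMO_FS_IOMAP		(1u << 0)	/* per-fs: any iomap at all */
#define DEMO_INODE_IOMAP	(1u << 0)	/* per-inode: iomap for this file */

struct demo_fs {
	uint32_t flags;
};

struct demo_inode {
	const struct demo_fs *fs;
	uint32_t flags;
};

/* Assumed rule: a file uses iomap only if both bits are set. */
static bool demo_inode_uses_iomap(const struct demo_inode *ip)
{
	assert(ip && ip->fs);
	if (!(ip->fs->flags & DEMO_FS_IOMAP))
		return false;
	return (ip->flags & DEMO_INODE_IOMAP) != 0;
}
```

Under that assumption, clearing the per-fs bit turns iomap off for every
file no matter what the per-inode bits say.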
There are some major warts remaining:
a. I've a /much/ clearer picture of how one might containerize a
filesystem server, thanks to a lot of input from Christian Brauner
in response to v3. I think I have enough pieces to try setting up
a fd-passing interface into a systemd service ... but I haven't
actually written any of it yet.
b. fsdax isn't implemented. I think I'm going to work on this for
RFC v5 to see if we can simplify the file mapping handling in famfs.
If not, then everyone else gets fsdax for free.
c. ext4 doesn't support out of place writes so I don't know if that
actually works correctly.
d. I've not yet consolidated struct fuse_inode, so the iomap gunk still
eats rather a lot of space per inode.
e. fuse2fs doesn't support the ext4 journal. Urk.
f. There's a VERY large quantity of fuse2fs improvements that need to be
applied before we get to the fuse-iomap parts. I'm not sending these
(or the fstests changes) to keep the size of the patchbomb at
"unreasonably large". :P
I'll work on these in August/September, but for now here's an
unmergeable RFC to start some discussion.
--Darrick