Message-ID: <20251029002755.GK6174@frogsfrogsfrogs>
Date: Tue, 28 Oct 2025 17:27:55 -0700
From: "Darrick J. Wong" <djwong@...nel.org>
To: linux-fsdevel <linux-fsdevel@...r.kernel.org>
Cc: Miklos Szeredi <miklos@...redi.hu>, Bernd Schubert <bernd@...ernd.com>,
	Joanne Koong <joannelkoong@...il.com>,
	linux-ext4 <linux-ext4@...r.kernel.org>,
	Theodore Ts'o <tytso@....edu>, Neal Gompa <neal@...pa.dev>,
	Amir Goldstein <amir73il@...il.com>,
	Christian Brauner <brauner@...nel.org>,
	Jeff Layton <jlayton@...nel.org>
Subject: [PATCHBOMB v6] fuse: containerize ext4 for safer operation

Look ma, no more RFC tag!

This is the sixth public draft of a prototype to connect the Linux fuse
driver to fs-iomap for regular file IO operations to and from files
whose contents persist to locally attached storage devices.  With this
release, I show that it's possible to build a fuse server for a real
filesystem (ext4) that runs entirely in userspace yet maintains most of
its performance.  Furthermore, I also show that the userspace program
runs with minimal privilege, which means that we no longer need to have
filesystem metadata parsing be a privileged (== risky) operation.

Why would you want to do that?  Most filesystem drivers are seriously
vulnerable to metadata parsing attacks, as syzbot has shown repeatedly
over almost a decade of its existence.  Faulty code can lead to total
kernel compromise, and I think there's a very strong incentive to move
all that parsing out to userspace where we can containerize the fuse
server process.

willy's folios conversion project (and to a certain degree RH's new
mount API) have also demonstrated that treewide changes to the core
mm/pagecache/fs code are very very difficult to pull off and take years
because you have to understand every filesystem's bespoke use of that
core code.  Eeeugh.

The fuse command plumbing is very simple -- the ->iomap_begin,
->iomap_end, and iomap ->ioend calls within iomap are turned into
upcalls to the fuse server via a trio of new fuse commands.  Pagecache
writeback is now a directio write.  The fuse server is now able to
upsert mappings into the kernel for cached access (== zero upcalls for
rereads and pure overwrites!) and the iomap cache revalidation code
works.
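
To make the "zero upcalls for rereads" point concrete, here is a toy
model in Python.  It is only a sketch of the caching idea, not the real
fuse protocol or the kernel iomap code: the names Mapping, ToyServer and
ToyKernelCache are made up for illustration, and overlapping or
invalidated mappings are ignored entirely.

```python
from bisect import bisect_right, insort
from dataclasses import dataclass


@dataclass
class Mapping:
    """Hypothetical file-range-to-device mapping (not the real iomap struct)."""
    offset: int  # file offset in bytes
    length: int  # length of the mapped range in bytes
    addr: int    # device address backing the range


class ToyServer:
    """Stands in for the userspace fuse server and counts upcalls."""

    def __init__(self, extents):
        self.extents = extents
        self.upcalls = 0

    def iomap_begin(self, pos):
        # Every call here models one round trip to userspace.
        self.upcalls += 1
        for m in self.extents:
            if m.offset <= pos < m.offset + m.length:
                return m
        raise KeyError(pos)


class ToyKernelCache:
    """Stands in for the kernel-side mapping cache."""

    def __init__(self, server):
        self.server = server
        self.starts = []  # sorted start offsets of cached mappings
        self.maps = {}    # start offset -> Mapping

    def upsert(self, m):
        # Insert a new mapping or replace one with the same start offset.
        if m.offset not in self.maps:
            insort(self.starts, m.offset)
        self.maps[m.offset] = m

    def lookup(self, pos):
        # Find the cached mapping with the largest start offset <= pos.
        i = bisect_right(self.starts, pos) - 1
        if i >= 0:
            m = self.maps[self.starts[i]]
            if pos < m.offset + m.length:
                return m  # cache hit: no upcall
        # Cache miss: ask the server, then remember the answer.
        m = self.server.iomap_begin(pos)
        self.upsert(m)
        return m
```

In this model the first lookup inside an extent costs one upcall, and
every later lookup in the same extent is served from the cache.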

At this stage I still get about 95% of the kernel ext4 driver's
streaming directio performance, and 110% of its
streaming buffered IO performance.  Random buffered IO is about 85% as
fast as the kernel.  Random direct IO is about 80% as fast as the
kernel; see the cover letter for the fuse2fs iomap changes for more
details.  Unwritten extent conversions on random direct writes are
especially painful for fuse+iomap (~90% more overhead) due to upcall
overhead.  And that's with (now dynamic) debugging turned on!

These items have been addressed since the fifth RFC:

1. After seven months of work, I can get seven of my 15 or so testing
   profiles to pass fstests on most days.  There are a few flaky tests
   like generic/347 that (I think) sometimes fail because there's no
   journalling in jbd2.  That's better than kernel ext4, which never
   gets all the way to passing here.

2. Swap files, filesystem freeze and thaw, and shutdowns now work.

3. fuse4fs can now use PSI information as a clue that it's time for it
   to flush its caches and evict them.
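
For readers who haven't looked at PSI before, here is a small Python
sketch of reading a pressure file and turning it into a "flush now"
hint.  The text format is the one the kernel exposes in
/proc/pressure/memory; everything else is an assumption for
illustration.  This mail doesn't say how fuse4fs actually consumes PSI,
and the should_flush() helper and its 10% threshold are hypothetical.

```python
def parse_psi(text):
    """Parse PSI text such as the contents of /proc/pressure/memory.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    Returns {"some": {"avg10": 0.0, ...}, "full": {...}}.
    """
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        kind, pairs = fields[0], fields[1:]
        stats[kind] = {
            key: float(value)
            for key, value in (pair.split("=", 1) for pair in pairs)
        }
    return stats


def should_flush(text, threshold=10.0):
    """Hypothetical policy: flush when the 10-second 'some' average
    memory pressure reaches the threshold (a percentage)."""
    some = parse_psi(text).get("some", {})
    return some.get("avg10", 0.0) >= threshold
```

A server could feed it open("/proc/pressure/memory").read() from a
periodic timer.  The kernel also supports PSI triggers that can be
waited on with poll(), which would avoid the periodic wakeups.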

There are some warts remaining:

a. I would like to start a discussion about how the design review of
   this code should be structured, and about how I might go about
   creating new userspace filesystem servers.  Should they be
   lightweight new ones based off the existing userspace tools?  Or
   should we merge lklfuse?

b. ext4 doesn't support out of place writes, so I don't know if that
   path actually works correctly.

c. fuse2fs doesn't support the ext4 journal.  Urk.

d. There's a VERY large quantity of fuse2fs improvements that need to be
   applied before we get to the fuse-iomap parts.  I'm not sending these
   (or the fstests changes) to keep the size of the patchbomb at
   "unreasonably large". :P  As a result, the fstests and e2fsprogs
   postings are very targeted.

I'll work on these in November, but I'm much more serious about
getting this merged for 6.19 now that the LTS is past and the coast is
clear.

--Darrick
