Message-ID: <174787198370.1484996.3340565971108603226.stgit@frogsfrogsfrogs>
Date: Wed, 21 May 2025 17:02:25 -0700
From: "Darrick J. Wong" <djwong@...nel.org>
To: tytso@....edu
Cc: John@...ves.net, linux-ext4@...r.kernel.org, miklos@...redi.hu,
	joannelkoong@...il.com, bernd@...ernd.com,
	linux-fsdevel@...r.kernel.org
Subject: [PATCHSET RFC[RAP] 3/3] fuse2fs: use fuse iomap data paths for better file I/O performance

Hi all,

Switch fuse2fs to use the new iomap file data IO paths instead of
pushing it very slowly through the /dev/fuse connection. For local
filesystems, all we have to do is respond to requests for file to
device mappings; the rest of the IO hot path stays within the kernel.
This means that we can get rid of all file data block processing within
fuse2fs.

Because we're not pinning dirty pages through a potentially slow network
connection, we don't need the heavy BDI throttling for which most fuse
servers have become infamous. Yes, mapping lookups for writeback can
stall, but mappings are small compared to data and this situation
exists for all kernel filesystems as well.

The performance of this new data path is quite stunning: on a warm
system, streaming reads and writes through the pagecache go from
60-90MB/s to 2-2.5GB/s. Direct IO reads and writes improve from the
same baseline to 2.5-8GB/s. FIEMAP and SEEK_DATA/SEEK_HOLE now work
too.

The kernel ext4 driver can manage about 1.6GB/s for pagecache IO and
about 2.6-8.5GB/s for direct IO, which means that fuse2fs is about as
fast as the kernel for streaming file IO.

Random 4k buffered IO is not so good: plain fuse2fs pokes along at
25-50MB/s, whereas fuse2fs with iomap manages 90-1300MB/s. The kernel
can do 900-1300MB/s. Random directio is worse: plain fuse2fs does
20-30MB/s, fuse-iomap does about 30-35MB/s, and the kernel does
40-55MB/s. I suspect that metadata-heavy workloads do not perform well
on fuse2fs because libext2fs wasn't designed for that and it doesn't
even have a journal to absorb all the fsync writes.
We also probably need iomap caching really badly.

These performance numbers are slanted: my machine is 12 years old, and
fuse2fs is VERY poorly optimized for performance. It contains a single
Big Filesystem Lock which nukes multi-threaded scalability. There's no
inode cache nor is there a proper buffer cache, which means that fuse2fs
reads metadata in from disk and checksums it on EVERY ACCESS. Sad!

Despite these gaps, this RFC demonstrates that it's feasible to run the
metadata parsing parts of a filesystem in userspace while not
sacrificing much performance. We now have a vehicle to move the
filesystems out of the kernel, where they can be containerized so that
malicious filesystems can be contained, somewhat.

iomap mode also calls FUSE_DESTROY before unmounting the filesystem, so
for capable systems, fuse2fs doesn't need to run in fuseblk mode
anymore.

However, there are some major warts remaining:

1. The iomap cookie validation is not present, which can lead to subtle
   races between pagecache zeroing and writeback on filesystems that
   support unwritten and delalloc mappings.

2. Mappings ought to be cached in the kernel for more speed.

3. iomap doesn't support things like fscrypt or fsverity, and I haven't
   yet figured out how inline data is supposed to work.

4. I would like to be able to turn on fuse+iomap on a per-inode basis,
   which currently isn't possible because the kernel fuse driver will
   iget inodes prior to calling FUSE_GETATTR to discover the properties
   of the inode it just read.

5. ext4 doesn't support out of place writes so I don't know if that
   actually works correctly.

6. iomap is an inode-based service, not a file-based service. This
   means that we /must/ push ext2's inode numbers into the kernel via
   FUSE_GETATTR so that it can report those same numbers back out
   through the FUSE_IOMAP_* calls. However, the fuse kernel uses a
   separate nodeid to index its incore inode, so we have to pass those
   too so that notifications work properly.

I'll work on these in June, but for now here's an unmergeable RFC to
start some discussion.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

Comments and questions are, as always, welcome.

e2fsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/e2fsprogs.git/log/?h=fuse2fs-iomap
---
Commits in this patchset:
 * fuse2fs: implement bare minimum iomap for file mapping reporting
 * fuse2fs: register block devices for use with iomap
 * fuse2fs: always use directio disk reads with fuse2fs
 * fuse2fs: implement directio file reads
 * fuse2fs: use tagged block IO for zeroing sub-block regions
 * fuse2fs: only flush the cache for the file under directio read
 * fuse2fs: add extent dump function for debugging
 * fuse2fs: implement direct write support
 * fuse2fs: turn on iomap for pagecache IO
 * fuse2fs: flush and invalidate the buffer cache on trim
 * fuse2fs: improve tracing for fallocate
 * fuse2fs: don't zero bytes in punch hole
 * fuse2fs: don't do file data block IO when iomap is enabled
 * fuse2fs: disable most io channel flush/invalidate in iomap pagecache mode
 * fuse2fs: re-enable the block device pagecache for metadata IO
 * fuse2fs: avoid fuseblk mode if fuse-iomap support is likely
---
 configure       |   47 ++
 configure.ac    |   32 +
 lib/config.h.in |    3 
 misc/fuse2fs.c  | 1251 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 1312 insertions(+), 21 deletions(-)
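
As a rough illustration of what "respond to requests for file to device
mappings" can look like, here is a minimal standalone C sketch. It is
not the fuse2fs code and it does not use the fuse-iomap or libext2fs
interfaces; every name in it (demo_extent, demo_mapping,
demo_map_lookup) is hypothetical, and it assumes a small in-memory
extent table that is sorted by file offset and has no overlaps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping types; a real protocol defines its own set. */
enum demo_map_type { DEMO_MAP_HOLE, DEMO_MAP_MAPPED };

/* One contiguous run of file data and where it lives on the device. */
struct demo_extent {
	uint64_t file_off;	/* byte offset within the file */
	uint64_t dev_addr;	/* byte address on the block device */
	uint64_t length;	/* length of the run in bytes */
};

/* Reply to a single mapping query. */
struct demo_mapping {
	enum demo_map_type type;
	uint64_t file_off;	/* start of the range described */
	uint64_t dev_addr;	/* meaningful only for DEMO_MAP_MAPPED */
	uint64_t length;	/* may be shorter than the request */
};

/*
 * Answer "where does the byte range [pos, pos + count) live?".
 * The reply always starts at pos and describes either a hole or a
 * mapped run, trimmed to the request.  If the reply is shorter than
 * the request, the caller is expected to ask again for the remainder.
 */
struct demo_mapping demo_map_lookup(const struct demo_extent *ext,
				    size_t nr, uint64_t pos, uint64_t count)
{
	/* Default reply: the whole request is a hole. */
	struct demo_mapping m = { DEMO_MAP_HOLE, pos, 0, count };
	size_t i;

	assert(count > 0);

	for (i = 0; i < nr; i++) {
		uint64_t end = ext[i].file_off + ext[i].length;

		if (pos >= end)
			continue;	/* extent lies entirely before pos */

		if (pos < ext[i].file_off) {
			/* pos sits in a hole that ends at this extent */
			uint64_t hole = ext[i].file_off - pos;

			if (hole < count)
				m.length = hole;
			return m;
		}

		/* pos falls inside this extent */
		m.type = DEMO_MAP_MAPPED;
		m.dev_addr = ext[i].dev_addr + (pos - ext[i].file_off);
		m.length = (end - pos < count) ? end - pos : count;
		return m;
	}

	return m;	/* past the last extent: hole for the whole request */
}
```

The point of the sketch is that the reply carries only a few integers
per query, which is why mappings are small compared to the data they
describe; the data itself never has to pass through the lookup.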