lists.openwall.net | lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC | |
Open Source and information security mailing list archives
| ||
|
Message-ID: <20250609223159.GB6138@frogsfrogsfrogs> Date: Mon, 9 Jun 2025 15:31:59 -0700 From: "Darrick J. Wong" <djwong@...nel.org> To: Amir Goldstein <amir73il@...il.com> Cc: linux-fsdevel <linux-fsdevel@...r.kernel.org>, John@...ves.net, bernd@...ernd.com, miklos@...redi.hu, joannelkoong@...il.com, Josef Bacik <josef@...icpanda.com>, linux-ext4 <linux-ext4@...r.kernel.org>, Theodore Ts'o <tytso@....edu> Subject: Re: [RFC[RAP]] fuse: use fs-iomap for better performance so we can containerize ext4 On Thu, May 29, 2025 at 09:41:23PM +0200, Amir Goldstein wrote: > or > > On Thu, May 29, 2025 at 6:45 PM Darrick J. Wong <djwong@...nel.org> wrote: > > > > On Thu, May 22, 2025 at 06:24:50PM +0200, Amir Goldstein wrote: > > > On Thu, May 22, 2025 at 1:58 AM Darrick J. Wong <djwong@...nel.org> wrote: > > > > > > > > Hi everyone, > > > > > > > > DO NOT MERGE THIS. > > > > > > > > This is the very first request for comments of a prototype to connect > > > > the Linux fuse driver to fs-iomap for regular file IO operations to and > > > > from files whose contents persist to locally attached storage devices. > > > > > > > > Why would you want to do that? Most filesystem drivers are seriously > > > > vulnerable to metadata parsing attacks, as syzbot has shown repeatedly > > > > over almost a decade of its existence. Faulty code can lead to total > > > > kernel compromise, and I think there's a very strong incentive to move > > > > all that parsing out to userspace where we can containerize the fuse > > > > server process. > > > > > > > > willy's folios conversion project (and to a certain degree RH's new > > > > mount API) have also demonstrated that treewide changes to the core > > > > mm/pagecache/fs code are very very difficult to pull off and take years > > > > because you have to understand every filesystem's bespoke use of that > > > > core code. Eeeugh. > > > > > > > > The fuse command plumbing is very simple -- the ->iomap_begin, > > > > ->iomap_end, and iomap ioend calls within iomap are turned into upcalls > > > > to the fuse server via a trio of new fuse commands. This is suitable > > > > for very simple filesystems that don't do tricky things with mappings > > > > (e.g. FAT/HFS) during writeback. This isn't quite adequate for ext4, > > > > but solving that is for the next sprint. > > > > > > > > With this overly simplistic RFC, I am to show that it's possible to > > > > build a fuse server for a real filesystem (ext4) that runs entirely in > > > > userspace yet maintains most of its performance. At this early stage I > > > > get about 95% of the kernel ext4 driver's streaming directio performance > > > > on streaming IO, and 110% of its streaming buffered IO performance. > > > > Random buffered IO suffers a 90% hit on writes due to unwritten extent > > > > conversions. Random direct IO is about 60% as fast as the kernel; see > > > > the cover letter for the fuse2fs iomap changes for more details. > > > > > > > > > > Very cool! > > > > > > > There are some major warts remaining: > > > > > > > > 1. The iomap cookie validation is not present, which can lead to subtle > > > > races between pagecache zeroing and writeback on filesystems that > > > > support unwritten and delalloc mappings. > > > > > > > > 2. Mappings ought to be cached in the kernel for more speed. > > > > > > > > 3. iomap doesn't support things like fscrypt or fsverity, and I haven't > > > > yet figured out how inline data is supposed to work. > > > > > > > > 4. I would like to be able to turn on fuse+iomap on a per-inode basis, > > > > which currently isn't possible because the kernel fuse driver will iget > > > > inodes prior to calling FUSE_GETATTR to discover the properties of the > > > > inode it just read. > > > > > > Can you make the decision about enabling iomap on lookup? > > > The plan for passthrough for inode operations was to allow > > > setting up passthough config of inode on lookup. > > > > The main requirement (especially for buffered IO) is that we've set the > > address space operations structure either to the regular fuse one or to > > the fuse+iomap ops before clearing INEW because the iomap/buffered-io.c > > code assumes that cannot change on a live inode. > > > > So I /think/ we could ask the fuse server at inode instantiation time > > (which, if I'm reading the code correctly, is when iget5_locked gives > > fuse an INEW inode and calls fuse_init_inode) provided it's ok to upcall > > to userspace at that time. Alternately I guess we could extend struct > > fuse_attr with another FUSE_ATTR_ flag, I think? > > > > The latter. Either extend fuse_attr or struct fuse_entry_out, > which is in the responses of FUSE_LOOKUP, > FUSE_READDIRPLUS, FUSE_CREATE, FUSE_TMPFILE. > which instantiate fuse inodes. > > There is a very hand wavy discussion about this at: > https://lore.kernel.org/linux-fsdevel/CAOQ4uxi2w+S4yy3yiBvGpJYSqC6GOTAZQzzjygaH3TjH7Uc4+Q@mail.gmail.com/ > > In a nutshell, we discussed adding a new FUSE_LOOKUP_HANDLE > command that uses the variable length file handle instead of nodeid > as a key for the inode. > > So we will have to extend fuse_entry_out anyway, but TBH I never got to > look at the gritty details of how best to extend all the relevant commands, > so I hope I am not sending you down the wrong path. I found another twist to this story: the upper level libfuse3 library assigns distinct nodeids for each directory entry. These nodeids are passed into the kernel and appear to the basis for an iget5_locked call. IOWs, each nodeid causes a struct fuse_inode to be created in the kernel. For a single-linked file this is no big deal, but for a hardlink this makes iomap a mess because this means that in fuse2fs, an ext2 inode can map to multiple kernel fuse_inode objects. This /really/ breaks the locking model of iomap, which assumes that there's one in-kernel inode and that it can use i_rwsem to synchronize updates. So I'm going to have to find a way to deal with this. I tried trivially messing with libfuse nodeid assigment but that blew some assertion. Maybe your LOOKUP_HANDLE thing would work. > > > > 5. ext4 doesn't support out of place writes so I don't know if that > > > > actually works correctly. > > > > > > > > 6. iomap is an inode-based service, not a file-based service. This > > > > means that we /must/ push ext2's inode numbers into the kernel via > > > > FUSE_GETATTR so that it can report those same numbers back out through > > > > the FUSE_IOMAP_* calls. However, the fuse kernel uses a separate nodeid > > > > to index its incore inode, so we have to pass those too so that > > > > notifications work properly. > > > > > > > > > > Again, I might be missing something, but as long as the fuse filesystem > > > is exposing a single backing filesystem, it should be possible to make > > > sure (via opt-in) that fuse nodeid's are equivalent to the backing fs > > > inode number. > > > See sketch in this WIP branch: > > > https://github.com/amir73il/linux/commit/210f7a29a51b085ead9f555978c85c9a4a503575 > > > > I think this would work in many places, except for filesystems with > > 64-bit inumbers on 32-bit machines. That might be a good argument for > > continuing to pass along the nodeid and fuse_inode::orig_ino like it > > does now. Plus there are some filesystems that synthesize inode numbers > > so tying the two together might not be feasible/desirable anyway. > > > > Though one nice feature of letting fuse have its own nodeids might be > > that if the in-memory index switches to a tree structure, then it could > > be more compact if the filesystem's inumbers are fairly sparse like xfs. > > OTOH the current inode hashtable has been around for a very long time so > > that might not be a big concern. For fuse2fs it doesn't matter since > > ext4 inumbers are u32. > > > > I wanted to see if declaring one-to-one 64bit ino can simplify things > for the first version of inode ops passthrough. > If this is not the case, or if this is too much of a limitation for > your use case > then nevermind. > But if it is a good enough shortcut for the demo and can be extended later, > then why not. It's very tempting, because it's very confusing to have nodeids and stat st_ino not be the same thing. --D > Thanks, > Amir. >
Powered by blists - more mailing lists