Message-ID: <20250515020624.GP1035866@frogsfrogsfrogs>
Date: Wed, 14 May 2025 19:06:24 -0700
From: "Darrick J. Wong" <djwong@...nel.org>
To: Miklos Szeredi <miklos@...redi.hu>
Cc: John Groves <John@...ves.net>, Dan Williams <dan.j.williams@...el.com>,
	Bernd Schubert <bschubert@....com>,
	John Groves <jgroves@...ron.com>, Jonathan Corbet <corbet@....net>,
	Vishal Verma <vishal.l.verma@...el.com>,
	Dave Jiang <dave.jiang@...el.com>,
	Matthew Wilcox <willy@...radead.org>, Jan Kara <jack@...e.cz>,
	Alexander Viro <viro@...iv.linux.org.uk>,
	Christian Brauner <brauner@...nel.org>,
	Luis Henriques <luis@...lia.com>,
	Randy Dunlap <rdunlap@...radead.org>,
	Jeff Layton <jlayton@...nel.org>,
	Kent Overstreet <kent.overstreet@...ux.dev>,
	Petr Vorel <pvorel@...e.cz>, Brian Foster <bfoster@...hat.com>,
	linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
	nvdimm@...ts.linux.dev, linux-cxl@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, Amir Goldstein <amir73il@...il.com>,
	Jonathan Cameron <Jonathan.Cameron@...wei.com>,
	Stefan Hajnoczi <shajnocz@...hat.com>,
	Joanne Koong <joannelkoong@...il.com>,
	Josef Bacik <josef@...icpanda.com>,
	Aravind Ramesh <arramesh@...ron.com>,
	Ajay Joshi <ajayjoshi@...ron.com>
Subject: Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps

On Tue, May 13, 2025 at 11:14:55AM +0200, Miklos Szeredi wrote:
> On Thu, 8 May 2025 at 17:56, Darrick J. Wong <djwong@...nel.org> wrote:
> 
> > Well right now my barely functional prototype exposes this interface
> > for communicating mappings to the kernel.  I've only gotten as far as
> > exposing the ->iomap_{begin,end} and ->iomap_ioend calls to the fuse
> > server with no caching, because the only functions I've implemented so
> > far are FIEMAP, SEEK_{DATA,HOLE}, and directio.
> >
> > So basically the kernel sends a FUSE_IOMAP_BEGIN command with the
> > desired (pos, count) file range to the fuse server, which responds with
> > a struct fuse_iomap_begin_out object that is translated into a struct
> > iomap.
> >
> > The fuse server then responds with a read mapping and a write mapping,
> > which tell the kernel from where to read data, and where to write data.
> 
> So far so good.
> 
> The iomap layer is non-caching, right?   This means that e.g. a
> direct_io request spanning two extents will result in two separate
> requests, since one FUSE_IOMAP_BEGIN can only return one extent.

Originally it wasn't supposed to be cached at all.  Then history taught
us a lesson. :P

In hindsight, the space mapping manipulations done by pagecache writes
and by reclaim writeback need to be coordinated.  A pagecache write can
get an unwritten iomap, then go to sleep while it
tries to get a folio.  In the meantime, writeback can find the folio for
that range, write it back to the disk (which converts unwritten to
written) and reclaim the folio.  Now the first process wakes up and
grabs a new folio.  Because its unwritten mapping is now stale, it must
not start zeroing that folio; it needs to go get a new mapping.

So iomap still doesn't need caching per se, but it needs writer threads
to revalidate the mapping after locking a folio.  The reason for caching
iomaps under the fuse_inode somewhere is that I don't want the
revalidations to have to jump all the way out to userspace with a folio
lock held.
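
To make that concrete, here's a minimal sketch of the revalidation I
have in mind, riffing on the validity cookie that iomap already
carries; fuse_iomap_cache (both the structure and the accessor) is a
made-up name:

/*
 * Hand-wavy sketch, not from the patchset: stash the last mapping from
 * the fuse server under the fuse_inode along with a sequence number,
 * and recheck the sequence once the folio is locked.  A mismatch means
 * the mapping went stale and the writer has to go get a new one.
 */
struct fuse_iomap_cache {
	spinlock_t	lock;
	u32		seq;		/* bumped whenever the mapping changes */
	struct iomap	iomap;		/* last mapping returned by the server */
};

/* copy out the cached mapping, remembering which version we sampled */
static void fuse_iomap_sample(struct fuse_iomap_cache *fic,
			      struct iomap *iomap)
{
	spin_lock(&fic->lock);
	*iomap = fic->iomap;
	iomap->validity_cookie = fic->seq;
	spin_unlock(&fic->lock);
}

/* called with the folio locked; returning false forces a new mapping */
static bool fuse_iomap_valid(struct inode *inode, const struct iomap *iomap)
{
	struct fuse_iomap_cache *fic = fuse_iomap_cache(inode);

	return iomap->validity_cookie == READ_ONCE(fic->seq);
}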

That said, on a VM on this 12-year-old workstation, I can get about
2.0GB/s for direct writes in fuse2fs and 2.2GB/s in kernel ext4, and
that's while issuing iomap_begin/end/ioend calls with no caching of the
mappings.
Pagecache writes run at about 1.9GB/s through fuse2fs and 1.5GB/s
through the kernel, but only if I tweak fuse to use large folios and a
relatively unconstrained bdi.  2GB/s might be enough IO for anyone. ;)

> And the next direct_io request may need to repeat the query for the
> same extent as the previous one if the I/O boundary wasn't on the
> extent boundary (which is likely).
> 
> So some sort of caching would make sense, but seeing the multitude of
> FUSE_IOMAP_OP_ types I'm not clearly seeing how that would look.

Yeah, it's confusing.  The design doc tries to clarify this, but this is
roughly what we need for fuse:

FUSE_IOMAP_OP_WRITE being set means we're writing to the file.
FUSE_IOMAP_OP_ZERO being set means we're zeroing the file.
Neither of those being set means we're reading the file.

(3 different operations)

FUSE_IOMAP_OP_DIRECT being set means directio, and it not being set
means pagecache.

(and one flag, for 6 different types of IO)

FUSE_IOMAP_OP_REPORT is set all by itself for things like FIEMAP and
SEEK_DATA/HOLE.
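
If it helps, here's how a fuse server might pick those apart; only the
flag names come from the prototype, the values below are made up purely
for illustration:

#include <stdint.h>

/* illustrative values only; the real ones live in the RFC headers */
#define FUSE_IOMAP_OP_WRITE	(1u << 0)	/* writing file data */
#define FUSE_IOMAP_OP_ZERO	(1u << 1)	/* zeroing a file range */
#define FUSE_IOMAP_OP_DIRECT	(1u << 2)	/* directio; clear means pagecache */
#define FUSE_IOMAP_OP_REPORT	(1u << 3)	/* FIEMAP / SEEK_DATA / SEEK_HOLE */

/* classify an iomap_begin request coming from the kernel */
static const char *fuse_iomap_op_name(uint32_t opflags)
{
	if (opflags & FUSE_IOMAP_OP_REPORT)
		return "report";	/* always set by itself */
	if (opflags & FUSE_IOMAP_OP_ZERO)
		return (opflags & FUSE_IOMAP_OP_DIRECT) ?
				"direct zero" : "buffered zero";
	if (opflags & FUSE_IOMAP_OP_WRITE)
		return (opflags & FUSE_IOMAP_OP_DIRECT) ?
				"direct write" : "buffered write";
	return (opflags & FUSE_IOMAP_OP_DIRECT) ?
				"direct read" : "buffered read";
}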

> > I'm a little confused, are you talking about FUSE_NOTIFY_INVAL_INODE?
> > If so, then I think that's the wrong layer -- INVAL_INODE invalidates
> > the page cache, whereas I'm talking about caching the file space
> > mappings that iomap uses to construct bios for disk IO, and possibly
> > wanting to invalidate parts of that cache to force the kernel to upcall
> > the fuse server for a new mapping.
> 
> Maybe I'm confused, as the layering is not very clear in my head yet.
> 
> But in your example you did say that the data as well as the mapping
> needs to be invalidated, so I thought that the simplest thing to do is
> to just invalidate the cached mapping from FUSE_NOTIFY_INVAL_INODE as
> well.

For now I want to keep the two invalidation types separate while I build
out more of the prototype so that I can be more sure that I haven't
broken any existing code. :)

The mapping invalidation might be more useful for things like FICLONE on
weird filesystems where the file allocation unit size is larger than the
block size and we actually need to invalidate more mappings than the vfs
knows about.

But I'm only 80% sure of that, as I'm still figuring out how to create a
notification and send it from fuse2fs and haven't gotten to the caching
layer yet.
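
As a strawman, the rounding I have in mind is nothing fancier than this
(completely hypothetical helper, not in the patchset):

#include <stdint.h>

/*
 * Round a clone destination range out to the file allocation unit so
 * that every mapping the filesystem might have touched gets dropped,
 * not just the byte range the vfs saw.
 */
static void fuse_iomap_inval_round(uint64_t pos, uint64_t count,
				   uint64_t alloc_unit,
				   uint64_t *inval_pos,
				   uint64_t *inval_count)
{
	uint64_t start = pos - (pos % alloc_unit);
	uint64_t end = pos + count;

	/* round the end of the range up to the next allocation unit */
	end += alloc_unit - 1;
	end -= end % alloc_unit;

	*inval_pos = start;
	*inval_count = end - start;
}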

--D

> Thanks,
> Miklos
> 
