Message-ID: <20250508155644.GM1035866@frogsfrogsfrogs>
Date: Thu, 8 May 2025 08:56:44 -0700
From: "Darrick J. Wong" <djwong@...nel.org>
To: Miklos Szeredi <miklos@...redi.hu>
Cc: John Groves <John@...ves.net>, Dan Williams <dan.j.williams@...el.com>,
	Bernd Schubert <bschubert@....com>,
	John Groves <jgroves@...ron.com>, Jonathan Corbet <corbet@....net>,
	Vishal Verma <vishal.l.verma@...el.com>,
	Dave Jiang <dave.jiang@...el.com>,
	Matthew Wilcox <willy@...radead.org>, Jan Kara <jack@...e.cz>,
	Alexander Viro <viro@...iv.linux.org.uk>,
	Christian Brauner <brauner@...nel.org>,
	Luis Henriques <luis@...lia.com>,
	Randy Dunlap <rdunlap@...radead.org>,
	Jeff Layton <jlayton@...nel.org>,
	Kent Overstreet <kent.overstreet@...ux.dev>,
	Petr Vorel <pvorel@...e.cz>, Brian Foster <bfoster@...hat.com>,
	linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
	nvdimm@...ts.linux.dev, linux-cxl@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, Amir Goldstein <amir73il@...il.com>,
	Jonathan Cameron <Jonathan.Cameron@...wei.com>,
	Stefan Hajnoczi <shajnocz@...hat.com>,
	Joanne Koong <joannelkoong@...il.com>,
	Josef Bacik <josef@...icpanda.com>,
	Aravind Ramesh <arramesh@...ron.com>,
	Ajay Joshi <ajayjoshi@...ron.com>, 0@...ves.net
Subject: Re: [RFC PATCH 13/19] famfs_fuse: Create files with famfs fmaps

On Tue, May 06, 2025 at 06:56:29PM +0200, Miklos Szeredi wrote:
> On Mon, 28 Apr 2025 at 21:00, Darrick J. Wong <djwong@...nel.org> wrote:
> 
> > <nod> I don't know what Miklos' opinion is about having multiple
> > fusecmds that do similar things -- on the one hand keeping yours and my
> > efforts separate explodes the amount of userspace abi that everyone must
> > maintain, but on the other hand it then doesn't couple our projects
> > together, which might be a good thing if it turns out that our domain
> > models are /really/ actually quite different.
> 
> Sharing the interface at least would definitely be worthwhile, as
> there does not seem to be a great deal of difference between the
> generic one and the famfs specific one.  Only implementing part of the
> functionality that the generic one provides would be fine.

Well right now my barely functional prototype exposes this interface
for communicating mappings to the kernel.  I've only gotten as far as
exposing the ->iomap_{begin,end} and ->iomap_ioend calls to the fuse
server with no caching, because the only functions I've implemented so
far are FIEMAP, SEEK_{DATA,HOLE}, and directio.

So basically the kernel sends a FUSE_IOMAP_BEGIN command with the
desired (pos, count) file range to the fuse server, which responds with
a struct fuse_iomap_begin_out object that is translated into a struct
iomap.

That response carries a read mapping and a write mapping, which tell
the kernel where to read data from and where to write data to.
As a shortcut, the write mapping can be of type
FUSE_IOMAP_TYPE_PURE_OVERWRITE to avoid having to fill out fields twice.

iomap_end is only called if there were errors while processing the
mapping, or if the fuse server sets FUSE_IOMAP_F_WANT_IOMAP_END.

iomap_ioend is called after read or write IOs complete, so that the
filesystem can update mapping metadata (e.g. unwritten extent
conversion, remapping after an out of place write, ondisk isize update).

Some of the flags here might not be needed or workable; I was merely
cutting and pasting the #defines from iomap.h.

#define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(0xFFFF) /* use read mapping data */
#define FUSE_IOMAP_TYPE_HOLE		0	/* no blocks allocated, need allocation */
#define FUSE_IOMAP_TYPE_DELALLOC	1	/* delayed allocation blocks */
#define FUSE_IOMAP_TYPE_MAPPED		2	/* blocks allocated at @addr */
#define FUSE_IOMAP_TYPE_UNWRITTEN	3	/* blocks allocated at @addr in unwritten state */
#define FUSE_IOMAP_TYPE_INLINE		4	/* data inline in the inode */

#define FUSE_IOMAP_DEV_SBDEV		(0)	/* use superblock bdev */

#define FUSE_IOMAP_F_NEW		(1U << 0)
#define FUSE_IOMAP_F_DIRTY		(1U << 1)
#define FUSE_IOMAP_F_SHARED		(1U << 2)
#define FUSE_IOMAP_F_MERGED		(1U << 3)
#define FUSE_IOMAP_F_XATTR		(1U << 5)
#define FUSE_IOMAP_F_BOUNDARY		(1U << 6)
#define FUSE_IOMAP_F_ANON_WRITE		(1U << 7)

#define FUSE_IOMAP_F_WANT_IOMAP_END	(1U << 15) /* want ->iomap_end call */

#define FUSE_IOMAP_OP_WRITE		(1 << 0) /* writing, must allocate blocks */
#define FUSE_IOMAP_OP_ZERO		(1 << 1) /* zeroing operation, may skip holes */
#define FUSE_IOMAP_OP_REPORT		(1 << 2) /* report extent status, e.g. FIEMAP */
#define FUSE_IOMAP_OP_FAULT		(1 << 3) /* mapping for page fault */
#define FUSE_IOMAP_OP_DIRECT		(1 << 4) /* direct I/O */
#define FUSE_IOMAP_OP_NOWAIT		(1 << 5) /* do not block */
#define FUSE_IOMAP_OP_OVERWRITE_ONLY	(1 << 6) /* only pure overwrites allowed */
#define FUSE_IOMAP_OP_UNSHARE		(1 << 7) /* unshare_file_range */
#define FUSE_IOMAP_OP_ATOMIC		(1 << 9) /* torn-write protection */
#define FUSE_IOMAP_OP_DONTCACHE		(1 << 10) /* don't retain pagecache */

#define FUSE_IOMAP_NULL_ADDR		(-1ULL)	/* addr is not valid */

struct fuse_iomap_begin_in {
	uint32_t opflags;	/* FUSE_IOMAP_OP_* */
	uint32_t reserved;
	uint64_t ino;		/* matches st_ino provided by getattr/open */
	uint64_t pos;		/* file position, in bytes */
	uint64_t count;		/* operation length, in bytes */
};

struct fuse_iomap_begin_out {
	uint64_t offset;	/* file offset of mapping, bytes */
	uint64_t length;	/* length of both mappings, bytes */

	uint64_t read_addr;	/* disk offset of mapping, bytes */
	uint16_t read_type;	/* FUSE_IOMAP_TYPE_* */
	uint16_t read_flags;	/* FUSE_IOMAP_F_* */
	uint32_t read_dev;	/* FUSE_IOMAP_DEV_* */

	uint64_t write_addr;	/* disk offset of mapping, bytes */
	uint16_t write_type;	/* FUSE_IOMAP_TYPE_* */
	uint16_t write_flags;	/* FUSE_IOMAP_F_* */
	uint32_t write_dev;	/* FUSE_IOMAP_DEV_* */
};
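
To make the begin exchange a bit more concrete, here is a rough
userspace sketch of how a fuse server might fill out the reply.  It is
not code from my prototype: struct toy_extent and toy_iomap_begin() are
made-up names, the "filesystem" is a single contiguous extent, and the
idea that a server leaves the other write fields zeroed for a
PURE_OVERWRITE mapping is my assumption.  A real server would also
allocate blocks when FUSE_IOMAP_OP_WRITE is set instead of reporting a
hole.

```c
#include <stdint.h>
#include <string.h>

/* Values and layouts copied from this message. */
#define FUSE_IOMAP_TYPE_PURE_OVERWRITE	(0xFFFF)
#define FUSE_IOMAP_TYPE_HOLE		0
#define FUSE_IOMAP_TYPE_MAPPED		2
#define FUSE_IOMAP_DEV_SBDEV		(0)
#define FUSE_IOMAP_NULL_ADDR		(-1ULL)

struct fuse_iomap_begin_in {
	uint32_t opflags;
	uint32_t reserved;
	uint64_t ino;
	uint64_t pos;
	uint64_t count;
};

struct fuse_iomap_begin_out {
	uint64_t offset;
	uint64_t length;
	uint64_t read_addr;
	uint16_t read_type;
	uint16_t read_flags;
	uint32_t read_dev;
	uint64_t write_addr;
	uint16_t write_type;
	uint16_t write_flags;
	uint32_t write_dev;
};

/* Hypothetical: the whole file is backed by one contiguous extent. */
struct toy_extent {
	uint64_t file_off;	/* file offset of the extent, bytes */
	uint64_t disk_addr;	/* disk offset of the extent, bytes */
	uint64_t len;		/* extent length, bytes */
};

void toy_iomap_begin(const struct toy_extent *ext,
		     const struct fuse_iomap_begin_in *in,
		     struct fuse_iomap_begin_out *out)
{
	memset(out, 0, sizeof(*out));
	out->offset = in->pos;
	out->read_dev = FUSE_IOMAP_DEV_SBDEV;
	out->write_dev = FUSE_IOMAP_DEV_SBDEV;

	if (in->pos >= ext->file_off && in->pos - ext->file_off < ext->len) {
		/* Inside the extent: trim the mapping to what is left. */
		uint64_t delta = in->pos - ext->file_off;
		uint64_t avail = ext->len - delta;

		out->length = in->count < avail ? in->count : avail;
		out->read_type = FUSE_IOMAP_TYPE_MAPPED;
		out->read_addr = ext->disk_addr + delta;
		/* Writes land where reads come from; reuse the read mapping. */
		out->write_type = FUSE_IOMAP_TYPE_PURE_OVERWRITE;
		return;
	}

	/* Outside the extent: report a hole up to the next mapped byte. */
	out->length = in->count;
	if (in->pos < ext->file_off && ext->file_off - in->pos < in->count)
		out->length = ext->file_off - in->pos;
	out->read_type = FUSE_IOMAP_TYPE_HOLE;
	out->read_addr = FUSE_IOMAP_NULL_ADDR;
	out->write_type = FUSE_IOMAP_TYPE_HOLE;
	out->write_addr = FUSE_IOMAP_NULL_ADDR;
}
```

Note that the reply may cover less than the requested count; the
kernel would then have to ask again for the remainder of the range.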

struct fuse_iomap_end_in {
	uint32_t opflags;	/* FUSE_IOMAP_OP_* */
	uint32_t reserved;
	uint64_t ino;		/* matches st_ino provided to iomap_begin */
	uint64_t pos;		/* file position, in bytes */
	uint64_t count;		/* operation length, in bytes */
	int64_t written;	/* bytes processed */

	uint64_t map_length;	/* length of mapping, bytes */
	uint64_t map_addr;	/* disk offset of mapping, bytes */
	uint16_t map_type;	/* FUSE_IOMAP_TYPE_* */
	uint16_t map_flags;	/* FUSE_IOMAP_F_* */
	uint32_t map_dev;	/* FUSE_IOMAP_DEV_* */
};

/* out of place write extent */
#define FUSE_IOMAP_IOEND_SHARED		(1U << 0)
/* unwritten extent */
#define FUSE_IOMAP_IOEND_UNWRITTEN	(1U << 1)
/* don't merge into previous ioend */
#define FUSE_IOMAP_IOEND_BOUNDARY	(1U << 2)
/* is direct I/O */
#define FUSE_IOMAP_IOEND_DIRECT		(1U << 3)

/* is append ioend */
#define FUSE_IOMAP_IOEND_APPEND		(1U << 15)

struct fuse_iomap_ioend_in {
	uint16_t ioendflags;	/* FUSE_IOMAP_IOEND_* */
	uint16_t reserved;
	int32_t error;		/* negative errno or 0 */
	uint64_t ino;		/* matches st_ino provided to iomap_begin */
	uint64_t pos;		/* file position, in bytes */
	uint64_t addr;		/* disk offset of new mapping, in bytes */
	uint32_t written;	/* bytes processed */
	uint32_t reserved1;
};
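
For the ioend side, a server-side handler could look roughly like the
sketch below.  Again this is only an illustration under assumptions:
struct toy_inode and toy_iomap_ioend() are invented names, and the way
each FUSE_IOMAP_IOEND_* flag is acted upon here is my guess at
reasonable semantics, not something the prototype defines.

```c
#include <stdint.h>

/* Values and layout copied from this message. */
#define FUSE_IOMAP_IOEND_UNWRITTEN	(1U << 1)
#define FUSE_IOMAP_IOEND_APPEND		(1U << 15)

struct fuse_iomap_ioend_in {
	uint16_t ioendflags;
	uint16_t reserved;
	int32_t error;
	uint64_t ino;
	uint64_t pos;
	uint64_t addr;
	uint32_t written;
	uint32_t reserved1;
};

/* Hypothetical in-memory inode state kept by the fuse server. */
struct toy_inode {
	uint64_t isize;		/* ondisk file size, bytes */
	int unwritten;		/* nonzero while the extent is unwritten */
};

/* Returns 0 on success or the negative errno reported by the kernel. */
int toy_iomap_ioend(struct toy_inode *ip,
		    const struct fuse_iomap_ioend_in *in)
{
	uint64_t end = in->pos + in->written;

	/* The IO failed, so leave the mapping metadata alone. */
	if (in->error)
		return in->error;

	/* Data has reached the disk: convert the unwritten extent. */
	if (in->ioendflags & FUSE_IOMAP_IOEND_UNWRITTEN)
		ip->unwritten = 0;

	/* Appending write: push the ondisk size out to cover it. */
	if ((in->ioendflags & FUSE_IOMAP_IOEND_APPEND) && end > ip->isize)
		ip->isize = end;

	return 0;
}
```

An out of place write (FUSE_IOMAP_IOEND_SHARED) would additionally
remap the file range to the new addr; that is left out to keep the
sketch short.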

> > (Especially because I suspect that interleaving is the norm for memory,
> > whereas we try to avoid that for disk filesystems.)
> 
> So interleaved extents are just like normal ones except they repeat,
> right?  What about adding a special "repeat last N extent
> descriptions" type of extent?

Yeah, I suppose a mapping cache could do that.  From talking to John
last week, it sounds like the mappings are supposed to be static for the
life of the file, as opposed to ext* where truncates and fallocate can
appear at any time.

One thing I forgot to ask John -- can there be multiple sets of
interleaved mappings per file?  e.g. the first 32g of a file are split
between 4 memory controllers, whereas the next 64g are split between 4
different domains?
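
To illustrate what I mean by multiple sets, here is a sketch of how a
file position could be resolved if a file were described by a list of
interleave sets.  Everything here is hypothetical (struct
toy_stripe_set, toy_resolve(), round-robin chunk placement); I don't
know whether famfs lays things out this way, which is sort of the
question.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one interleaved region of a file. */
struct toy_stripe_set {
	uint64_t file_off;	/* where the set starts in the file, bytes */
	uint64_t len;		/* bytes of file data covered by the set */
	uint64_t chunk;		/* interleave granularity, bytes */
	uint32_t nr_strips;	/* number of backing strips in the set */
};

/*
 * Find which set, which strip within the set, and what offset into that
 * strip backs file position @pos.  Chunks are assumed to be handed out
 * round-robin across the strips.  Returns 0 on success, -1 if @pos is
 * not covered by any set.
 */
int toy_resolve(const struct toy_stripe_set *sets, size_t nr_sets,
		uint64_t pos, size_t *set_idx, uint32_t *strip,
		uint64_t *strip_off)
{
	for (size_t i = 0; i < nr_sets; i++) {
		const struct toy_stripe_set *s = &sets[i];
		uint64_t rel, chunk_no;

		if (pos < s->file_off || pos - s->file_off >= s->len)
			continue;

		rel = pos - s->file_off;
		chunk_no = rel / s->chunk;

		*set_idx = i;
		*strip = (uint32_t)(chunk_no % s->nr_strips);
		/* Full rounds already placed on this strip, plus remainder. */
		*strip_off = (chunk_no / s->nr_strips) * s->chunk +
			     rel % s->chunk;
		return 0;
	}

	return -1;
}
```

With two entries in the list (32G over four strips, then 64G over four
other strips) the kernel would only need the set descriptions, not one
extent per chunk, which is roughly what a "repeat last N extents"
record would buy us.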

> > > But the current implementation does not contemplate partially cached fmaps.
> > >
> > > Adding notification could address revoking them post-haste (is that why
> > > you're thinking about notifications? And if not can you elaborate on what
> > > you're after there?).
> >
> > Yeah, invalidating the mapping cache at random places.  If, say, you
> > implement a clustered filesystem with iomap, the metadata server could
> > inform the fuse server on the local node that a certain range of inode X
> > has been written to, at which point you need to revoke any local leases,
> > invalidate the pagecache, and invalidate the iomapping cache to force
> > the client to requery the server.
> >
> > Or if your fuse server wants to implement its own weird operations (e.g.
> > XFS EXCHANGE-RANGE) this would make that possible without needing to
> > add a bunch of code to fs/fuse/ for the benefit of a single fuse driver.
> 
> Wouldn't existing invalidation framework be sufficient?

I'm a little confused, are you talking about FUSE_NOTIFY_INVAL_INODE?
If so, then I think that's the wrong layer -- INVAL_INODE invalidates
the page cache, whereas I'm talking about caching the file space
mappings that iomap uses to construct bios for disk IO, and possibly
wanting to invalidate parts of that cache to force the kernel to upcall
the fuse server for a new mapping.
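
Conceptually the thing I'd want to invalidate looks something like the
toy below.  This is a userspace illustration only, with invented names
(struct toy_map_cache, toy_cache_lookup(), toy_cache_invalidate()) and
a fixed-size array standing in for whatever structure the kernel would
really use; nothing like this exists in fs/fuse/ today as far as my
prototype is concerned.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CACHE_SLOTS	8

/* Hypothetical cached file space mapping. */
struct toy_mapping {
	uint64_t offset;	/* file offset of mapping, bytes */
	uint64_t length;	/* length of mapping, bytes */
	uint64_t addr;		/* disk offset of mapping, bytes */
	bool valid;
};

struct toy_map_cache {
	struct toy_mapping slots[TOY_CACHE_SLOTS];
};

/* Return the cached mapping covering @pos, or NULL to force an upcall. */
const struct toy_mapping *toy_cache_lookup(const struct toy_map_cache *c,
					   uint64_t pos)
{
	for (size_t i = 0; i < TOY_CACHE_SLOTS; i++) {
		const struct toy_mapping *m = &c->slots[i];

		if (m->valid && pos >= m->offset && pos - m->offset < m->length)
			return m;
	}
	return NULL;
}

/*
 * Drop every cached mapping that overlaps [pos, pos + count).  Whole
 * entries are dropped rather than trimmed, and offset overflow is
 * ignored, to keep the sketch short.  Returns the number dropped.
 */
unsigned int toy_cache_invalidate(struct toy_map_cache *c, uint64_t pos,
				  uint64_t count)
{
	unsigned int dropped = 0;

	for (size_t i = 0; i < TOY_CACHE_SLOTS; i++) {
		struct toy_mapping *m = &c->slots[i];

		if (!m->valid)
			continue;
		if (m->offset < pos + count && pos < m->offset + m->length) {
			m->valid = false;
			dropped++;
		}
	}
	return dropped;
}
```

The page cache for the same range could still be perfectly good after
such an invalidation; only the next IO to that range would have to go
back to the fuse server for a fresh mapping.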

(Obviously this only applies to fuse servers for ondisk filesystems.)

--D

> Thanks,
> Miklos
> 
