linux-kernel - Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20241105000956.GJ21832@frogsfrogsfrogs>
Date: Mon, 4 Nov 2024 16:09:56 -0800
From: "Darrick J. Wong" <djwong@...nel.org>
To: Dave Chinner <david@...morbit.com>
Cc: Ritesh Harjani <ritesh.list@...il.com>, Theodore Ts'o <tytso@....edu>,
	John Garry <john.g.garry@...cle.com>, linux-ext4@...r.kernel.org,
	Jan Kara <jack@...e.cz>, Christoph Hellwig <hch@...radead.org>,
	Ojaswin Mujoo <ojaswin@...ux.ibm.com>, linux-kernel@...r.kernel.org,
	linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes

On Mon, Nov 04, 2024 at 12:52:42PM +1100, Dave Chinner wrote:
> On Thu, Oct 31, 2024 at 02:36:40PM -0700, Darrick J. Wong wrote:
> > On Sat, Oct 26, 2024 at 10:05:44AM +0530, Ritesh Harjani wrote:
> > > > This gets me to the third and much less general solution -- only allow
> > > > untorn writes if we know that the ioend only ever has to run a single
> > > > transaction.  That's why untorn writes are limited to a single fsblock
> > > > for now -- it's a simple solution so that we can get our downstream
> > > > customers to kick the tires and start on the next iteration instead of
> > > > spending years on waterfalling.
> > > >
> > > > Did you notice that in all of these cases, the capabilities of the
> > > > filesystem's ioend processing determines the restrictions on the number
> > > > and type of mappings that ->iomap_begin can give to iomap?
> > > >
> > > > Now that we have a second system trying to hook up to the iomap support,
> > > > it's clear to me that the restrictions on mappings are specific to each
> > > > filesystem.  Therefore, the iomap directio code should not impose
> > > > restrictions on the mappings it receives unless they would prevent the
> > > > creation of the single aligned bio.
> > > >
> > > > Instead, xfs_direct_write_iomap_begin and ext4_iomap_begin should return
> > > > EINVAL or something if they look at the file mappings and discover that
> > > > they cannot perform the ioend without risking torn mapping updates.  In
> > > > the long run, ->iomap_begin is where this iomap->len <= iter->len check
> > > > really belongs, but hold that thought.
> > > >
> > > > For the multi fsblock case, the ->iomap_begin functions would have to
> > > > check that only one metadata update would be necessary in the ioend.
> > > > That's where things get murky, since ext4/xfs drop their mapping locks
> > > > between calls to ->iomap_begin.  So you'd have to check all the mappings
> > > > for unsupported mixed state every time.  Yuck.
> > > >
> > > 
> > > Thanks Darrick for taking time summarizing what all has been done
> > > and your thoughts here.
> > > 
> > > > It might be less gross to retain the restriction that iomap accepts only
> > > > one mapping for the entire file range, like Ritesh has here.
> > > 
> > > less gross :) sure. 
> > > 
> > > I would like to think of this as, being less restrictive (compared to
> > > only allowing a single fsblock) by adding a constraint on the atomic
> > > write I/O request i.e.  
> > > 
> > > "Atomic write I/O request to a region in a file is only allowed if that
> > > region has no partially allocated extents. Otherwise, the file system
> > > can fail the I/O operation by returning -EINVAL."
> > > 
> > > Essentially by adding this constraint to the I/O request, we are
> > > helping the user to prevent atomic writes from accidentally getting
> > > torned and also allowing multi-fsblock writes. So I still think that
> > > might be the right thing to do here or at least a better start. FS can
> > > later work on adding such support where we don't even need above
> > > such constraint on a given atomic write I/O request.
> > 
> > On today's ext4 call, Ted and Ritesh and I realized that there's a bit
> > more to it than this -- it's not possible to support untorn writes to a
> > mix of written/(cow,unwritten) mappings even if they all point to the
> > same physical space.  If the system fails after the storage device
> > commits the write but before any of the ioend processing is scheduled, a
> > subsequent read of the previously written blocks will produce the new
> > data, but reads to the other areas will produce the old contents (or
> > zeroes, or whatever).  That's a torn write.
> 
> I'm *really* surprised that people are only realising that IO
> completion processing for atomic writes *must be atomic*.

I've been saying for a while; it's just that I didn't realize until last
week that there were more rules than "can't do an untorn write if you
need to do more than 1 mapping update":

Untorn writes are not possible if:
1. More than 1 mapping update is needed
2. 1 mapping update is needed but there's a written block

(1) can be worked around with a log intent item for ioend processing,
but I don't think (2) can at all.  That's why I went back to saying that
untorn writes require that there can only be one mapping.

> This was the foundational constraint that the forced alignment
> proposal for XFS was intended to address. i.e. to prevent fs
> operations from violating atomic write IO constraints (e.g. punching
> sub-atomic write size holes in the file) so that the physical IO can
> be done without tearing and the IO completion processing that
> exposes the new data can be done atomically.
> 
> > Therefore, iomap ought to stick to requiring that ->iomap_begin returns
> > a single iomap to cover the entire file range for the untorn write.  For
> > an unwritten extent, the post-recovery read will see either zeroes or
> > the new contents; for a single-mapping COW it'll see old or new contents
> > but not both.
> 
> I'm pretty sure we enforced that in the XFS mapping implemention for
> atomic writes using forced alignment. i.e.  we have to return a
> correctly aligned, contiguous mapping to iomap or we have to return
> -EINVAL to indicate atomic write mapping failed.

I think you guys did, it's just the ext4 bigalloc thing that started
this back up again. :/

> Yes, we can check this in iomap, but it's really the filesystem that
> has to implement and enforce it...

I think we ought to keep the "1 mapping per untorn write" check in iomap
until someone decides that it's worth the trouble to make a filesystem
handle mixed states correctly.  Mostly as a guard against the
implementations.

--D

> -Dave.
> -- 
> Dave Chinner
> david@...morbit.com
>