linux-kernel - Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <Zygo6nqOJMoJxYrm@dread.disaster.area>
Date: Mon, 4 Nov 2024 12:52:42 +1100
From: Dave Chinner <david@...morbit.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: Ritesh Harjani <ritesh.list@...il.com>, Theodore Ts'o <tytso@....edu>,
	John Garry <john.g.garry@...cle.com>, linux-ext4@...r.kernel.org,
	Jan Kara <jack@...e.cz>, Christoph Hellwig <hch@...radead.org>,
	Ojaswin Mujoo <ojaswin@...ux.ibm.com>, linux-kernel@...r.kernel.org,
	linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH 5/6] iomap: Lift blocksize restriction on atomic writes

On Thu, Oct 31, 2024 at 02:36:40PM -0700, Darrick J. Wong wrote:
> On Sat, Oct 26, 2024 at 10:05:44AM +0530, Ritesh Harjani wrote:
> > > This gets me to the third and much less general solution -- only allow
> > > untorn writes if we know that the ioend only ever has to run a single
> > > transaction.  That's why untorn writes are limited to a single fsblock
> > > for now -- it's a simple solution so that we can get our downstream
> > > customers to kick the tires and start on the next iteration instead of
> > > spending years on waterfalling.
> > >
> > > Did you notice that in all of these cases, the capabilities of the
> > > filesystem's ioend processing determines the restrictions on the number
> > > and type of mappings that ->iomap_begin can give to iomap?
> > >
> > > Now that we have a second system trying to hook up to the iomap support,
> > > it's clear to me that the restrictions on mappings are specific to each
> > > filesystem.  Therefore, the iomap directio code should not impose
> > > restrictions on the mappings it receives unless they would prevent the
> > > creation of the single aligned bio.
> > >
> > > Instead, xfs_direct_write_iomap_begin and ext4_iomap_begin should return
> > > EINVAL or something if they look at the file mappings and discover that
> > > they cannot perform the ioend without risking torn mapping updates.  In
> > > the long run, ->iomap_begin is where this iomap->len <= iter->len check
> > > really belongs, but hold that thought.
> > >
> > > For the multi fsblock case, the ->iomap_begin functions would have to
> > > check that only one metadata update would be necessary in the ioend.
> > > That's where things get murky, since ext4/xfs drop their mapping locks
> > > between calls to ->iomap_begin.  So you'd have to check all the mappings
> > > for unsupported mixed state every time.  Yuck.
> > >
> > 
> > Thanks Darrick for taking time summarizing what all has been done
> > and your thoughts here.
> > 
> > > It might be less gross to retain the restriction that iomap accepts only
> > > one mapping for the entire file range, like Ritesh has here.
> > 
> > less gross :) sure. 
> > 
> > I would like to think of this as, being less restrictive (compared to
> > only allowing a single fsblock) by adding a constraint on the atomic
> > write I/O request i.e.  
> > 
> > "Atomic write I/O request to a region in a file is only allowed if that
> > region has no partially allocated extents. Otherwise, the file system
> > can fail the I/O operation by returning -EINVAL."
> > 
> > Essentially by adding this constraint to the I/O request, we are
> > helping the user to prevent atomic writes from accidentally getting
> > torned and also allowing multi-fsblock writes. So I still think that
> > might be the right thing to do here or at least a better start. FS can
> > later work on adding such support where we don't even need above
> > such constraint on a given atomic write I/O request.
> 
> On today's ext4 call, Ted and Ritesh and I realized that there's a bit
> more to it than this -- it's not possible to support untorn writes to a
> mix of written/(cow,unwritten) mappings even if they all point to the
> same physical space.  If the system fails after the storage device
> commits the write but before any of the ioend processing is scheduled, a
> subsequent read of the previously written blocks will produce the new
> data, but reads to the other areas will produce the old contents (or
> zeroes, or whatever).  That's a torn write.

I'm *really* surprised that people are only realising that IO
completion processing for atomic writes *must be atomic*.

This was the foundational constraint that the forced alignment
proposal for XFS was intended to address. i.e. to prevent fs
operations from violating atomic write IO constraints (e.g. punching
sub-atomic write size holes in the file) so that the physical IO can
be done without tearing and the IO completion processing that
exposes the new data can be done atomically.

> Therefore, iomap ought to stick to requiring that ->iomap_begin returns
> a single iomap to cover the entire file range for the untorn write.  For
> an unwritten extent, the post-recovery read will see either zeroes or
> the new contents; for a single-mapping COW it'll see old or new contents
> but not both.

I'm pretty sure we enforced that in the XFS mapping implemention for
atomic writes using forced alignment. i.e.  we have to return a
correctly aligned, contiguous mapping to iomap or we have to return
-EINVAL to indicate atomic write mapping failed.

Yes, we can check this in iomap, but it's really the filesystem that
has to implement and enforce it...

-Dave.
-- 
Dave Chinner
david@...morbit.com