[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZSRk9Z6/i2E+YV9A@dread.disaster.area>
Date: Tue, 10 Oct 2023 07:39:17 +1100
From: Dave Chinner <david@...morbit.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: John Garry <john.g.garry@...cle.com>, linux-kernel@...r.kernel.org,
linux-api@...r.kernel.org, martin.petersen@...cle.com,
himanshu.madhani@...cle.com
Subject: Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag
On Mon, Oct 09, 2023 at 10:44:38AM -0700, Darrick J. Wong wrote:
> On Fri, Sep 29, 2023 at 09:37:15AM +0000, John Garry wrote:
> > From: Himanshu Madhani <himanshu.madhani@...cle.com>
> >
> > Add RWF_ATOMIC flag description for pwritev2().
> >
> > Signed-off-by: Himanshu Madhani <himanshu.madhani@...cle.com>
> > #jpg: complete rewrite
> > Signed-off-by: John Garry <john.g.garry@...cle.com>
> > ---
> > man2/readv.2 | 45 +++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 45 insertions(+)
....
> > +For when regular files are opened with
> > +.BR open (2)
> > +but without
> > +.B O_SYNC
> > +or
> > +.B O_DSYNC
> > +and the
> > +.BR pwritev2()
> > +call is made without
> > +.B RWF_SYNC
> > +or
> > +.BR RWF_DSYNC
> > +set, the range metadata must already be flushed to storage and the data range
> > +must not be in unwritten state, shared, a preallocation, or a hole.
>
> I think that we can drop all of these flags requirements, since the
> contiguous small space allocation requirement means that the fs can
> provide all-or-nothing writes even if metadata updates are needed:
>
> If the file range is allocated and marked unwritten (i.e. a
> preallocation), the ioend will clear the unwritten bit from the file
> mapping atomically. After a crash, the application sees either zeroes
> or all the data that was written.
>
> If the file range is shared, the ioend will map the COW staging extent
> into the file atomically. After a crash, the application sees either
> the old contents from the old blocks, or the new contents from the new
> blocks.
>
> If the file range is a sparse hole, the directio setup will allocate
> space and create an unwritten mapping before issuing the write bio. The
> rest of the process works the same as preallocations and has the same
> behaviors.
>
> If the file range is allocated and was previously written, the write is
> issued and that's all that's needed from the fs. After a crash, reads
> of the storage device produce the old contents or the new contents.
This is exactly what I explained when reviewing the code that
rejected RWF_ATOMIC without O_DSYNC on metadata dirty inodes.
> Summarizing:
>
> An (ATOMIC|SYNC) request provides the strongest guarantees (data
> will not be torn, and all file metadata updates are persisted before
> the write is returned to userspace. Programs see either the old data or
> the new data, even if there's a crash.
>
> (ATOMIC|DSYNC) is less strong -- data will not be torn, and any file
> updates for just that region are persisted before the write is returned.
>
> (ATOMIC) is the least strong -- data will not be torn. Neither the
> filesystem nor the device make guarantees that anything ended up on
> stable storage, but if it does, programs see either the old data or the
> new data.
Yup, that makes sense to me.
> Maybe we should rename the whole UAPI s/atomic/untorn/...
Perhaps, though "torn writes" is nomenclature that nobody outside
storage and filesystem developers really knows about. All I ever
hear from userspace developers is "we want atomic/all-or-nothing
data writes"...
-Dave.
--
Dave Chinner
david@...morbit.com
Powered by blists - more mailing lists