Message-ID: <0262fed7-a70f-8782-628f-2e9ded0108f8@oracle.com>
Date:   Tue, 24 Oct 2023 13:30:03 +0100
From:   John Garry <john.g.garry@...cle.com>
To:     "Darrick J. Wong" <djwong@...nel.org>
Cc:     linux-kernel@...r.kernel.org, linux-api@...r.kernel.org,
        martin.petersen@...cle.com, david@...morbit.com,
        himanshu.madhani@...cle.com
Subject: Re: [PATCH 2/4] readv.2: Document RWF_ATOMIC flag

On 09/10/2023 18:44, Darrick J. Wong wrote:
> On Fri, Sep 29, 2023 at 09:37:15AM +0000, John Garry wrote:
>> From: Himanshu Madhani <himanshu.madhani@...cle.com>
>>
>> Add RWF_ATOMIC flag description for pwritev2().
>>
>> Signed-off-by: Himanshu Madhani <himanshu.madhani@...cle.com>
>> #jpg: complete rewrite
>> Signed-off-by: John Garry <john.g.garry@...cle.com>
>> ---
>>   man2/readv.2 | 45 +++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 45 insertions(+)
>>
>> diff --git a/man2/readv.2 b/man2/readv.2
>> index fa9b0e4e44a2..ff09f3bc9792 100644
>> --- a/man2/readv.2
>> +++ b/man2/readv.2
>> @@ -193,6 +193,51 @@ which provides lower latency, but may use additional resources.
>>   .B O_DIRECT
>>   flag.)
>>   .TP
>> +.BR RWF_ATOMIC " (since Linux 6.7)"
>> +Allows block-based filesystems to indicate that write operations will be issued
> 
> "Require regular file write operations to be issued with torn write
> protection."

ok

> 
>> +with torn-write protection. Torn-write protection means that for a power or any
>> +other hardware failure, all or none of the data from the write will be stored,
>> +but never a mix of old and new data. This flag is meaningful only for
>> +.BR pwritev2 (),
>> +and its effect applies only to the data range written by the system call.
>> +The total write length must be power-of-2 and must be sized between
>> +stx_atomic_write_unit_min and stx_atomic_write_unit_max, both inclusive. The
>> +write must be at a natural offset within the file with respect to the total
> 
> What is a "natural" offset?

I really meant naturally-aligned offset

>  That should be defined with more
> specificity.  Does that mean that the position of a XX-KiB write must
> also be aligned to XX-KiB?

Yes

>  e.g. a 32K untorn write can only start at a
> multiple of 32K? 

Correct

> What if the device supports untorn writes between 4K
> and 64K, does that mean I /cannot/ issue a 32K untorn write at offset
> 48K?

Correct

Do you think that an example would help?
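
For instance, the length and alignment rules discussed above could be
sketched as a small userspace check. This is only an illustration, not part
of the patch: untorn_write_ok() is a hypothetical helper, and unit_min and
unit_max stand in for the stx_atomic_write_unit_min and
stx_atomic_write_unit_max values an application would obtain from statx().

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the patch): returns true if a write of
 * 'len' bytes at file position 'offset' satisfies the rules described in
 * this thread. 'unit_min' and 'unit_max' are assumed to hold the
 * stx_atomic_write_unit_min/max values reported for the file.
 */
static bool untorn_write_ok(uint64_t offset, uint64_t len,
                            uint64_t unit_min, uint64_t unit_max)
{
    /* The total write length must be a non-zero power of two. */
    if (len == 0 || (len & (len - 1)) != 0)
        return false;

    /* The length must lie between the two limits, both inclusive. */
    if (len < unit_min || len > unit_max)
        return false;

    /* Naturally aligned: the offset must be a multiple of the length. */
    return (offset & (len - 1)) == 0;
}
```

With limits of 4K and 64K, a 32K write at offset 64K passes this check,
while a 32K write at offset 48K fails it because 48K is not a multiple of
32K.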

> 
>> +write length. Torn-write protection only works with
>> +.B O_DIRECT
>> +flag, i.e. buffered writes are not supported. To guarantee consistency from
>> +the write between a file's in-core state with the storage device,
>> +.BR fdatasync (2)
>> +or
>> +.BR fsync (2)
>> +or
>> +.BR open (2)
>> +and
>> +.B O_SYNC
>> +or
>> +.B O_DSYNC
>> +or
>> +.B pwritev2 ()
>> +flag
>> +.B RWF_SYNC
>> +or
>> +.B RWF_DSYNC
>> +is required.
> 
> I'm starting to think that this manpage shouldn't be restating
> durability information here.
> 
> "Application programs with data or file integrity completion
> requirements must configure synchronous writes with the DSYNC
> or SYNC flags, as explained above."

ok

> 
>> +For when regular files are opened with
>> +.BR open (2)
>> +but without
>> +.B O_SYNC
>> +or
>> +.B O_DSYNC
>> +and the
>> +.BR pwritev2()
>> +call is made without
>> +.B RWF_SYNC
>> +or
>> +.BR RWF_DSYNC
>> +set, the range metadata must already be flushed to storage and the data range
>> +must not be in unwritten state, shared, a preallocation, or a hole.
> 
> I think that we can drop all of these flags requirements, since the
> contiguous small space allocation requirement means that the fs can
> provide all-or-nothing writes even if metadata updates are needed:
> 
> If the file range is allocated and marked unwritten (i.e. a
> preallocation), the ioend will clear the unwritten bit from the file
> mapping atomically.  After a crash, the application sees either zeroes
> or all the data that was written.
> 
> If the file range is shared, the ioend will map the COW staging extent
> into the file atomically.  After a crash, the application sees either
> the old contents from the old blocks, or the new contents from the new
> blocks.
> 
> If the file range is a sparse hole, the directio setup will allocate
> space and create an unwritten mapping before issuing the write bio.  The
> rest of the process works the same as preallocations and has the same
> behaviors.
> 
> If the file range is allocated and was previously written, the write is
> issued and that's all that's needed from the fs.  After a crash, reads
> of the storage device produce the old contents or the new contents.
> 
> Summarizing:
> 
> An (ATOMIC|SYNC) request provides the strongest guarantees: data
> will not be torn, and all file metadata updates are persisted before
> the write is returned to userspace.  Programs see either the old data or
> the new data, even if there's a crash.
> 
> (ATOMIC|DSYNC) is less strong -- data will not be torn, and any file
> updates for just that region are persisted before the write is returned.
> 
> (ATOMIC) is the least strong -- data will not be torn.  Neither the
> filesystem nor the device make guarantees that anything ended up on
> stable storage, but if it does, programs see either the old data or the
> new data.
> 
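
To make the three levels in the summary above concrete, here is a rough
sketch of how an application might pick the pwritev2() flags for each one.
The enum and untorn_flags() are hypothetical names used only for
illustration. The numeric value given for RWF_ATOMIC is an assumption for
systems whose headers do not define it yet and must be checked against the
kernel's <linux/fs.h>.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/uio.h>

/* Fallbacks for C libraries that do not expose the per-write flags. */
#ifndef RWF_DSYNC
#define RWF_DSYNC 0x00000002
#endif
#ifndef RWF_SYNC
#define RWF_SYNC 0x00000004
#endif
/* ASSUMPTION: placeholder value; verify against your kernel headers. */
#ifndef RWF_ATOMIC
#define RWF_ATOMIC 0x00000040
#endif

/* Hypothetical names for the three guarantee levels in the summary. */
enum untorn_durability {
    UNTORN_ONLY,   /* data will not be torn, no durability guarantee */
    UNTORN_DSYNC,  /* untorn, updates for the written region persisted */
    UNTORN_SYNC    /* untorn, all file metadata updates persisted */
};

/* Map a guarantee level to the flags argument of pwritev2(). */
static int untorn_flags(enum untorn_durability level)
{
    switch (level) {
    case UNTORN_SYNC:
        return RWF_ATOMIC | RWF_SYNC;
    case UNTORN_DSYNC:
        return RWF_ATOMIC | RWF_DSYNC;
    case UNTORN_ONLY:
    default:
        return RWF_ATOMIC;
    }
}
```

A caller would then pass untorn_flags(UNTORN_DSYNC) as the last argument of
pwritev2() on a file descriptor opened with O_DIRECT.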


Will respond to later mail in thread.

> Maybe we should rename the whole UAPI s/atomic/untorn/...
> --D
> 
>> +.TP
>>   .BR RWF_SYNC " (since Linux 4.7)"
>>   .\" commit e864f39569f4092c2b2bc72c773b6e486c7e3bd9
>>   Provide a per-write equivalent of the
>> -- 
>> 2.31.1
>>

Thanks,
John