linux-kernel - Re: [PATCH 1/4] iomap: Lift blocksize restriction on atomic writes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <3ab6000e-030d-435a-88c3-9026171ae9f1@oracle.com>
Date: Thu, 5 Dec 2024 10:52:50 +0000
From: John Garry <john.g.garry@...cle.com>
To: Dave Chinner <david@...morbit.com>
Cc: brauner@...nel.org, djwong@...nel.org, cem@...nel.org, dchinner@...hat.com,
        hch@....de, ritesh.list@...il.com, linux-xfs@...r.kernel.org,
        linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
        martin.petersen@...cle.com
Subject: Re: [PATCH 1/4] iomap: Lift blocksize restriction on atomic writes

On 04/12/2024 20:35, Dave Chinner wrote:
> On Wed, Dec 04, 2024 at 03:43:41PM +0000, John Garry wrote:
>> From: "Ritesh Harjani (IBM)" <ritesh.list@...il.com>
>>
>> Filesystems like ext4 can submit writes in multiples of blocksizes.
>> But we still can't allow the writes to be split into multiple BIOs. Hence
>> let's check if the iomap_length() is same as iter->len or not.
>>
>> It is the responsibility of userspace to ensure that a write does not span
>> mixed unwritten and mapped extents (which would lead to multiple BIOs).
> 
> How is "userspace" supposed to do this?

If an atomic write spans mixed unwritten and mapped extents, then it 
should manually zero the unwritten extents beforehand.

> 
> No existing utility in userspace is aware of atomic write limits or
> rtextsize configs, so how does "userspace" ensure everything is
> laid out in a manner compatible with atomic writes?
> 
> e.g. restoring a backup (or other disaster recovery procedures) is
> going to have to lay the files out correctly for atomic writes.
> backup tools often sparsify the data set and so what gets restored
> will not have the same layout as the original data set...

I am happy to support whatever is needed to make atomic writes work over
mixed extents if that is really an expected use case and it is a pain 
for an application writer/admin to deal with this (by manually zeroing 
extents).

JFYI, I did originally support the extent pre-zeroing for this. That was 
to support a real-life scenario which we saw where we were attempting 
atomic writes over mixed extents. The mixed extents were coming from 
userspace punching holes and then attempting an atomic write over that 
space. However that was using an early experimental and buggy 
forcealign; it was buggy as it did not handle punching holes properly - 
it punched out single blocks and not only full alloc units.

> 
> Where's the documentation that outlines all the restrictions on
> userspace behaviour to prevent this sort of problem being triggered?

I would provide a man page update.

> Common operations such as truncate, hole punch,

So how would punch hole be a problem? The atomic write unit max is 
limited by the alloc unit, and we can only punch out full alloc units.

> buffered writes,
> reflinks, etc will trip over this, so application developers, users
> and admins really need to know what they should be doing to avoid
> stepping on this landmine...

If this is not a real-life scenario which we expect to see, then I don't 
see why we would add the complexity to the kernel for this.

My motivation for atomic writes support is to support atomically writing 
large database internal page size. If the database only writes at a 
fixed internal page size, then we should not see mixed mappings.

But you see potential problems elsewhere ..

> 
> Further to that, what is the correction process for users to get rid
> of this error? They'll need some help from an atomic write
> constraint aware utility that can resilver the file such that these
> failures do not occur again. Can xfs_fsr do this? Or maybe the new
> exchangr-range code? Or does none of this infrastructure yet exist?

Nothing exists yet.

> 
>> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@...il.com>
>> jpg: tweak commit message
>> Signed-off-by: John Garry <john.g.garry@...cle.com>
>> ---
>>   fs/iomap/direct-io.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index b521eb15759e..3dd883dd77d2 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -306,7 +306,7 @@ static loff_t iomap_dio_bio_iter(const struct iomap_iter *iter,
>>   	size_t copied = 0;
>>   	size_t orig_count;
>>   
>> -	if (atomic && length != fs_block_size)
>> +	if (atomic && length != iter->len)
>>   		return -EINVAL;
> 
> Given this is now rejecting IOs that are otherwise well formed from
> userspace, this situation needs to have a different errno now. The
> userspace application has not provided an invalid set of
> IO parameters for this IO - the IO has been rejected because
> the previously set up persistent file layout was screwed up by
> something in the past.
> 
> i.e. This is not an application IO submission error anymore, hence
> EINVAL is the wrong error to be returning to userspace here.
> 

Understood, but let's come back to this (if needed..).

Thanks,
John