linux-kernel - Re: [PATCH v4 09/12] xfs: Add xfs_file_dio_write

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <5e6795b1-305c-40a0-84d0-43dfb4ee6cd7@oracle.com>
Date: Mon, 10 Mar 2025 15:24:23 +0000
From: John Garry <john.g.garry@...cle.com>
To: "Ritesh Harjani (IBM)" <ritesh.list@...il.com>, brauner@...nel.org,
        djwong@...nel.org, cem@...nel.org
Cc: linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        linux-kernel@...r.kernel.org, ojaswin@...ux.ibm.com,
        martin.petersen@...cle.com, tytso@....edu, linux-ext4@...r.kernel.org
Subject: Re: [PATCH v4 09/12] xfs: Add xfs_file_dio_write_atomic()

On 10/03/2025 13:39, Ritesh Harjani (IBM) wrote:
> John Garry <john.g.garry@...cle.com> writes:
> 
>> Add xfs_file_dio_write_atomic() for dedicated handling of atomic writes.
>>
>> In case of -EAGAIN being returned from iomap_dio_rw(), reissue the write
>> in CoW-based atomic write mode.
>>
>> For CoW-based mode, ensure that we have no outstanding IOs which we
>> may trample on.
>>
>> Reviewed-by: "Darrick J. Wong" <djwong@...nel.org>
>> Signed-off-by: John Garry <john.g.garry@...cle.com>
>> ---
>>   fs/xfs/xfs_file.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 42 insertions(+)
>>
>> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
>> index 51b4a43d15f3..70eb6928cf63 100644
>> --- a/fs/xfs/xfs_file.c
>> +++ b/fs/xfs/xfs_file.c
>> @@ -619,6 +619,46 @@ xfs_file_dio_write_aligned(
>>   	return ret;
>>   }
>>   
>> +static noinline ssize_t
>> +xfs_file_dio_write_atomic(
>> +	struct xfs_inode	*ip,
>> +	struct kiocb		*iocb,
>> +	struct iov_iter		*from)
>> +{
>> +	unsigned int		iolock = XFS_IOLOCK_SHARED;
>> +	unsigned int		dio_flags = 0;
>> +	ssize_t			ret;
>> +
>> +retry:
>> +	ret = xfs_ilock_iocb_for_write(iocb, &iolock);
>> +	if (ret)
>> +		return ret;
>> +
>> +	ret = xfs_file_write_checks(iocb, from, &iolock);
>> +	if (ret)
>> +		goto out_unlock;
>> +
>> +	if (dio_flags & IOMAP_DIO_FORCE_WAIT)
>> +		inode_dio_wait(VFS_I(ip));
>> +
>> +	trace_xfs_file_direct_write(iocb, from);
>> +	ret = iomap_dio_rw(iocb, from, &xfs_atomic_write_iomap_ops,
>> +			&xfs_dio_write_ops, dio_flags, NULL, 0);
>> +
>> +	if (ret == -EAGAIN && !(iocb->ki_flags & IOCB_NOWAIT) &&
>> +	    !(dio_flags & IOMAP_DIO_ATOMIC_SW)) {
>> +		xfs_iunlock(ip, iolock);
>> +		dio_flags = IOMAP_DIO_ATOMIC_SW | IOMAP_DIO_FORCE_WAIT;
>> +		iolock = XFS_IOLOCK_EXCL;
>> +		goto retry;
>> +	}
> 
> IIUC typically filesystems can now implement support for IOMAP_ATOMIC_SW
> as a fallback mechanism, by returning -EAGAIN error during
> IOMAP_ATOMIC_HW handling from their ->iomap_begin() routine.  They can
> then retry the entire DIO operation of iomap_dio_rw() by passing
> IOMAP_DIO_ATOMIC_SW flag in their dio_flags argument and handle
> IOMAP_ATOMIC_SW fallback differently in their ->iomap_begin() routine.
> 
> However, -EAGAIN can also be returned when there is a race with mmap
> writes for the same range. We return the same -EAGAIN error during page
> cache invalidation failure for IOCB_ATOMIC writes too.  However, current
> code does not differentiate between these two types of failures. Therefore,
> we always retry by falling back to SW CoW based atomic write even for
> page cache invalidation failures.
> 
> __iomap_dio_rw()
> {
> <...>
> 		/*
> 		 * Try to invalidate cache pages for the range we are writing.
> 		 * If this invalidation fails, let the caller fall back to
> 		 * buffered I/O.
> 		 */
> 		ret = kiocb_invalidate_pages(iocb, iomi.len);
> 		if (ret) {
> 			if (ret != -EAGAIN) {
> 				trace_iomap_dio_invalidate_fail(inode, iomi.pos,
> 								iomi.len);
> 				if (iocb->ki_flags & IOCB_ATOMIC) {
> 					/*
> 					 * folio invalidation failed, maybe
> 					 * this is transient, unlock and see if
> 					 * the caller tries again.
> 					 */
> 					ret = -EAGAIN;
> 				} else {
> 					/* fall back to buffered write */
> 					ret = -ENOTBLK;
> 				}
> 			}
> 			goto out_free_dio;
> 		}
> <...>
> }
> 
> It's easy to miss such error handling conditions. If this is something
> which was already discussed earlier, then perhaps it is better if
> document this.  BTW - Is this something that we already know of and has
> been kept as such intentionally?
> 

On mainline, for kiocb_invalidate_pages() error for IOCB_ATOMIC, we 
always return -EAGAIN to userspace.

Now if we have any kiocb_invalidate_pages() error for IOCB_ATOMIC, we 
retry with SW CoW mode - and if it fails again, we return -EAGAIN to 
userspace.

If we choose some other error code to trigger the SW-based COW retry (so 
that we don't always retry for kiocb_invalidate_pages() error when 
!IOMAP_DIO_ATOMIC_HW), then kiocb_invalidate_pages() could still return 
that same error code and we still retry in SW-based COW mode - is that 
better? Or do we need to choose some error code which 
kiocb_invalidate_pages() would never return?

Note that -EAGAIN is used by xfs_file_dio_unwrite_unaligned(), so would 
be nice to use the same error code.

Thanks,
John