Message-ID: <87h68u79ij.fsf@gmail.com>
Date: Wed, 30 Oct 2024 05:21:48 +0530
From: Ritesh Harjani (IBM) <ritesh.list@...il.com>
To: Dave Chinner <david@...morbit.com>
Cc: linux-ext4@...r.kernel.org, Theodore Ts'o <tytso@....edu>, Jan Kara <jack@...e.cz>, "Darrick J . Wong" <djwong@...nel.org>, Christoph Hellwig <hch@...radead.org>, John Garry <john.g.garry@...cle.com>, Ojaswin Mujoo <ojaswin@...ux.ibm.com>, linux-kernel@...r.kernel.org, linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org
Subject: Re: [PATCH 4/6] ext4: Warn if we ever fallback to buffered-io for DIO atomic writes


Hi Dave, 

Dave Chinner <david@...morbit.com> writes:

> On Mon, Oct 28, 2024 at 11:44:00PM +0530, Ritesh Harjani wrote:
>> 
>> Hi Dave, 
>> 
>> Dave Chinner <david@...morbit.com> writes:
>> 
>> > On Mon, Oct 28, 2024 at 06:39:36AM +0530, Ritesh Harjani wrote:
>> >> 
>> >> Hi Dave, 
>> >> 
>> >> Dave Chinner <david@...morbit.com> writes:
>> >> 
>> >> > On Fri, Oct 25, 2024 at 09:15:53AM +0530, Ritesh Harjani (IBM) wrote:
>> >> >> iomap will not return -ENOTBLK in case of dio atomic writes. But let's
>> >> >> also add a WARN_ON_ONCE and return -EIO as a safety net.
>> >> >> 
>> >> >> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@...il.com>
>> >> >> ---
>> >> >>  fs/ext4/file.c | 10 +++++++++-
>> >> >>  1 file changed, 9 insertions(+), 1 deletion(-)
>> >> >> 
>> >> >> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> >> >> index f9516121a036..af6ebd0ac0d6 100644
>> >> >> --- a/fs/ext4/file.c
>> >> >> +++ b/fs/ext4/file.c
>> >> >> @@ -576,8 +576,16 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> >> >>  		iomap_ops = &ext4_iomap_overwrite_ops;
>> >> >>  	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
>> >> >>  			   dio_flags, NULL, 0);
>> >> >> -	if (ret == -ENOTBLK)
>> >> >> +	if (ret == -ENOTBLK) {
>> >> >>  		ret = 0;
>> >> >> +		/*
>> >> >> +		 * iomap will never return -ENOTBLK if write fails for atomic
>> >> >> +		 * write. But let's just add a safety net.
>> >> >> +		 */
>> >> >> +		if (WARN_ON_ONCE(iocb->ki_flags & IOCB_ATOMIC))
>> >> >> +			ret = -EIO;
>> >> >> +	}
>> >> >
>> >> > Why can't the iomap code return EIO in this case for IOCB_ATOMIC?
>> >> > That way we don't have to put this logic into every filesystem.
>> >> 
>> >> This was originally intended as a safety net hence the WARN_ON_ONCE.
>> >> Later Darrick pointed out that we still might have an unconverted
>> >> condition in iomap which can return ENOTBLK for DIO atomic writes (page
>> >> cache invalidation).
>> >
>> > Yes. That's my point - iomap knows that it's an atomic write, it
>> > knows that invalidation failed, and it knows that there is no such
>> > thing as buffered atomic writes. So there is no possible fallback
>> > here, and it should be returning EIO in the page cache invalidation
>> > failure case and not ENOTBLK.
>> >
>> 
>> So the iomap DIO can return following as return values which can make
>> some filesystems fallback to buffered-io (if they implement fallback
>> logic) - 
>> (1) -ENOTBLK -> this is only returned for pagecache invalidation failure.
>> (2) 0 or partial write size -> This can never happen for atomic writes
>> (since we are only allowing for single fsblock as of now).
>
> Even when we allow multi-FSB atomic writes, the definition of
> atomic write is still "all or nothing". There is no scope for "short
> writes" when IOCB_ATOMIC is set - any condition that means we can't
> write the entire IO as a single bio, we need to abort and return
> EINVAL.

Yes. As long as it is a single bio, I agree that even the short write
condition should not be hit with the current iomap code.
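
The "all or nothing" rule discussed above can be sketched as a small user-space model. This is not kernel code: `model_atomic_submit`, its parameters, and the idea of passing in how many bytes fit into one bio are assumptions made purely for illustration. The sketch only shows that an atomic request either goes out whole or fails with -EINVAL, with no short result in between.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical user-space model, not kernel code.
 * `len` is the requested write size; `one_bio_max` is how many bytes
 * could be issued as a single bio (an assumed input for this sketch).
 */
static long model_atomic_submit(bool atomic, size_t len, size_t one_bio_max)
{
	assert(len > 0);

	if (len <= one_bio_max)
		return (long)len;	/* whole request fits: full write */
	if (atomic)
		return -EINVAL;		/* all or nothing: no short write */
	return (long)one_bio_max;	/* non-atomic: short write allowed */
}
```

With these assumptions, `model_atomic_submit(true, 8192, 4096)` fails with -EINVAL, while the same call with `atomic` set to false reports a short write of 4096 bytes.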

>
> Hence -ENOTBLK should never be returned by iomap for atomic DIO
> writes - we need to say -EINVAL if the write could not be issued
> atomically for whatever reason it may be so the application knows
> that atomic IO submission was not possible for that IO.
>

Agreed, Dave. That is what the iomap atomic write code does today.
(Except for one minor difference: it returns -EAGAIN when page cache
invalidation fails, on the assumption that the failure may be transient
and the request could be retried.)
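
To make the error-code choice above concrete, here is a minimal user-space sketch of it. The function name and the `MODEL_IOCB_ATOMIC` value are hypothetical stand-ins, not the kernel's definitions. It models only the behaviour described in this thread: on invalidation failure a normal DIO write gets -ENOTBLK (the fallback request), while an atomic DIO write gets -EAGAIN.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for IOCB_ATOMIC; the value is illustrative only. */
#define MODEL_IOCB_ATOMIC 0x1u

/* The caller must be able to tell "retry" apart from "fall back". */
static_assert(EAGAIN != ENOTBLK, "error codes must be distinguishable");

/*
 * Model of the error chosen when page cache invalidation fails before
 * a direct I/O write. Returns 0 when invalidation succeeded.
 */
static int model_invalidate_result(unsigned int ki_flags, bool invalidation_failed)
{
	if (!invalidation_failed)
		return 0;
	/* Atomic writes have no buffered fallback, so do not ask for one. */
	if (ki_flags & MODEL_IOCB_ATOMIC)
		return -EAGAIN;
	/* "Magic" code asking the filesystem to fall back to buffered I/O. */
	return -ENOTBLK;
}
```

Replacing -EAGAIN with -EINVAL in the atomic branch would give the behaviour Dave argues for; the sketch takes no position on which is right.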


	
>> Now looking at XFS, it never falls back to buffered-io except in just 2
>> cases - 
>> 1. When pagecache invalidation fails in iomap (can never happen for
>> atomic writes)
>
> Why can't this happen for atomic DIO writes?  It's the same failure
> cases as for normal DIO writes, isn't it? (i.e. race with mmap
> writes)
>

I meant that after John's patch, which adds atomic write support to the
iomap code, iomap makes sure it does not return -ENOTBLK for an atomic
write request.


> My point is that if it's an atomic write, this failure should get
> turned into -EINVAL by the iomap code. We do not want a fallback to
> buffered IO when this situation happens for atomic IO.
>
>> 2. On unaligned DIO writes to reflinked CoW (not possible for atomic writes)
>
> This path doesn't ever go through iomap - XFS catches that case
> before it calls into iomap, so it's not relevant to how iomap
> behaves w.r.t atomic IO.
>

Right.

>> So it anyways should never happen that XFS ever fallback to buffered-io
>> for DIO atomic writes. Even today it does not fallback to buffered-io
>> for non-atomic short DIO writes.
>> 
>> >> You pointed it right that it should be fixed in iomap. However do you
>> >> think filesystems can still keep this as safety net (maybe no need of
>> >> WARN_ON_ONCE).
>> >
>> > I don't see any point in adding "impossible to hit" checks into
>> > filesystems just in case some core infrastructure has a bug
>> > introduced....
>> 
>> Yes, that is true for XFS. EXT4 however can return -ENOTBLK for short
>> writes, though it should not happen for current atomic write case where
>> we are only allowing for 1 fsblock. 
>
> Yes, but the -ENOTBLK error returned from ext4_iomap_end() if
> nothing was written does not get returned to ext4 from
> __iomap_dio_rw(). It is consumed by the iomap code:
>
> 	/* magic error code to fall back to buffered I/O */
> 	if (ret == -ENOTBLK) {
> 		wait_for_completion = true;
> 		ret = 0;
> 	}
>
> This means that all the IO that was issued gets completed before
> returning to the caller and that's how the short write comes about.
>
> -ENOTBLK is *not returned to the caller* on a short write -

Yes. That is also my understanding of how iomap handles the short write
case.

> iomap_dio_rw will return 0 (success).  The caller then has to look
> at the iov_iter state to determine if the write was fully completed.
> This is exactly what the ext4 code currently does for all DIO
> writes, not just those that return -ENOTBLK.
>

Yes, agreed.
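
The short write handling described above can be sketched in user space as follows. Everything here is a hypothetical model: `struct model_iter` is a drastically simplified stand-in for `struct iov_iter`, and `model_dio_rw` is not `iomap_dio_rw`. The sketch shows only the two points made in the thread: the -ENOTBLK from the end hook is consumed inside the DIO core, and the caller detects an incomplete write by looking at what is left in the iterator.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct iov_iter: bytes still to write. */
struct model_iter {
	size_t count;
};

/*
 * Hypothetical model of the DIO core. `can_write` is how many bytes the
 * device accepts in this sketch. If nothing was written, the filesystem's
 * end hook reports -ENOTBLK, which is consumed here and never returned.
 */
static long model_dio_rw(struct model_iter *iter, size_t can_write)
{
	size_t done;
	int ret;

	assert(iter != NULL);
	done = can_write < iter->count ? can_write : iter->count;
	iter->count -= done;

	ret = (done == 0) ? -ENOTBLK : 0;	/* end hook result */
	if (ret == -ENOTBLK)
		ret = 0;			/* swallowed, as in the quoted snippet */

	return ret < 0 ? ret : (long)done;
}

/* Caller-side check: anything left in the iterator means a short write. */
static bool model_write_incomplete(long ret, const struct model_iter *iter)
{
	return ret >= 0 && iter->count > 0;
}
```

In this model a write that makes no progress returns 0, not -ENOTBLK, and `model_write_incomplete()` is what tells the caller that a fallback would be needed.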

>> I would still like to go with a WARN_ON_ONCE where we are calling ext4
>> buffered-io handling for DIO fallback writes. This is to catch any bugs
>> even in future when we move to multi-fsblock case (until we have atomic
>> write support for buffered-io).
>
> Your choice, but please realise that it is not going to catch short
> atomic writes at all.
>

Thanks Dave. Yes, I would still like to keep a WARN_ON_ONCE, since ext4
has fallback handling where a short DIO write or the -ENOTBLK case could
later be handled by the buffered-io path (though I agree iomap won't let
that happen for the atomic write case).

A WARN_ON_ONCE just before the buffered-io fallback handling in the
ext4 DIO path would be my preferred choice, only to make sure we
catch any unwanted bugs in the future too.

So I was thinking of this change instead - 


diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 8116bd78910b..61787a37e9d4 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -599,6 +599,13 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
                ssize_t err;
                loff_t endbyte;

+               /*
+                * There is no support for atomic writes on buffered-io yet,
+                * we should never fallback to buffered-io for DIO atomic
+                * writes.
+                */
+               WARN_ON_ONCE(iocb->ki_flags & IOCB_ATOMIC);
+
                offset = iocb->ki_pos;
                err = ext4_buffered_write_iter(iocb, from);
                if (err < 0)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index fcdee27b9aa2..26b3c84d7f64 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3449,12 +3449,16 @@ static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
 {
        /*
         * Check to see whether an error occurred while writing out the data to
-        * the allocated blocks. If so, return the magic error code so that we
-        * fallback to buffered I/O and attempt to complete the remainder of
-        * the I/O. Any blocks that may have been allocated in preparation for
-        * the direct I/O will be reused during buffered I/O.
+        * the allocated blocks. If so, return the magic error code for
+        * non-atomic write so that we fallback to buffered I/O and attempt to
+        * complete the remainder of the I/O.
+        * For atomic writes we will simply fail the I/O request if we couldn't
+        * write anything. For non-atomic writes, any blocks that may have been
+        * allocated in preparation for the direct I/O will be reused during
+        * buffered I/O.
         */
-       if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
+       if (!(flags & IOMAP_ATOMIC) && (flags & (IOMAP_WRITE | IOMAP_DIRECT))
+                       && written == 0)
                return -ENOTBLK;

        return 0;
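
For readers who want to poke at the new condition in ext4_iomap_end() outside the kernel, here is a minimal user-space sketch of the predicate. The `MODEL_IOMAP_*` values are hypothetical and chosen only for illustration; the real flags are defined in the kernel's iomap header. As in the hunk above, the mask test `flags & (WRITE | DIRECT)` is true when either bit is set.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values for illustration; not the kernel's definitions. */
#define MODEL_IOMAP_WRITE	(1u << 0)
#define MODEL_IOMAP_DIRECT	(1u << 1)
#define MODEL_IOMAP_ATOMIC	(1u << 2)

/*
 * Model of the proposed check: request the buffered fallback only for
 * non-atomic writes that made no progress.
 */
static int model_iomap_end(unsigned int flags, long written)
{
	assert(written >= 0);

	if (!(flags & MODEL_IOMAP_ATOMIC) &&
	    (flags & (MODEL_IOMAP_WRITE | MODEL_IOMAP_DIRECT)) &&
	    written == 0)
		return -ENOTBLK;

	return 0;
}
```

Under these assumptions an atomic write that wrote nothing returns 0 from the end hook, so no buffered fallback is requested for it.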


> -Dave.
> -- 
> Dave Chinner
> david@...morbit.com

Thanks a lot for the review!

-ritesh
