linux-kernel - Re: [PATCHv3 0/8] direct-io: even more flexible io vectors

Message-ID: <aLcBivUrXs0YZ-pq@kernel.org>
Date: Tue, 2 Sep 2025 10:39:06 -0400
From: Mike Snitzer <snitzer@...nel.org>
To: Jan Kara <jack@...e.cz>
Cc: Ritesh Harjani <ritesh.list@...il.com>, Keith Busch <kbusch@...nel.org>,
	Keith Busch <kbusch@...a.com>, linux-block@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
	linux-ext4@...r.kernel.org, axboe@...nel.dk, dw@...idwei.uk,
	brauner@...nel.org, hch@....de, martin.petersen@...cle.com,
	djwong@...nel.org, linux-xfs@...r.kernel.org,
	viro@...iv.linux.org.uk, Jan Kara <jack@...e.com>,
	Brian Foster <bfoster@...hat.com>
Subject: Re: [PATCHv3 0/8] direct-io: even more flexible io vectors

On Mon, Sep 01, 2025 at 09:55:20AM +0200, Jan Kara wrote:
> Hi Mike!
> 
> On Wed 27-08-25 12:09:29, Mike Snitzer wrote:
> > On Wed, Aug 27, 2025 at 05:20:53PM +0200, Jan Kara wrote:
> > > On Tue 26-08-25 10:29:58, Ritesh Harjani wrote:
> > > > Keith Busch <kbusch@...nel.org> writes:
> > > > 
> > > > > On Mon, Aug 25, 2025 at 02:07:15PM +0200, Jan Kara wrote:
> > > > >> On Fri 22-08-25 18:57:08, Ritesh Harjani wrote:
> > > > >> > Keith Busch <kbusch@...a.com> writes:
> > > > >> > >
> > > > >> > >   - EXT4 falls back to buffered io for writes but not for reads.
> > > > >> > 
> > > > >> > ++linux-ext4 to get any historical context on why EXT4 DIO behaves
> > > > >> > differently for reads vs. writes.
> > > > >> 
> > > > >> Hum, how did you test? Because in the basic testing I did (with vanilla
> > > > >> kernel) I get EINVAL when doing unaligned DIO write in ext4... We should be
> > > > >> falling back to buffered IO only if the underlying file itself does not
> > > > >> support any kind of direct IO.
> > > > >
> > > > > Simple test case (dio-offset-test.c) below.
> > > > >
> > > > > I also ran this on vanilla kernel and got these results:
> > > > >
> > > > >   # mkfs.ext4 /dev/vda
> > > > >   # mount /dev/vda /mnt/ext4/
> > > > >   # make dio-offset-test
> > > > >   # ./dio-offset-test /mnt/ext4/foobar
> > > > >   write: Success
> > > > >   read: Invalid argument
> > > > >
> > > > > I tracked the "write: Success" down to ext4's handling for the "special"
> > > > > -ENOTBLK error after ext4_want_directio_fallback() returns "true".
> > > > >
> > > > 
> > > > Right. Ext4 has a fallback to buffered IO only for DIO writes, not for
> > > > DIO reads:
> > > > 
> > > > static inline bool ext4_want_directio_fallback(unsigned flags, ssize_t written)
> > > > {
> > > > 	/* must be a directio to fall back to buffered */
> > > > 	if ((flags & (IOMAP_WRITE | IOMAP_DIRECT)) !=
> > > > 		    (IOMAP_WRITE | IOMAP_DIRECT))
> > > > 		return false;
> > > > 
> > > >     ...
> > > > }
> > > > 
> > > > So basically the path is ext4_file_[read|write]_iter() -> iomap_dio_rw
> > > >     -> iomap_dio_bio_iter() -> return -EINVAL. i.e. from...
> > > > 
> > > > 
> > > > 	if ((pos | length) & (bdev_logical_block_size(iomap->bdev) - 1) ||
> > > > 	    !bdev_iter_is_aligned(iomap->bdev, dio->submit.iter))
> > > > 		return -EINVAL;
> > > > 
> > > > EXT4 then falls back to buffered IO only for writes, but not for reads.
> > > 
> > > Right. And the fallback for writes was actually inadvertently "added" by
> > > commit bc264fea0f6f "iomap: support incremental iomap_iter advances". That
> > > changed the error handling logic. Previously if iomap_dio_bio_iter()
> > > returned EINVAL, it got propagated to userspace regardless of what
> > > ->iomap_end() returned. After this commit if ->iomap_end() returns error
> > > (which is ENOTBLK in ext4 case), it gets propagated to userspace instead of
> > > the error returned by iomap_dio_bio_iter().
> > > 
> > > Now both the old and new behavior make some sense so I won't argue that the
> > > new iomap_iter() behavior is wrong. But I think we should change ext4 back
> > > to the old behavior of failing unaligned dio writes instead of them falling
> > > back to buffered IO. I think something like the attached patch should do
> > > the trick - it makes unaligned dio writes fail again while writes to holes
> > > of indirect-block mapped files still correctly fall back to buffered IO.
> > > Once fstests run completes, I'll do a proper submission...
> > > 
> > > 
> > > 								Honza
> > > -- 
> > > Jan Kara <jack@...e.com>
> > > SUSE Labs, CR
> > 
> > > From ce6da00a09647a03013c3f420c2e7ef7489c3de8 Mon Sep 17 00:00:00 2001
> > > From: Jan Kara <jack@...e.cz>
> > > Date: Wed, 27 Aug 2025 14:55:19 +0200
> > > Subject: [PATCH] ext4: Fail unaligned direct IO write with EINVAL
> > > 
> > > Commit bc264fea0f6f ("iomap: support incremental iomap_iter advances")
> > > changed the error handling logic in iomap_iter(). Previously any error
> > > from iomap_dio_bio_iter() got propagated to userspace, after this commit
> > > if ->iomap_end returns error, it gets propagated to userspace instead of
> > > > an error from iomap_dio_bio_iter(). This results in unaligned writes to
> > > > ext4 silently falling back to buffered IO instead of erroring out.
> > > 
> > > Now returning ENOTBLK for DIO writes from ext4_iomap_end() seems
> > > unnecessary these days. It is enough to return ENOTBLK from
> > > ext4_iomap_begin() when we don't support DIO write for that particular
> > > file offset (due to hole).
> > 
> > Any particular reason for ext4 still returning -ENOTBLK for unaligned
> > DIO?
> 
> No, that is actually the bug I'm speaking about - ext4 should be returning
> EINVAL for unaligned DIO as other filesystems do but after recent iomap
> changes it started to return ENOTBLK.
> 
> > In my experience XFS returns -EINVAL when failing unaligned DIO (but
> > maybe there are edge cases where that isn't always the case?)
> > 
> > Would be nice to have consistency across filesystems for what is
> > returned when failing unaligned DIO.
> 
> Agreed although there are various corner cases like files which never
> support direct IO - e.g. with data journalling - and thus fallback to
> buffered IO happens before any alignment checks. 
> 
> > The iomap code returns -ENOTBLK as "the magic error code to fall back
> > to buffered I/O".  But that seems only for page cache invalidation
> > failure, _not_ for unaligned DIO.
> > 
> > (Anyway, __iomap_dio_rw's WRITE handling can return -ENOTBLK if page
> > cache invalidation fails during DIO write. So it seems higher-level
> > code, like I've added to NFS/NFSD to check for unaligned DIO failure,
> > should check for both -EINVAL and -ENOTBLK).
> 
> I think the idea here is that if page cache invalidation fails we want to
> fallback to buffered IO so that we don't cause cache coherency issues and
> that's why ENOTBLK is returned.
> 
> > ps. ENOTBLK is actually much less easily confused with other random
> > uses of EINVAL (EINVAL use is generally way too overloaded, rendering
> > it a pretty unhelpful error).  But switching XFS to use ENOTBLK
> > instead of EINVAL seems like disruptive interface breakage (I suppose
> > same could be said for ext4 if it were to now return EINVAL for
> > unaligned DIO, but ext4 flip-flopping on how it handles unaligned DIO
> > prompted me to ask these questions now)
> 
> Definitely. In this particular case EINVAL for unaligned DIO has been there
> for ages and there is likely some userspace program somewhere that depends on
> it.

Thanks for your reply, that all makes sense.

Mike
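
For illustration only, here is a minimal userspace sketch of the two checks
discussed in this thread. It is not code from the patch or from NFS/NFSD.
The helper names dio_range_is_aligned() and dio_error_is_unaligned_or_fallback()
are hypothetical. The first mirrors only the (pos | length) part of the quoted
iomap check and assumes the logical block size is a power of two; it does not
model bdev_iter_is_aligned(). The second assumes kernel-style negative errno
return values and treats both -EINVAL and -ENOTBLK as a failed or
fallback-requesting DIO. What the caller then does (retry buffered, report the
error) is a policy decision left open here.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if both the file offset and the I/O length are
 * multiples of the logical block size. Assumes lbs is a power of two, so
 * (lbs - 1) is a mask of the low bits that must all be zero.
 */
static bool dio_range_is_aligned(uint64_t pos, uint64_t length,
				 unsigned int lbs)
{
	return ((pos | length) & ((uint64_t)lbs - 1)) == 0;
}

/*
 * Hypothetical helper: true if a kernel-style return value (negative errno)
 * is one of the two codes discussed in the thread: -EINVAL (unaligned DIO
 * rejected) or -ENOTBLK (the "fall back to buffered I/O" code).
 */
static bool dio_error_is_unaligned_or_fallback(long ret)
{
	return ret == -EINVAL || ret == -ENOTBLK;
}
```

With a 512-byte logical block size, an I/O at offset 4096 of length 8192
passes the first check, while one at offset 100 does not.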