linux-ext4 - Re: [PATCH v4 8/8] ext4: introduce direct I/O write path using iomap infrastructure

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20191008151238.GK5078@quack2.suse.cz>
Date:   Tue, 8 Oct 2019 17:12:38 +0200
From:   Jan Kara <jack@...e.cz>
To:     Matthew Bobrowski <mbobrowski@...browski.org>
Cc:     tytso@....edu, jack@...e.cz, adilger.kernel@...ger.ca,
        linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
        hch@...radead.org, david@...morbit.com, darrick.wong@...cle.com
Subject: Re: [PATCH v4 8/8] ext4: introduce direct I/O write path using iomap
 infrastructure

On Thu 03-10-19 21:35:05, Matthew Bobrowski wrote:
> This patch introduces a new direct I/O write code path implementation
> that makes use of the iomap infrastructure.
> 
> All direct I/O write operations are now passed from the ->write_iter()
> callback to the new function ext4_dio_write_iter(). This function is
> responsible for calling into iomap infrastructure via
> iomap_dio_rw(). Snippets of the direct I/O code from within
> ext4_file_write_iter(), such as checking whether the IO request is
> unaligned asynchronous I/O, or whether it will ber overwriting
> allocated and initialized blocks has been moved out and into
> ext4_dio_write_iter().
> 
> The block mapping flags that are passed to ext4_map_blocks() from
> within ext4_dio_get_block() and friends have effectively been taken
> out and introduced within the ext4_iomap_alloc().
> 
> For inode extension cases, the ->end_io() callback
> ext4_dio_write_end_io() is responsible for calling into
> ext4_handle_inode_extension() and performing the necessary metadata
> updates. Failed writes that we're intended to be extend the inode will
> have blocks truncated accordingly. The ->end_io() handler is also
> responsible for converting allocated unwritten extents to written
> extents.
> 
> In the instance of a short write, we fallback to buffered I/O and use
> that method to complete whatever is left over in 'iter'.

> Any blocks
> that have been allocated in preparation for direct I/O write will be
> reused by buffered I/O, so there's no issue with leaving allocated
> blocks beyond EOF.

This actually is not true as ext4_truncate_failed_write() will trim blocks
beyond EOF. Also this would not be 100% reliable as if we crash between DIO
short write succeeding and buffered write happening, we would leave inode
with blocks beyond EOF. So I'd just remove this sentence.

> Existing direct I/O write buffer_head code has been removed as it's
> now redundant.
> 
> Signed-off-by: Matthew Bobrowski <mbobrowski@...browski.org>

...

> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> +                                int error, unsigned int flags)
> +{
> +       loff_t offset = iocb->ki_pos;
> +       struct inode *inode = file_inode(iocb->ki_filp);
> +
> +       if (!error && (flags & IOMAP_DIO_UNWRITTEN))
> +               error = ext4_convert_unwritten_extents(NULL, inode, offset,
> +                                                      size);
> +       return ext4_handle_inode_extension(inode, offset, error ? : size,
> +                                          size);
> +}

I was pondering about this and I don't think we have it quite correct.
Still :-|. The problem is that iomap_dio_complete() will pass dio->size as
'size', which is the amount of submitted IO but not necessarily the amount
of blocks that were mapped (that can be larger). Thus
ext4_handle_inode_extension() can miss the fact that there are blocks
beyond EOF that need to be trimmed. And we have no way of finding that out
inside our ->end_io handler. Even iomap_dio_complete() doesn't have that
information so we'd need to add 'original length' to struct iomap_dio and
then pass it do ->end_io.

Seeing how difficult it is when a filesystem wants to complete the iocb
synchronously (regardless whether it is async or sync) and have all the
information in one place for further processing, I think it would be the
easiest to provide iomap_dio_rw_wait() that forces waiting for the iocb to
complete *and* returns the appropriate return value instead of pretty
useless EIOCBQUEUED. It is actually pretty trivial (patch attached). With
this we can then just call iomap_dio_rw_sync() for the inode extension case
with ->end_io doing just the unwritten extent processing and then call
ext4_handle_inode_extension() from ext4_direct_write_iter() where we would
have all the information we need.

Christoph, Darrick, what do you think about extending iomap like in the
attached patch (plus sample use in XFS)?

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR

View attachment "0001-iomap-Allow-forcing-of-waiting-for-running-DIO-in-io.patch" of type "text/x-patch" (2967 bytes)

View attachment "0002-xfs-Use-iomap_dio_rw_wait.patch" of type "text/x-patch" (1326 bytes)