[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <878qza2t1p.fsf@gmail.com>
Date: Wed, 12 Jun 2024 18:54:02 +0530
From: Ritesh Harjani (IBM) <ritesh.list@...il.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: linux-ext4@...r.kernel.org, linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org, Dave Chinner <david@...morbit.com>, Matthew Wilcox <willy@...radead.org>, Christoph Hellwig <hch@...radead.org>, Christian Brauner <brauner@...nel.org>, Ojaswin Mujoo <ojaswin@...ux.ibm.com>, Jan Kara <jack@...e.cz>, Luis Chamberlain <mcgrof@...nel.org>
Subject: Re: [PATCH] Documentation: document the design of iomap and how to port
<snip>
> +Direct I/O
> +----------
> +
> +In Linux, direct I/O is defined as file I/O that is issued directly to
> +storage, bypassing the pagecache.
> +
> +The ``iomap_dio_rw`` function implements O_DIRECT (direct I/O) reads and
> +writes for files.
> +An optional ``ops`` parameter can be passed to change the behavior of
> +direct I/O.
Did you mean "dops" iomap_dio_ops (instead of ops)?
ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
unsigned int dio_flags, void *private, size_t done_before);
1. Also can you please explain what you meant by "change the behavior of
direct-io"?
2. Do you think we should add the function declaration of
iomap_dio_rw() here, given it has so many arguments?
> +The ``done_before`` parameter should be set if writes have been
> +initiated prior to the call.
I don't think this is specific to "writes" alone.
Maybe this?
The ``done_before`` parameter tells the how much of the request has
already been transferred. It gets used for finishing a request
asynchronously when part of the request has already been complete
synchronously.
Maybe please also add a the link to this (for easy reference).
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c03098d4b9ad76bca2966a8769dcfe59f7f85103
> +The direction of the I/O is determined from the ``iocb`` passed in.
> +
> +The ``flags`` argument can be any of the following values:
Callers of iomap_dio_rw() can set the flags argument which can be any of
the following values:
Just a bit more descriptive ^^^
> +
> + * ``IOMAP_DIO_FORCE_WAIT``: Wait for the I/O to complete even if the
> + kiocb is not synchronous.
Adding an example would be nice.
e.g. callers might want to consider setting this flag for extending writes.
> +
> + * ``IOMAP_DIO_OVERWRITE_ONLY``: Allocating blocks, zeroing partial
> + blocks, and extensions of the file size are not allowed.
> + The entire file range must to map to a single written or unwritten
^^^ an extra to
> + extent.
> + This flag exists to enable issuing concurrent direct IOs with only
> + the shared ``i_rwsem`` held when the file I/O range is not aligned to
> + the filesystem block size.
> + ``-EAGAIN`` will be returned if the operation cannot proceed.
Can we please add these below details too. I would rather avoid wasting
my time in searching the history about, why EXT4 does not use this flag :)
Currently XFS uses this flag. EXT4 does not use it since it checks for
overwrites or unaligned overwrites and uses appropriate locking
up front rather than on a retry response to -EAGAIN [1] [2].
[1]: https://lore.kernel.org/linux-ext4/20230810165559.946222-1-bfoster@redhat.com/
[2]: https://lore.kernel.org/linux-ext4/20230314130759.642710-1-bfoster@redhat.com/
> +
> + * ``IOMAP_DIO_PARTIAL``: If a page fault occurs, return whatever
> + progress has already been made.
> + The caller may deal with the page fault and retry the operation.
Callers use ``dio_before`` argument along with ``IOMAP_DIO_PARTIAL`` to
tell the iomap subsystem about how much of the requested I/O was already
done.
> +
> +These ``struct kiocb`` flags are significant for direct I/O with iomap:
> +
> + * ``IOCB_NOWAIT``: Only proceed with the I/O if mapping data are
> + already in memory, we do not have to initiate other I/O, and we
> + acquire all filesystem locks without blocking.
> +
Maybe explicitly mentioning about "no block allocation"?
* ``IOCB_NOWAIT``: Only proceed with the I/O if mapping data are
already in memory, we do not have to initiate other I/O or do any
block allocations, and we acquire all filesystem locks without
blocking.
> + * ``IOCB_SYNC``: Ensure that the device has persisted data to disk
> + BEFORE completing the call.
> + In the case of pure overwrites, the I/O may be issued with FUA
> + enabled.
> +
> + * ``IOCB_HIPRI``: Poll for I/O completion instead of waiting for an
> + interrupt.
> + Only meaningful for asynchronous I/O, and only if the entire I/O can
> + be issued as a single ``struct bio``.
> +
> + * ``IOCB_DIO_CALLER_COMP``: Try to run I/O completion from the caller's
> + process context.
> + See ``linux/fs.h`` for more details.
> +
> +Filesystems should call ``iomap_dio_rw`` from ``->read_iter`` and
> +``->write_iter``, and set ``FMODE_CAN_ODIRECT`` in the ``->open``
> +function for the file.
> +They should not set ``->direct_IO``, which is deprecated.
> +
Return value:
~~~~~~~~~~~~~
On success it will return the number of bytes transferred. On failure it
will return a negative error value.
Note:
-ENOTBLK is a magic return value which callers may use for falling back
to buffered-io. ->iomap_end()/->iomap_begin() can decide to return this
magic return value if it decides to fallback to buffered-io. iomap
subsystem return this value in case if it fails to invalidate the
pagecache pages belonging to the direct-io range before initiating the
direct-io.
-EIOCBQUEUED is returned when an async direct-io request is queued for I/O.
> +If a filesystem wishes to perform its own work before direct I/O
> +completion, it should call ``__iomap_dio_rw``.
> +If its return value is not an error pointer or a NULL pointer, the
> +filesystem should pass the return value to ``iomap_dio_complete`` after
> +finishing its internal work.
> +
> +Direct Reads
> +~~~~~~~~~~~~
> +
> +A direct I/O read initiates a read I/O from the storage device to the
> +caller's buffer.
> +Dirty parts of the pagecache are flushed to storage before initiating
> +the read io.
> +The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT`` with
> +any combination of the following enhancements:
> +
> + * ``IOMAP_NOWAIT``: Read if mapping data are already in memory.
> + Does not initiate other I/O or block on filesystem locks.
> +
> +Callers commonly hold ``i_rwsem`` in shared mode.
> +
> +Direct Writes
> +~~~~~~~~~~~~~
> +
> +A direct I/O write initiates a write I/O to the storage device to the
> +caller's buffer.
to the storage device "from" the caller's buffer.
> +Dirty parts of the pagecache are flushed to storage before initiating
> +the write io.
> +The pagecache is invalidated both before and after the write io.
> +The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT |
> +IOMAP_WRITE`` with any combination of the following enhancements:
> +
> + * ``IOMAP_NOWAIT``: Write if mapping data are already in memory.
> + Does not initiate other I/O or block on filesystem locks.
> + * ``IOMAP_OVERWRITE_ONLY``: Allocating blocks and zeroing partial
> + blocks is not allowed.
> + The entire file range must to map to a single written or unwritten
> + extent.
> + The file I/O range must be aligned to the filesystem block size.
> +
> +Callers commonly hold ``i_rwsem`` in shared or exclusive mode.
> +
> +struct iomap_dio_ops:
> +~~~~~~~~~~~~~~~~~~~~~
> +.. code-block:: c
> +
> + struct iomap_dio_ops {
> + void (*submit_io)(const struct iomap_iter *iter, struct bio *bio,
> + loff_t file_offset);
> + int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
> + unsigned flags);
> + struct bio_set *bio_set;
> + };
> +
> +The fields of this structure are as follows:
> +
> + - ``submit_io``: iomap calls this function when it has constructed a
> + ``struct bio`` object for the I/O requested, and wishes to submit it
> + to the block device.
> + If no function is provided, ``submit_bio`` will be called directly.
> + Filesystems that would like to perform additional work before (e.g.
> + data replication for btrfs) should implement this function.
> +
> + - ``end_io``: This is called after the ``struct bio`` completes.
> + This function should perform post-write conversions of unwritten
> + extent mappings, handle write failures, etc.
> + The ``flags`` argument may be set to a combination of the following:
> +
> + * ``IOMAP_DIO_UNWRITTEN``: The mapping was unwritten, so the ioend
> + should mark the extent as written.
> +
> + * ``IOMAP_DIO_COW``: Writing to the space in the mapping required a
> + copy on write operation, so the ioend should switch mappings.
> +
> + - ``bio_set``: This allows the filesystem to provide a custom bio_set
> + for allocating direct I/O bios.
> + This enables filesystems to `stash additional per-bio information
> + <https://lore.kernel.org/all/20220505201115.937837-3-hch@lst.de/>`_
> + for private use.
> + If this field is NULL, generic ``struct bio`` objects will be used.
> +
> +Filesystems that want to perform extra work after an I/O completion
> +should set a custom ``->bi_end_io`` function via ``->submit_io``.
> +Afterwards, the custom endio function must call
> +``iomap_dio_bio_end_io`` to finish the direct I/O.
> +
> +DAX I/O
> +-------
> +
> +Storage devices that can be directly mapped as memory support a new
> +access mode known as "fsdax".
Added a comma before "support" for better readability.
Storage devices that can be directly mapped as memory, support a new
access mode known as "fsdax".
> +
> +fsdax Reads
> +~~~~~~~~~~~
> +
> +A fsdax read performs a memcpy from storage device to the caller's
> +buffer.
> +The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX`` with any
> +combination of the following enhancements:
> +
> + * ``IOMAP_NOWAIT``: Read if mapping data are already in memory.
> + Does not initiate other I/O or block on filesystem locks.
> +
> +Callers commonly hold ``i_rwsem`` in shared mode.
> +
> +fsdax Writes
> +~~~~~~~~~~~~
> +
> +A fsdax write initiates a memcpy to the storage device to the caller's
"from" the storage device
> +buffer.
> +The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX |
> +IOMAP_WRITE`` with any combination of the following enhancements:
> +
> + * ``IOMAP_NOWAIT``: Write if mapping data are already in memory.
> + Does not initiate other I/O or block on filesystem locks.
> +
> + * ``IOMAP_OVERWRITE_ONLY``: Allocating blocks and zeroing partial
> + blocks is not allowed.
> + The entire file range must to map to a single written or unwritten
> + extent.
> + The file I/O range must be aligned to the filesystem block size.
> +
> +Callers commonly hold ``i_rwsem`` in exclusive mode.
> +
> +mmap Faults
> +~~~~~~~~~~~
> +
> +The ``dax_iomap_fault`` function handles read and write faults to fsdax
> +storage.
> +For a read fault, ``IOMAP_DAX | IOMAP_FAULT`` will be passed as the
> +``flags`` argument to ``->iomap_begin``.
> +For a write fault, ``IOMAP_DAX | IOMAP_FAULT | IOMAP_WRITE`` will be
> +passed as the ``flags`` argument to ``->iomap_begin``.
> +
> +Callers commonly hold the same locks as they do to call their iomap
> +pagecache counterparts.
> +
> +Truncation, fallocate, and Unsharing
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +For fsdax files, the following functions are provided to replace their
> +iomap pagecache I/O counterparts.
> +The ``flags`` argument to ``->iomap_begin`` are the same as the
> +pagecache counterparts, with ``IOMAP_DIO`` added.
with "IOMAP_DAX"
> +
> + * ``dax_file_unshare``
> + * ``dax_zero_range``
> + * ``dax_truncate_page``
Shall we mention
"dax_remap_file_range_prep()/dax_dedupe_file_range_compare()" ?
> +
> +Callers commonly hold the same locks as they do to call their iomap
> +pagecache counterparts.
> +
Stopping here for now. Will resume the rest of the document review
later.
-ritesh
Powered by blists - more mailing lists