lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <878qza2t1p.fsf@gmail.com>
Date: Wed, 12 Jun 2024 18:54:02 +0530
From: Ritesh Harjani (IBM) <ritesh.list@...il.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: linux-ext4@...r.kernel.org, linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org, Dave Chinner <david@...morbit.com>, Matthew Wilcox <willy@...radead.org>, Christoph Hellwig <hch@...radead.org>, Christian Brauner <brauner@...nel.org>, Ojaswin Mujoo <ojaswin@...ux.ibm.com>, Jan Kara <jack@...e.cz>, Luis Chamberlain <mcgrof@...nel.org>
Subject: Re: [PATCH] Documentation: document the design of iomap and how to port


<snip>
> +Direct I/O
> +----------
> +
> +In Linux, direct I/O is defined as file I/O that is issued directly to
> +storage, bypassing the pagecache.
> +
> +The ``iomap_dio_rw`` function implements O_DIRECT (direct I/O) reads and
> +writes for files.
> +An optional ``ops`` parameter can be passed to change the behavior of
> +direct I/O.

Did you mean "dops" iomap_dio_ops (instead of ops)?

ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
		const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
		unsigned int dio_flags, void *private, size_t done_before);

1. Also can you please explain what you meant by "change the behavior of
direct-io"?

2. Do you think we should add the function declaration of
iomap_dio_rw() here, given it has so many arguments?


> +The ``done_before`` parameter should be set if writes have been
> +initiated prior to the call.

I don't think this is specific to "writes" alone. 

Maybe this?

The ``done_before`` parameter tells the how much of the request has
already been transferred. It gets used for finishing a request
asynchronously when part of the request has already been complete
synchronously.

Maybe please also add a the link to this (for easy reference).
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c03098d4b9ad76bca2966a8769dcfe59f7f85103

> +The direction of the I/O is determined from the ``iocb`` passed in.
> +
> +The ``flags`` argument can be any of the following values:

Callers of iomap_dio_rw() can set the flags argument which can be any of
the following values:

Just a bit more descriptive ^^^

> +
> + * ``IOMAP_DIO_FORCE_WAIT``: Wait for the I/O to complete even if the
> +   kiocb is not synchronous.

Adding an example would be nice.

e.g. callers might want to consider setting this flag for extending writes.

> +
> + * ``IOMAP_DIO_OVERWRITE_ONLY``: Allocating blocks, zeroing partial
> +   blocks, and extensions of the file size are not allowed.
> +   The entire file range must to map to a single written or unwritten
                                ^^^ an extra to

> +   extent.
> +   This flag exists to enable issuing concurrent direct IOs with only
> +   the shared ``i_rwsem`` held when the file I/O range is not aligned to
> +   the filesystem block size.
> +   ``-EAGAIN`` will be returned if the operation cannot proceed.

Can we please add these below details too. I would rather avoid wasting
my time in searching the history about, why EXT4 does not use this flag :)

Currently XFS uses this flag. EXT4 does not use it since it checks for
overwrites or unaligned overwrites and uses appropriate locking
up front rather than on a retry response to -EAGAIN [1] [2].

[1]: https://lore.kernel.org/linux-ext4/20230810165559.946222-1-bfoster@redhat.com/
[2]: https://lore.kernel.org/linux-ext4/20230314130759.642710-1-bfoster@redhat.com/

> +
> + * ``IOMAP_DIO_PARTIAL``: If a page fault occurs, return whatever
> +   progress has already been made.
> +   The caller may deal with the page fault and retry the operation.

Callers use ``dio_before`` argument along with ``IOMAP_DIO_PARTIAL`` to
tell the iomap subsystem about how much of the requested I/O was already
done.

> +
> +These ``struct kiocb`` flags are significant for direct I/O with iomap:
> +
> + * ``IOCB_NOWAIT``: Only proceed with the I/O if mapping data are
> +   already in memory, we do not have to initiate other I/O, and we
> +   acquire all filesystem locks without blocking.
> +

Maybe explicitly mentioning about "no block allocation"?

* ``IOCB_NOWAIT``: Only proceed with the I/O if mapping data are
  already in memory, we do not have to initiate other I/O or do any
  block allocations, and we acquire all filesystem locks without
  blocking.
  
> + * ``IOCB_SYNC``: Ensure that the device has persisted data to disk
> +   BEFORE completing the call.
> +   In the case of pure overwrites, the I/O may be issued with FUA
> +   enabled.
> +
> + * ``IOCB_HIPRI``: Poll for I/O completion instead of waiting for an
> +   interrupt.
> +   Only meaningful for asynchronous I/O, and only if the entire I/O can
> +   be issued as a single ``struct bio``.
> +
> + * ``IOCB_DIO_CALLER_COMP``: Try to run I/O completion from the caller's
> +   process context.
> +   See ``linux/fs.h`` for more details.
> +
> +Filesystems should call ``iomap_dio_rw`` from ``->read_iter`` and
> +``->write_iter``, and set ``FMODE_CAN_ODIRECT`` in the ``->open``
> +function for the file.
> +They should not set ``->direct_IO``, which is deprecated.
> +

Return value: 
~~~~~~~~~~~~~
On success it will return the number of bytes transferred. On failure it
will return a negative error value. 

Note:
-ENOTBLK is a magic return value which callers may use for falling back
to buffered-io. ->iomap_end()/->iomap_begin() can decide to return this
magic return value if it decides to fallback to buffered-io. iomap
subsystem return this value in case if it fails to invalidate the
pagecache pages belonging to the direct-io range before initiating the
direct-io.

-EIOCBQUEUED is returned when an async direct-io request is queued for I/O. 

> +If a filesystem wishes to perform its own work before direct I/O
> +completion, it should call ``__iomap_dio_rw``.
> +If its return value is not an error pointer or a NULL pointer, the
> +filesystem should pass the return value to ``iomap_dio_complete`` after
> +finishing its internal work.
> +
> +Direct Reads
> +~~~~~~~~~~~~
> +
> +A direct I/O read initiates a read I/O from the storage device to the
> +caller's buffer.
> +Dirty parts of the pagecache are flushed to storage before initiating
> +the read io.
> +The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT`` with
> +any combination of the following enhancements:
> +
> + * ``IOMAP_NOWAIT``: Read if mapping data are already in memory.
> +   Does not initiate other I/O or block on filesystem locks.
> +
> +Callers commonly hold ``i_rwsem`` in shared mode.
> +
> +Direct Writes
> +~~~~~~~~~~~~~
> +
> +A direct I/O write initiates a write I/O to the storage device to the
> +caller's buffer.

to the storage device "from" the caller's buffer.

> +Dirty parts of the pagecache are flushed to storage before initiating
> +the write io.
> +The pagecache is invalidated both before and after the write io.
> +The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DIRECT |
> +IOMAP_WRITE`` with any combination of the following enhancements:
> +
> + * ``IOMAP_NOWAIT``: Write if mapping data are already in memory.
> +   Does not initiate other I/O or block on filesystem locks.
> + * ``IOMAP_OVERWRITE_ONLY``: Allocating blocks and zeroing partial
> +   blocks is not allowed.
> +   The entire file range must to map to a single written or unwritten
> +   extent.
> +   The file I/O range must be aligned to the filesystem block size.
> +
> +Callers commonly hold ``i_rwsem`` in shared or exclusive mode.
> +
> +struct iomap_dio_ops:
> +~~~~~~~~~~~~~~~~~~~~~
> +.. code-block:: c
> +
> + struct iomap_dio_ops {
> +     void (*submit_io)(const struct iomap_iter *iter, struct bio *bio,
> +                       loff_t file_offset);
> +     int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
> +                   unsigned flags);
> +     struct bio_set *bio_set;
> + };
> +
> +The fields of this structure are as follows:
> +
> +  - ``submit_io``: iomap calls this function when it has constructed a
> +    ``struct bio`` object for the I/O requested, and wishes to submit it
> +    to the block device.
> +    If no function is provided, ``submit_bio`` will be called directly.
> +    Filesystems that would like to perform additional work before (e.g.
> +    data replication for btrfs) should implement this function.
> +
> +  - ``end_io``: This is called after the ``struct bio`` completes.
> +    This function should perform post-write conversions of unwritten
> +    extent mappings, handle write failures, etc.
> +    The ``flags`` argument may be set to a combination of the following:
> +
> +    * ``IOMAP_DIO_UNWRITTEN``: The mapping was unwritten, so the ioend
> +      should mark the extent as written.
> +
> +    * ``IOMAP_DIO_COW``: Writing to the space in the mapping required a
> +      copy on write operation, so the ioend should switch mappings.
> +
> +  - ``bio_set``: This allows the filesystem to provide a custom bio_set
> +    for allocating direct I/O bios.
> +    This enables filesystems to `stash additional per-bio information
> +    <https://lore.kernel.org/all/20220505201115.937837-3-hch@lst.de/>`_
> +    for private use.
> +    If this field is NULL, generic ``struct bio`` objects will be used.
> +
> +Filesystems that want to perform extra work after an I/O completion
> +should set a custom ``->bi_end_io`` function via ``->submit_io``.
> +Afterwards, the custom endio function must call
> +``iomap_dio_bio_end_io`` to finish the direct I/O.
> +
> +DAX I/O
> +-------
> +
> +Storage devices that can be directly mapped as memory support a new
> +access mode known as "fsdax".

Added a comma before "support" for better readability.

Storage devices that can be directly mapped as memory, support a new
access mode known as "fsdax".


> +
> +fsdax Reads
> +~~~~~~~~~~~
> +
> +A fsdax read performs a memcpy from storage device to the caller's
> +buffer.
> +The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX`` with any
> +combination of the following enhancements:
> +
> + * ``IOMAP_NOWAIT``: Read if mapping data are already in memory.
> +   Does not initiate other I/O or block on filesystem locks.
> +
> +Callers commonly hold ``i_rwsem`` in shared mode.
>   +
> +fsdax Writes
> +~~~~~~~~~~~~
> +
> +A fsdax write initiates a memcpy to the storage device to the caller's

"from" the storage device

> +buffer.
> +The ``flags`` value for ``->iomap_begin`` will be ``IOMAP_DAX |
> +IOMAP_WRITE`` with any combination of the following enhancements:
> +
> + * ``IOMAP_NOWAIT``: Write if mapping data are already in memory.
> +   Does not initiate other I/O or block on filesystem locks.
> +
> + * ``IOMAP_OVERWRITE_ONLY``: Allocating blocks and zeroing partial
> +   blocks is not allowed.
> +   The entire file range must to map to a single written or unwritten
> +   extent.
> +   The file I/O range must be aligned to the filesystem block size.
> +
> +Callers commonly hold ``i_rwsem`` in exclusive mode.
> +
> +mmap Faults
> +~~~~~~~~~~~
> +
> +The ``dax_iomap_fault`` function handles read and write faults to fsdax
> +storage.
> +For a read fault, ``IOMAP_DAX | IOMAP_FAULT`` will be passed as the
> +``flags`` argument to ``->iomap_begin``.
> +For a write fault, ``IOMAP_DAX | IOMAP_FAULT | IOMAP_WRITE`` will be
> +passed as the ``flags`` argument to ``->iomap_begin``.
> +
> +Callers commonly hold the same locks as they do to call their iomap
> +pagecache counterparts.
> +
> +Truncation, fallocate, and Unsharing
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> +For fsdax files, the following functions are provided to replace their
> +iomap pagecache I/O counterparts.
> +The ``flags`` argument to ``->iomap_begin`` are the same as the
> +pagecache counterparts, with ``IOMAP_DIO`` added.

with "IOMAP_DAX"


> +
> + * ``dax_file_unshare``
> + * ``dax_zero_range``
> + * ``dax_truncate_page``

Shall we mention
"dax_remap_file_range_prep()/dax_dedupe_file_range_compare()" ?

> +
> +Callers commonly hold the same locks as they do to call their iomap
> +pagecache counterparts.
> +

Stopping here for now. Will resume the rest of the document review
later.

-ritesh

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ