linux-ext4 - Re: [PATCH] Documentation: document the design of iomap and how to port

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <875xug9dyt.fsf@gmail.com>
Date: Tue, 11 Jun 2024 12:13:22 +0530
From: Ritesh Harjani (IBM) <ritesh.list@...il.com>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: linux-ext4@...r.kernel.org, linux-xfs@...r.kernel.org, linux-fsdevel@...r.kernel.org, Dave Chinner <david@...morbit.com>, Matthew Wilcox <willy@...radead.org>, Christoph Hellwig <hch@...radead.org>, Christian Brauner <brauner@...nel.org>, Ojaswin Mujoo <ojaswin@...ux.ibm.com>, Jan Kara <jack@...e.cz>, Luis Chamberlain <mcgrof@...nel.org>
Subject: Re: [PATCH] Documentation: document the design of iomap and how to port

"Darrick J. Wong" <djwong@...nel.org> writes:

> On Mon, Jun 10, 2024 at 02:27:28PM +0530, Ritesh Harjani wrote:
>> 
>> Hello Darrick,
>> 
>> "Darrick J. Wong" <djwong@...nel.org> writes:
>> 
>> > From: Darrick J. Wong <djwong@...nel.org>
>> >
>> > This is the fourth attempt at documenting the design of iomap and how to
>> 
>> I agree that this isn't needed in the commit msg ("fourth attempt").
>
> Ok.  "Coapture the design of iomap and how to port..."
>
>> > port filesystems to use it.  Apologies for all the rst formatting, but
>> > it's necessary to distinguish code from regular text.
>> >
>> > A lot of this has been collected from various email conversations, code
>> > comments, commit messages, my own understanding of iomap, and
>> > Ritesh/Luis' previous efforts to create a document.  Please note a large
>> > part of this has been taken from Dave's reply to last iomap doc
>> > patchset. Thanks to Ritesh, Luis, Dave, Darrick, Matthew, Christoph and
>> > other iomap developers who have taken time to explain the iomap design
>> > in various emails, commits, comments etc.
>> >
>> > Cc: Dave Chinner <david@...morbit.com>
>> > Cc: Matthew Wilcox <willy@...radead.org>
>> > Cc: Christoph Hellwig <hch@...radead.org>
>> > Cc: Christian Brauner <brauner@...nel.org>
>> > Cc: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
>> > Cc: Jan Kara <jack@...e.cz>
>> > Cc: Luis Chamberlain <mcgrof@...nel.org>
>> > Inspired-by: Ritesh Harjani (IBM) <ritesh.list@...il.com>
>> 
>> I am not sure if this is even a valid or accepted tag.
>> But sure thanks! :)
>
> They're freeform tags, so they can be everything everyone wants them to
> be!  Drum circle kumbaya etc. :P
>
>> > Signed-off-by: Darrick J. Wong <djwong@...nel.org>
>> > ---
>> >  Documentation/filesystems/index.rst |    1 
>> >  Documentation/filesystems/iomap.rst | 1060 +++++++++++++++++++++++++++++++++++
>> >  MAINTAINERS                         |    1 
>> >  3 files changed, 1062 insertions(+)
>> >  create mode 100644 Documentation/filesystems/iomap.rst
>> >
>> > diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
>> > index 8f5c1ee02e2f..b010cc8df32d 100644
>> > --- a/Documentation/filesystems/index.rst
>> > +++ b/Documentation/filesystems/index.rst
>> > @@ -34,6 +34,7 @@ algorithms work.
>> >     seq_file
>> >     sharedsubtree
>> >     idmappings
>> > +   iomap
>> >  
>> >     automount-support
>> >  
>> > diff --git a/Documentation/filesystems/iomap.rst b/Documentation/filesystems/iomap.rst
>> > new file mode 100644
>> > index 000000000000..a478b55e4135
>> > --- /dev/null
>> > +++ b/Documentation/filesystems/iomap.rst
>> > @@ -0,0 +1,1060 @@
>> > +.. SPDX-License-Identifier: GPL-2.0
>> > +.. _iomap:
>> > +
>> > +..
>> > +        Dumb style notes to maintain the author's sanity:
>> > +        Please try to start sentences on separate lines so that
>> > +        sentence changes don't bleed colors in diff.
>> > +        Heading decorations are documented in sphinx.rst.
>> > +
>> > +============================
>> > +VFS iomap Design and Porting
>> > +============================
>> > +
>> > +.. toctree::
>> > +
>> > +Introduction
>> > +============
>> > +
>> > +iomap is a filesystem library for handling various filesystem operations
>> > +that involves mapping of file's logical offset ranges to physical
>> > +extents.
>> > +This origins of this library is the file I/O path that XFS once used; it
>> > +has now been extended to cover several other operations.
>> > +The library provides various APIs for implementing various file and
>>                         ^^^^ redundant "various"
>> 
>> > +pagecache operations, such as:
>> > +
>> > + * Pagecache reads and writes
>> > + * Folio write faults to the pagecache
>> > + * Writeback of dirty folios
>> > + * Direct I/O reads and writes
>> 
>> Dax I/O reads and writes.
>> ... as well please?
>
> It's really fsdax I/O reads, writes, loads, and stores, isn't it?
>

It felt like dax_iomap_rw() belongs to fs/iomap. 
But nevertheless, we could skip it if we are targetting fs/iomap/
lib.

>> 
>> > + * FIEMAP
>> > + * lseek ``SEEK_DATA`` and ``SEEK_HOLE``
>> > + * swapfile activation
>> > +
>> 
>> > +Who Should Read This?
>> > +=====================
>> > +
>> > +The target audience for this document are filesystem, storage, and
>> 
>> /s/and/,/
>> 
>> 
>> > +pagecache programmers and code reviewers.
>> 
>> Not sure if we even need this secion "Who Should Read This".
>
> That was a review comment from Luis' attempt to write this document:
> https://lore.kernel.org/linux-xfs/87zg61p78x.fsf@meer.lwn.net/
>
> Also I guess I should state explicitly:
>
> "If you are working on PCI, machine architectures, or device drivers,
> you are most likely in the wrong place."
>
>> > +The goal of this document is to provide a brief discussion of the
>> > +design and capabilities of iomap, followed by a more detailed catalog
>> > +of the interfaces presented by iomap.
>> > +If you change iomap, please update this design document.
>> 
>> The details of "goal of this document..." -> can be a part of
>> separate paragraph in "Introduction" section itself.
>> 
>> > +
>> > +But Why?
>> > +========
>> 
>> "Why Iomap?" is more clean IMO. 
>
> "Why VFS iomap?", then.
>
>> > +
>> > +Unlike the classic Linux I/O model which breaks file I/O into small
>> > +units (generally memory pages or blocks) and looks up space mappings on
>> > +the basis of that unit, the iomap model asks the filesystem for the
>> > +largest space mappings that it can create for a given file operation and
>> > +initiates operations on that basis.
>> > +This strategy improves the filesystem's visibility into the size of the
>> > +operation being performed, which enables it to combat fragmentation with
>> > +larger space allocations when possible.
>> > +Larger space mappings improve runtime performance by amortizing the cost
>> > +of a mapping function call into the filesystem across a larger amount of
>> 
>> s/call/calls
>
> Done.
>
>> > +data.
>> > +
>> > +At a high level, an iomap operation `looks like this
>> > +<https://lore.kernel.org/all/ZGbVaewzcCysclPt@dread.disaster.area/>`_:
>> > +
>> > +1. For each byte in the operation range...
>> > +
>> > +   1. Obtain space mapping via ->iomap_begin
>> > +   2. For each sub-unit of work...
>> > +
>> > +      1. Revalidate the mapping and go back to (1) above, if necessary
>> > +      2. Do the work
>> > +
>> > +   3. Increment operation cursor
>> > +   4. Release the mapping via ->iomap_end, if necessary
>> > +
>> > +Each iomap operation will be covered in more detail below.
>> > +This library was covered previously by an `LWN article
>> > +<https://lwn.net/Articles/935934/>`_ and a `KernelNewbies page
>> > +<https://kernelnewbies.org/KernelProjects/iomap>`_.
>> > +
>> > +Data Structures and Algorithms
>> > +==============================
>> > +
>> > +Definitions
>> > +-----------
>> > +
>> > + * ``bufferhead``: Shattered remnants of the old buffer cache.
>> > + * ``fsblock``: The block size of a file, also known as ``i_blocksize``.
>> > + * ``i_rwsem``: The VFS ``struct inode`` rwsemaphore.
>> > + * ``invalidate_lock``: The pagecache ``struct address_space``
>> > +   rwsemaphore that protects against folio removal.
>> 
>> This definition is a bit confusing & maybe even incomplete.
>> I think we should use this from header file.
>> 
>> @invalidate_lock: The pagecache ``struct address_sapce`` rwsemaphore
>>   that guards coherency between page cache contents and file offset->disk
>>   block mappings in the filesystem during invalidates. It is also used to
>>   block modification of page cache contents through memory mappings.
>> 
>> Also if we are describing definitions above - then I think we should
>> also clarify these locks/terms used in this document (I just looked
>> "lock" related terms for now)
>> 
>> - folio lock:
>> - dax lock:
>
> Er, what /is/ the dax lock?  Is that the dax_read_lock thing that (I
> think) wraps the dax rcu lock?  Which in turn exists so that we don't
> return from the dax offlining function before everyone's dropped all
> their references to internal structures?
>
>> - pagecache lock: 
>> - FS internal mapping lock: 
>> - Iomap internal operation lock:
>> 
>> > +
>> > +struct iomap_ops
>> > +----------------
>> 
>> IMO, we should define "struct iomap" in the begining. The reason is
>> because iomap_ops functions take "struct iomap" in it's function
>> arguments. So it's easier if we describe "struct iomap" before.
>
> Yeah, I agree.
>
>> > +
>> > +Every iomap function requires the filesystem to pass an operations
>> > +structure to obtain a mapping and (optionally) to release the mapping.
>> > +
>> > +.. code-block:: c
>> > +
>> > + struct iomap_ops {
>> > +     int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
>> > +                        unsigned flags, struct iomap *iomap,
>> > +                        struct iomap *srcmap);
>> > +
>> > +     int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
>> > +                      ssize_t written, unsigned flags,
>> > +                      struct iomap *iomap);
>> > + };
>> > +
>> > +The ``->iomap_begin`` function is called to obtain one mapping for the
>> > +range of bytes specified by ``pos`` and ``length`` for the file
>> > +``inode``.
>> 
>> I think it is better if we describe ->iomap_begin and ->iomap_end
>> in proper sub-sections. Otherwise this looks like we have clobbered
>> all the information together :)
>> 
>> ->iomap_begin 
>> ^^^^^^^^^^^^^^^^^
>
> Yes, I like the explicit section headings better.
>

yup.

>> This either returns an existing mapping or reserve/allocates a new
>> mapping.
>
> That's a filesystem specific detail -- all that iomap cares about is
> that the fs communicates a mapping.  Maybe the fs actually had to do a
> bunch of work to get that mapping, or maybe it's already laid out
> statically, ala zonefs.  Either way, it's not a concern of the iomap
> library.
>
>> logical file pos and length are in bytes which gets passed
>> as function arguments. Filesystem returns the new mapping information
>> within ``struct iomap`` which also gets passed as a function argument.
>> Filesystems should provide the details of this mapping by filling
>> various fields within ``struct iomap``.
>
> "iomap operations call ->iomap_begin to obtain one file mapping for the
> range of bytes specified by pos and length for the file inode.  This
> mapping should be returned through the iomap pointer.  The mapping must
> cover at least the first byte of the supplied file range, but it does
> not need to cover the entire requested range."
>

I like it. Thanks for adding that detail in the last line.

>>   @srcmap agument:
>>     Note that ->iomap_begin call has srcmap passed as another argument. This is
>>     mainly used only during the begin phase for COW mappings to identify where
>>     the reads are to be performed from. Filesystems needs to fill that mapping
>>     information if iomap should read data for partially written blocks from a
>>     different location than the write target [4].
>>   @flags argument:
>>     These are the operation types which iomap supports. 
>>     IOMAP_WRITE: For doing write I/O.
>>     <...>
>>     IOMAP_ZERO:
>>     IOMAP_REPORT:		
>>     IOMAP_FAULT:		
>>     IOMAP_DIRECT:		
>>     IOMAP_NOWAIT:		
>>     IOMAP_OVERWRITE_ONLY:
>>     IOMAP_UNSHARE:
>>     IOMAP_DAX:	
>
> I think it's /much/ more valuable to document the exact combinations
> that will be passed to ->iomap_begin further down where we talk about
> specific operations that iomap performs.
>
> Otherwise, someone is going to look at this list and wonder if they
> really need to figure out what IOMAP_ZERO|IOMAP_FAULT|IOMAP_DAX means,
> and if it's actually possible (it's not).
>

Sure.

>> 
>> ->iomap_end
>> ^^^^^^^^^^^^^^^^^
>> 
>> Commit and/or unreserve space which was previously allocated/reserved
>> in ``->iomap_begin``. For e.g. during buffered-io, when a short writes
>> occurs, filesystem may need to remove the reserved space that was
>> allocated during ->iomap_begin.
>> For filesystems that use delalloc allocation, we may need to punch out
>> delalloc extents from the range that are not dirty in
>> the page cache. See comments in
>> iomap_file_buffered_write_punch_delalloc() for more info [5][6].
>> 
>> (IMHO) I find above definitions more descriptive.
>
> I don't want to merge the general description with pagecache specific
> areas.  They already cover punch_delalloc.
>

sure.

>> > +
>> > +Each iomap operation describes the requested operation through the
>> > +``flags`` argument.
>> > +The exact value of ``flags`` will be documented in the
>> > +operation-specific sections below, but these principles apply generally:
>> > +
>> > + * For a write operation, ``IOMAP_WRITE`` will be set.
>> > +   Filesystems must not return ``IOMAP_HOLE`` mappings.
>> > +
>> > + * For any other operation, ``IOMAP_WRITE`` will not be set.
>> > +
>> 
>> Direct-io related operation which bypasses pagecache use IOMAP_DIRECT.
>
> That's covered in the pagecache/directio/dax subsection because I wanted
> to document specific combinations that filesystem authors should expect.
>

The points mentioned above were targetting buffered-io, dax, so I
thought we could add direct-io related flag as well here.

>> > + * For any operation targetting direct access to storage (fsdax),
>> > +   ``IOMAP_DAX`` will be set.
>> > +
>> > +If it is necessary to read existing file contents from a `different
>> > +<https://lore.kernel.org/all/20191008071527.29304-9-hch@lst.de/>`_ device or
>> > +address range on a device, the filesystem should return that information via
>> > +``srcmap``.
>> > +Only pagecache and fsdax operations support reading from one mapping and
>> > +writing to another.
>> > +
>> > +After the operation completes, the ``->iomap_end`` function, if present,
>> > +is called to signal that iomap is finished with a mapping.
>> > +Typically, implementations will use this function to tear down any
>> > +context that were set up in ``->iomap_begin``.
>> > +For example, a write might wish to commit the reservations for the bytes
>> > +that were operated upon and unreserve any space that was not operated
>> > +upon.
>> > +``written`` might be zero if no bytes were touched.
>> > +``flags`` will contain the same value passed to ``->iomap_begin``.
>> > +iomap ops for reads are not likely to need to supply this function.
>> > +
>> > +Both functions should return a negative errno code on error, or zero.
>> 
>> minor nit: ... or zero on success.
>
> done.
>
>> > +
>> > +struct iomap
>> > +------------
>> > +
>> > +The filesystem returns the mappings via the following structure.
>> 
>> Filesystem returns the contiguous file mapping information of logical
>> file offset range to a physically mapped extent via the following
>> structure which iomap uses to perform various file and pagecache
>> related operations listed above.
>
> How about:
>
> "The filesystem communicates to iomap operations the mappings of byte
> ranges of a file to byte ranges of a storage device with the structure
> below."
>

Sounds good.

>> > +For documentation purposes, the structure has been reordered to group
>> > +fields that go together logically.
>> > +
>> > +.. code-block:: c
>> > +
>> > + struct iomap {
>> > +     loff_t                       offset;
>> > +     u64                          length;
>> > +
>> > +     u16                          type;
>> > +     u16                          flags;
>> > +
>> > +     u64                          addr;
>> > +     struct block_device          *bdev;
>> > +     struct dax_device            *dax_dev;
>> > +     void                         *inline_data;
>> > +
>> > +     void                         *private;
>> > +
>> > +     const struct iomap_folio_ops *folio_ops;
>> > +
>> > +     u64                          validity_cookie;
>> > + };
>> > +
>> > +The information is useful for translating file operations into action.
>> > +The actions taken are specific to the target of the operation, such as
>> > +disk cache, physical storage devices, or another part of the kernel.
>> 
>> I think the wording "action" & trying to make it so generic w/o mapping
>> what "action" refers here for "disk cache", "physical storage device" or
>> "other parts of the kernel", gets a bit confusing.
>> 
>> Do you think we should map those to some examples maybe?
>> BTW, with added definition of "struct iomap" which I mentioned above,
>> I am even fine if we want to drop this paragraph. 
>
> Yeah, I'll delete the paragraph.
>
>> > +
>> > + * ``offset`` and ``length`` describe the range of file offsets, in
>> > +   bytes, covered by this mapping.
>> > +   These fields must always be set by the filesystem.
>> > +
>> > + * ``type`` describes the type of the space mapping:
>> 
>> This field is set by the filesystem in ->iomap_begin call.
>> 
>> > +
>> > +   * **IOMAP_HOLE**: No storage has been allocated.
>> > +     This type must never be returned in response to an IOMAP_WRITE
>> > +     operation because writes must allocate and map space, and return
>> > +     the mapping.
>> > +     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
>> > +     iomap does not support writing (whether via pagecache or direct
>> > +     I/O) to a hole.
>> > +
>> > +   * **IOMAP_DELALLOC**: A promise to allocate space at a later time
>> > +     ("delayed allocation").
>> > +     If the filesystem returns IOMAP_F_NEW here and the write fails, the
>> > +     ``->iomap_end`` function must delete the reservation.
>> > +     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
>> > +
>> > +   * **IOMAP_MAPPED**: The file range maps to specific space on the
>> > +     storage device.
>> > +     The device is returned in ``bdev`` or ``dax_dev``.
>> > +     The device address, in bytes, is returned via ``addr``.
>> > +
>> > +   * **IOMAP_UNWRITTEN**: The file range maps to specific space on the
>> > +     storage device, but the space has not yet been initialized.
>> > +     The device is returned in ``bdev`` or ``dax_dev``.
>> > +     The device address, in bytes, is returned via ``addr``.
>> > +     Reads will return zeroes to userspace.
>> 
>> Reads to this type of mapping will return zeroes to the caller.
>
> Reads from this type of mapping, but yes.
>
>> > +     For a write or writeback operation, the ioend should update the
>> > +     mapping to MAPPED.
>> 
>> Refer to section "Writeback ioend Completion" for more details.
>
> There are two here -- one for the pagecache, and one for directio.
>
>> > +
>> > +   * **IOMAP_INLINE**: The file range maps to the memory buffer
>> > +     specified by ``inline_data``.
>> > +     For write operation, the ``->iomap_end`` function presumably
>> > +     handles persisting the data.
>> 
>> Is it? Or do we just mark the inode as dirty?
>
> gfs2 actually starts a transaction in ->iomap_begin and commits or
> cancels it in ->iomap_end.
>

ok.

>> > +     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
>> > +
>> > + * ``flags`` describe the status of the space mapping.
>> > +   These flags should be set by the filesystem in ``->iomap_begin``:
>> > +
>> > +   * **IOMAP_F_NEW**: The space under the mapping is newly allocated.
>> > +     Areas that will not be written to must be zeroed.
>> 
>> In case of DAX, we have to invalidate those existing mappings which
>> might have a "hole" page mapped.
>
> Isn't that an internal detail of the fs/dax.c code?  The filesystem
> doesn't have to do the invalidation or even know about hole pages.
>

Right. Sorry about that. I assumed dax_iomap_rw() implementation
is a part of iomap :)

>> > +     If a write fails and the mapping is a space reservation, the
>> > +     reservation must be deleted.
>> > +
>> > +   * **IOMAP_F_DIRTY**: The inode will have uncommitted metadata needed
>> > +     to access any data written.
>> > +     fdatasync is required to commit these changes to persistent
>> > +     storage.
>> > +     This needs to take into account metadata changes that *may* be made
>> > +     at I/O completion, such as file size updates from direct I/O.
>> > +
>> > +   * **IOMAP_F_SHARED**: The space under the mapping is shared.
>> > +     Copy on write is necessary to avoid corrupting other file data.
>> > +
>> > +   * **IOMAP_F_BUFFER_HEAD**: This mapping requires the use of buffer
>> > +     heads for pagecache operations.
>> > +     Do not add more uses of this.
>> > +
>> > +   * **IOMAP_F_MERGED**: Multiple contiguous block mappings were
>> > +     coalesced into this single mapping.
>> > +     This is only useful for FIEMAP.
>> > +
>> > +   * **IOMAP_F_XATTR**: The mapping is for extended attribute data, not
>> > +     regular file data.
>> > +     This is only useful for FIEMAP.
>> > +
>> > +   * **IOMAP_F_PRIVATE**: Starting with this value, the upper bits can
>> > +     be set by the filesystem for its own purposes.
>> > +
>> > +   These flags can be set by iomap itself during file operations.
>> > +   The filesystem should supply an ``->iomap_end`` function to observe
>> > +   these flags:
>> > +
>> > +   * **IOMAP_F_SIZE_CHANGED**: The file size has changed as a result of
>> > +     using this mapping.
>> > +
>> > +   * **IOMAP_F_STALE**: The mapping was found to be stale.
>> > +     iomap will call ``->iomap_end`` on this mapping and then
>> > +     ``->iomap_begin`` to obtain a new mapping.
>> > +
>> > +   Currently, these flags are only set by pagecache operations.
>> > +
>> > + * ``addr`` describes the device address, in bytes.
>> > +
>> > + * ``bdev`` describes the block device for this mapping.
>> > +   This only needs to be set for mapped or unwritten operations.
>> > +
>> > + * ``dax_dev`` describes the DAX device for this mapping.
>> > +   This only needs to be set for mapped or unwritten operations, and
>> > +   only for a fsdax operation.
>> 
>> Looks like we can make this union (bdev and dax_dev). Since depending
>> upon IOMAP_DAX or not we only set either dax_dev or bdev.
>
> Separate patch. ;)
>

Yes, in a way I was trying to get an opinion from you and others on
whether it make sense to make bdev and dax_dev as union :)

Looks like this series [1] could be the reason for that. 

[1]: https://lore.kernel.org/all/20211129102203.2243509-1-hch@lst.de/#t

I also don't see any reference to dax code from fs/iomap/buffered-io.c
So maybe we don't need this dax.h header in this file.

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index c5802a459334..e1a6cca3cec2 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -10,7 +10,6 @@
 #include <linux/pagemap.h>
 #include <linux/uio.h>
 #include <linux/buffer_head.h>
-#include <linux/dax.h>
 #include <linux/writeback.h>
 #include <linux/list_sort.h>
 #include <linux/swap.h>
 

>> Sorry Darrick. I will stop here for now.
>> I will continue it from here later.
>
> Ok.  The rest of the doc will hopefully make it more obvious why there's
> the generic discussion up here.
>

Sure. I am going through it.

-ritesh