linux-ext4 - [TEXT 2/3] file operations

Message-ID: <20240614214702.GM6125@frogsfrogsfrogs>
Date: Fri, 14 Jun 2024 14:47:02 -0700
From: "Darrick J. Wong" <djwong@...nel.org>
To: "Ritesh Harjani (IBM)" <ritesh.list@...il.com>
Cc: linux-ext4@...r.kernel.org, linux-xfs@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, Dave Chinner <david@...morbit.com>,
	Matthew Wilcox <willy@...radead.org>,
	Christoph Hellwig <hch@...radead.org>,
	Christian Brauner <brauner@...nel.org>,
	Ojaswin Mujoo <ojaswin@...ux.ibm.com>, Jan Kara <jack@...e.cz>,
	Luis Chamberlain <mcgrof@...nel.org>
Subject: [TEXT 2/3] file operations

Here's the file operations provided by the upper layer of iomap.
https://djwong.org/docs/iomap/operations.html

--D

                       Supported File Operations

   Table of Contents

     * Buffered I/O
          * struct address_space_operations
          * struct iomap_folio_ops
          * Internal per-Folio State
          * Buffered Readahead and Reads
          * Buffered Writes
               * mmap Write Faults
               * Buffered Write Failures
               * Zeroing for File Operations
               * Unsharing Reflinked File Data
          * Truncation
          * Pagecache Writeback
               * struct iomap_writeback_ops
               * Pagecache Writeback Completion
     * Direct I/O
          * Return Values
          * Direct Reads
          * Direct Writes
          * struct iomap_dio_ops
     * DAX I/O
          * fsdax Reads
          * fsdax Writes
               * fsdax mmap Faults
          * fsdax Truncation, fallocate, and Unsharing
          * fsdax Deduplication
     * Seeking Files
          * SEEK_DATA
          * SEEK_HOLE
     * Swap File Activation
     * File Space Mapping Reporting
          * FS_IOC_FIEMAP
          * FIBMAP (deprecated)

   Below is a discussion of the high-level file operations that
   iomap implements.

                              Buffered I/O

   Buffered I/O is the default file I/O path in Linux. File contents
   are cached in memory ("pagecache") to satisfy reads and writes.
   Dirty cache contents are written back to disk at some later point;
   writeback can be forced via fsync and its variants.

   iomap implements nearly all the folio and pagecache management
   that filesystems have to implement themselves under the legacy I/O
   model. This means that the filesystem need not know the details of
   allocating, mapping, managing uptodate and dirty state, or
   writeback of pagecache folios. Under the legacy I/O model, this
   was managed very inefficiently with linked lists of buffer heads
   instead of the per-folio bitmaps that iomap uses. Unless the
   filesystem explicitly opts in to buffer heads, they will not be
   used, which makes buffered I/O much more efficient, and the
   pagecache maintainer much happier.

struct address_space_operations

   The following iomap functions can be referenced directly from the
   address space operations structure:

       * iomap_dirty_folio
       * iomap_release_folio
       * iomap_invalidate_folio
       * iomap_is_partially_uptodate

   The following address space operations can be wrapped easily:

       * read_folio
       * readahead
       * writepages
       * bmap
       * swap_activate

struct iomap_folio_ops

   The ->iomap_begin function for pagecache operations may set the
   struct iomap::folio_ops field to an ops structure to override
   default behaviors of iomap:

 struct iomap_folio_ops {
     struct folio *(*get_folio)(struct iomap_iter *iter, loff_t pos,
                                unsigned len);
     void (*put_folio)(struct inode *inode, loff_t pos, unsigned copied,
                       struct folio *folio);
     bool (*iomap_valid)(struct inode *inode, const struct iomap *iomap);
 };

   iomap calls these functions:

       * get_folio: Called to allocate and return an active reference
         to a locked folio prior to starting a write. If this
         function is not provided, iomap will call iomap_get_folio.
         This could be used to set up per-folio filesystem state for
         a write.

       * put_folio: Called to unlock and put a folio after a
         pagecache operation completes. If this function is not
         provided, iomap will folio_unlock and folio_put on its own.
         This could be used to commit per-folio filesystem state that
         was set up by ->get_folio.

       * iomap_valid: The filesystem may not hold locks between
         ->iomap_begin and ->iomap_end because pagecache operations
         can take folio locks, fault on userspace pages, initiate
         writeback for memory reclamation, or engage in other
         time-consuming actions. If a file's space mapping data are
         mutable, it is possible that the mapping for a particular
         pagecache folio can change in the time it takes to allocate,
         install, and lock that folio.

         For the pagecache, races can happen if writeback doesn't
         take i_rwsem or invalidate_lock and updates mapping
         information. Races can also happen if the filesystem allows
         concurrent writes. For such files, the mapping must be
         revalidated after the folio lock has been taken so that
         iomap can manage the folio correctly.

         fsdax does not need this revalidation because there's no
         writeback and no support for unwritten extents.

         Filesystems subject to this kind of race must provide a
         ->iomap_valid function to decide if the mapping is still
         valid. If the mapping is not valid, the mapping will be
         sampled again.

         To support making the validity decision, the filesystem's
         ->iomap_begin function may set struct iomap::validity_cookie
         at the same time that it populates the other iomap fields. A
         simple validation cookie implementation is a sequence
         counter. If the filesystem bumps the sequence counter every
         time it modifies the inode's extent map, it can be placed in
         the struct iomap::validity_cookie during ->iomap_begin. If
         the value in the cookie differs from the value the
         filesystem holds when the mapping is passed back to
         ->iomap_valid, then the iomap should be considered stale
         and validation fails.
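
   The sequence counter scheme described above can be modelled in a
   few lines of userspace C. This is only a sketch: every toy_* name
   below is hypothetical and stands in for filesystem-private code;
   none of them are iomap or kernel APIs, and locking is left out
   entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct toy_inode {
    uint64_t extent_seq;        /* bumped on every extent map change */
};

struct toy_iomap {
    uint64_t validity_cookie;   /* models struct iomap::validity_cookie */
};

/* Models ->iomap_begin: sample the mapping and record the counter. */
static void toy_iomap_begin(const struct toy_inode *ip, struct toy_iomap *map)
{
    assert(ip && map);
    map->validity_cookie = ip->extent_seq;
}

/* Models any extent map change, e.g. writeback converting a mapping. */
static void toy_modify_extent_map(struct toy_inode *ip)
{
    assert(ip);
    ip->extent_seq++;
}

/* Models ->iomap_valid: the mapping is stale once the counter has moved. */
static bool toy_iomap_valid(const struct toy_inode *ip,
                            const struct toy_iomap *map)
{
    assert(ip && map);
    return map->validity_cookie == ip->extent_seq;
}
```

   A mapping sampled by toy_iomap_begin stays valid until the next
   toy_modify_extent_map call; after that, toy_iomap_valid returns
   false and the caller would sample the mapping again.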

   These struct kiocb flags are significant for buffered I/O with
   iomap:

       * IOCB_NOWAIT: Turns on IOMAP_NOWAIT.

Internal per-Folio State

   If the fsblock size matches the size of a pagecache folio, it is
   assumed that all disk I/O operations will operate on the entire
   folio. The uptodate (memory contents are at least as new as what's
   on disk) and dirty (memory contents are newer than what's on disk)
   status of the folio are all that's needed for this case.

   If the fsblock size is less than the size of a pagecache folio,
   iomap tracks the per-fsblock uptodate and dirty state itself. This
   enables iomap to handle both "bs < ps" filesystems and large
   folios in the pagecache.

   iomap internally tracks two state bits per fsblock:

       * uptodate: iomap will try to keep folios fully up to date. If
         there are read(ahead) errors, those fsblocks will not be
         marked uptodate. The folio itself will be marked uptodate
         when all fsblocks within the folio are uptodate.
       * dirty: iomap will set the per-block dirty state when
         programs write to the file. The folio itself will be marked
         dirty when any fsblock within the folio is dirty.

   iomap also tracks the number of read and write disk IOs that are
   in flight. This structure is much lighter weight than struct
   buffer_head because there is only one per folio, and the
   per-fsblock overhead is two bits vs. 104 bytes.
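
   The relationship between per-fsblock bits and folio-level state
   can be illustrated with a small userspace model. This is a sketch
   under stated assumptions, not iomap's actual data structure: the
   toy_* names are hypothetical, the model is capped at 64 fsblocks
   per folio, and the in-flight I/O counters are omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_BLOCKS 64u

/* Hypothetical model: one uptodate bit and one dirty bit per fsblock. */
struct toy_folio_state {
    unsigned int nr_blocks;     /* fsblocks per folio, 1..64 */
    uint64_t uptodate;          /* bit i set: fsblock i is uptodate */
    uint64_t dirty;             /* bit i set: fsblock i is dirty */
};

/* Mask with one bit set for every fsblock in the folio. */
static uint64_t toy_full_mask(unsigned int nr_blocks)
{
    assert(nr_blocks >= 1 && nr_blocks <= TOY_MAX_BLOCKS);
    return nr_blocks == 64 ? UINT64_MAX : (UINT64_C(1) << nr_blocks) - 1;
}

static void toy_set_block_uptodate(struct toy_folio_state *fs, unsigned int blk)
{
    assert(blk < fs->nr_blocks);
    fs->uptodate |= UINT64_C(1) << blk;
}

static void toy_set_block_dirty(struct toy_folio_state *fs, unsigned int blk)
{
    assert(blk < fs->nr_blocks);
    fs->dirty |= UINT64_C(1) << blk;
}

/* The folio is uptodate only when all of its fsblocks are. */
static bool toy_folio_uptodate(const struct toy_folio_state *fs)
{
    return fs->uptodate == toy_full_mask(fs->nr_blocks);
}

/* The folio is dirty as soon as any one fsblock is. */
static bool toy_folio_dirty(const struct toy_folio_state *fs)
{
    return fs->dirty != 0;
}
```

   With four fsblocks per folio, marking three of them uptodate
   leaves the folio !uptodate, while dirtying a single fsblock is
   enough to make the whole folio dirty.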

   Filesystems wishing to turn on large folios in the pagecache
   should call mapping_set_large_folios when initializing the incore
   inode.

Buffered Readahead and Reads

   The iomap_readahead function initiates readahead to the pagecache.
   The iomap_read_folio function reads one folio's worth of data into
   the pagecache. The flags argument to ->iomap_begin will be set to
   zero. The pagecache takes whatever locks it needs before calling
   the filesystem.

Buffered Writes

   The iomap_file_buffered_write function writes an iocb to the
   pagecache. IOMAP_WRITE or IOMAP_WRITE | IOMAP_NOWAIT will be
   passed as the flags argument to ->iomap_begin. Callers commonly
   take i_rwsem in either shared or exclusive mode before calling
   this function.

  mmap Write Faults

   The iomap_page_mkwrite function handles a write fault to a folio
   in the pagecache. IOMAP_WRITE | IOMAP_FAULT will be passed as the
   flags argument to ->iomap_begin. Callers commonly take the mmap
   invalidate_lock in shared or exclusive mode before calling this
   function.

  Buffered Write Failures

   After a short write to the pagecache, the areas not written will
   not be marked dirty. If the filesystem made delayed allocation
   reservations for those areas, it must arrange to cancel them
   because writeback will not consume the reservation. The
   iomap_file_buffered_write_punch_delalloc function can be called
   from a ->iomap_end function to find all the clean areas of the
   folios caching a fresh (IOMAP_F_NEW) delalloc mapping. It takes
   the invalidate_lock.

   The filesystem must supply a function punch to be called for each
   file range in this state. This function must only remove delayed
   allocation reservations, in case another thread racing with the
   current thread writes successfully to the same region and triggers
   writeback to flush the dirty data out to disk.

  Zeroing for File Operations

   Filesystems can call iomap_zero_range to perform zeroing of the
   pagecache for non-truncation file operations that are not aligned
   to the fsblock size. IOMAP_ZERO will be passed as the flags
   argument to ->iomap_begin. Callers typically hold i_rwsem and
   invalidate_lock in exclusive mode before calling this function.

  Unsharing Reflinked File Data

   Filesystems can call iomap_file_unshare to force a file sharing
   storage with another file to preemptively copy the shared data to
   newly allocated storage. IOMAP_WRITE | IOMAP_UNSHARE will be passed
   as the flags argument to ->iomap_begin. Callers typically hold
   i_rwsem and invalidate_lock in exclusive mode before calling this
   function.

Truncation

   Filesystems can call iomap_truncate_page to zero the bytes in the
   pagecache from EOF to the end of the fsblock during a file
   truncation operation. truncate_setsize or truncate_pagecache will
   take care of everything after the EOF block. IOMAP_ZERO will be
   passed as the flags argument to ->iomap_begin. Callers typically
   hold i_rwsem and invalidate_lock in exclusive mode before calling
   this function.

Pagecache Writeback

   Filesystems can call iomap_writepages to respond to a request to
   write dirty pagecache folios to disk. The mapping and wbc
   parameters should be passed unchanged. The wpc pointer should be
   allocated by the filesystem and must be initialized to zero.

   The pagecache will lock each folio before trying to schedule it
   for writeback. It does not lock i_rwsem or invalidate_lock.

   The dirty bit will be cleared for all folios run through the
   ->map_blocks machinery described below even if the writeback
   fails. This is to prevent dirty folio clots when storage devices
   fail; an -EIO is recorded for userspace to collect via fsync.

   The ops structure must be specified and is as follows:

  struct iomap_writeback_ops

 struct iomap_writeback_ops {
     int (*map_blocks)(struct iomap_writepage_ctx *wpc, struct inode *inode,
                       loff_t offset, unsigned len);
     int (*prepare_ioend)(struct iomap_ioend *ioend, int status);
     void (*discard_folio)(struct folio *folio, loff_t pos);
 };

   The fields are as follows:

       * map_blocks: Sets wpc->iomap to the space mapping of the file
         range (in bytes) given by offset and len. iomap calls this
         function for each dirty fsblock in each dirty folio, though
         it will reuse mappings for runs of contiguous dirty fsblocks
         within a folio. Do not return IOMAP_INLINE mappings here;
         the ->iomap_end function must deal with persisting written
         data. Do not return IOMAP_DELALLOC mappings here; iomap
         currently requires mapping to allocated space. Filesystems
         can skip a potentially expensive mapping lookup if the
         mappings have not changed. This revalidation must be
         open-coded by the filesystem; it is unclear if
         iomap::validity_cookie can be reused for this purpose. This
         function must be supplied by the filesystem.
       * prepare_ioend: Enables filesystems to transform the
         writeback ioend or perform any other preparatory work before
         the writeback I/O is submitted. This might include pre-write
         space accounting updates, or installing a custom ->bi_end_io
         function for internal purposes, such as deferring the ioend
         completion to a workqueue to run metadata update
         transactions from process context. This function is
         optional.
       * discard_folio: iomap calls this function after ->map_blocks
         fails to schedule I/O for any part of a dirty folio. The
         function should throw away any reservations that may have
         been made for the write. The folio will be marked clean and
         an -EIO recorded in the pagecache. Filesystems can use this
         callback to remove delalloc reservations to avoid having
         delalloc reservations for clean pagecache. This function is
         optional.

  Pagecache Writeback Completion

   To handle the bookkeeping that must happen after disk I/O for
   writeback completes, iomap creates chains of struct iomap_ioend
   objects that wrap the bio that is used to write pagecache data to
   disk. By default, iomap finishes writeback ioends by clearing the
   writeback bit on the folios attached to the ioend. If the write
   failed, it will also set the error bits on the folios and the
   address space. This can happen in interrupt or process context,
   depending on the storage device.

   Filesystems that need to update internal bookkeeping (e.g.
   unwritten extent conversions) should provide a ->prepare_ioend
   function to set struct iomap_ioend::bio::bi_end_io to its own
   function. This function should call iomap_finish_ioends after
   finishing its own work (e.g. unwritten extent conversion).

   Some filesystems may wish to amortize the cost of running metadata
   transactions for post-writeback updates by batching them. They may
   also require transactions to run from process context, which
   implies punting batches to a workqueue. iomap ioends contain a
   list_head to enable batching.

   Given a batch of ioends, iomap has a few helpers to assist with
   amortization:

       * iomap_sort_ioends: Sort all the ioends in the list by file
         offset.
       * iomap_ioend_try_merge: Given an ioend that is not in any
         list and a separate list of sorted ioends, merge as many of
         the ioends from the head of the list into the given ioend.
         ioends can only be merged if the file range and storage
         addresses are contiguous; the unwritten and shared status
         are the same; and the write I/O outcome is the same. The
         merged ioends become their own list.
       * iomap_finish_ioends: Finish an ioend that possibly has other
         ioends linked to it.
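
   The sort and merge rules above can be sketched as a userspace
   model. Everything here is an illustration under assumptions: the
   toy_* names and the struct layout are hypothetical, a plain array
   stands in for the kernel's list_head chain, and merging is
   modelled simply by growing the size of the first ioend.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the fields that matter when merging. */
struct toy_ioend {
    uint64_t offset;        /* file offset, in bytes */
    uint64_t size;          /* length, in bytes */
    uint64_t disk_addr;     /* storage address, in bytes */
    unsigned int flags;     /* unwritten/shared status bits */
    int error;              /* write I/O outcome */
};

static int toy_ioend_cmp(const void *a, const void *b)
{
    const struct toy_ioend *ia = a, *ib = b;

    if (ia->offset < ib->offset)
        return -1;
    return ia->offset > ib->offset;
}

/* Models iomap_sort_ioends: order the batch by file offset. */
static void toy_sort_ioends(struct toy_ioend *v, size_t n)
{
    qsort(v, n, sizeof(*v), toy_ioend_cmp);
}

/* Same outcome, same status, contiguous in the file and on storage. */
static bool toy_can_merge(const struct toy_ioend *io,
                          const struct toy_ioend *next)
{
    return io->error == next->error &&
           io->flags == next->flags &&
           io->offset + io->size == next->offset &&
           io->disk_addr + io->size == next->disk_addr;
}

/*
 * Models iomap_ioend_try_merge: absorb as many entries as possible from
 * the head of the sorted batch into io. Returns the number absorbed.
 */
static size_t toy_try_merge(struct toy_ioend *io,
                            const struct toy_ioend *sorted, size_t n)
{
    size_t merged = 0;

    assert(io && (sorted || n == 0));
    while (merged < n && toy_can_merge(io, &sorted[merged])) {
        io->size += sorted[merged].size;
        merged++;
    }
    return merged;
}
```

   Merging stops at the first entry that is not contiguous or whose
   status differs, so a gap in the file range leaves the remaining
   ioends in the batch for a later pass.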

                               Direct I/O

   In Linux, direct I/O is defined as file I/O that is issued
   directly to storage, bypassing the pagecache. The iomap_dio_rw
   function implements O_DIRECT (direct I/O) reads and writes for
   files.

 ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
                      const struct iomap_ops *ops,
                      const struct iomap_dio_ops *dops,
                      unsigned int dio_flags, void *private,
                      size_t done_before);

   The filesystem can provide the dops parameter if it needs to
   perform extra work before or after the I/O is issued to storage.
   The done_before parameter tells iomap how much of the request has
   already been transferred. It is used to continue a request
   asynchronously when part of the request has already been completed
   synchronously.

   The done_before parameter should be set if writes for the iocb
   have been initiated prior to the call. The direction of the I/O is
   determined from the iocb passed in.

   The dio_flags argument can be set to any combination of the
   following values:

       * IOMAP_DIO_FORCE_WAIT: Wait for the I/O to complete even if
         the kiocb is not synchronous.
       * IOMAP_DIO_OVERWRITE_ONLY: Perform a pure overwrite for this
         range or fail with -EAGAIN. This can be used by filesystems
         with complex unaligned I/O write paths to provide an
         optimised fast path for unaligned writes. If a pure
         overwrite can be performed, then serialisation against other
         I/Os to the same filesystem block(s) is unnecessary as there
         is no risk of stale data exposure or data loss. If a pure
         overwrite cannot be performed, then the filesystem can
         perform the serialisation steps needed to provide exclusive
         access to the unaligned I/O range so that it can perform
         allocation and sub-block zeroing safely. Filesystems can use
         this flag to try to reduce locking contention, but a lot of
         detailed checking is required to do it correctly.
       * IOMAP_DIO_PARTIAL: If a page fault occurs, return whatever
         progress has already been made. The caller may deal with the
         page fault and retry the operation. If the caller decides to
         retry the operation, it should pass the accumulated return
         values of all previous calls as the done_before parameter to
         the next call.

   These struct kiocb flags are significant for direct I/O with
   iomap:

       * IOCB_NOWAIT: Turns on IOMAP_NOWAIT.
       * IOCB_SYNC: Ensure that the device has persisted data to disk
         before completing the call. In the case of pure overwrites,
         the I/O may be issued with FUA enabled.
       * IOCB_HIPRI: Poll for I/O completion instead of waiting for
         an interrupt. Only meaningful for asynchronous I/O, and only
         if the entire I/O can be issued as a single struct bio.
       * IOCB_DIO_CALLER_COMP: Try to run I/O completion from the
         caller's process context. See linux/fs.h for more details.

   Filesystems should call iomap_dio_rw from ->read_iter and
   ->write_iter, and set FMODE_CAN_ODIRECT in the ->open function for
   the file. They should not set ->direct_IO, which is deprecated.

   If a filesystem wishes to perform its own work before direct I/O
   completion, it should call __iomap_dio_rw. If its return value is
   not an error pointer or a NULL pointer, the filesystem should pass
   the return value to iomap_dio_complete after finishing its
   internal work.

Return Values

   iomap_dio_rw can return one of the following:

       * A non-negative number of bytes transferred.
       * -ENOTBLK: Fall back to buffered I/O. iomap itself will
         return this value if it cannot invalidate the page cache
         before issuing the I/O to storage. The ->iomap_begin or
         ->iomap_end functions may also return this value.
       * -EIOCBQUEUED: The asynchronous direct I/O request has been
         queued and will be completed separately.
       * Any of the other negative error codes.
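
   A caller has to tell these four outcomes apart. The sketch below
   models that dispatch in userspace C; it is an illustration only.
   The toy_* names are hypothetical, and because EIOCBQUEUED is not
   exported to userspace, TOY_EIOCBQUEUED is a stand-in constant
   defined just for this example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

/* Stand-in for the kernel-internal EIOCBQUEUED; an assumption of this model. */
#define TOY_EIOCBQUEUED 529

enum toy_dio_action {
    TOY_DIO_DONE,               /* ret is the byte count transferred */
    TOY_DIO_FALLBACK_BUFFERED,  /* retry through the buffered I/O path */
    TOY_DIO_QUEUED,             /* async I/O will complete separately */
    TOY_DIO_FAILED,             /* pass the error back to the caller */
};

/* Classify a value returned by a (modelled) direct I/O call. */
static enum toy_dio_action toy_classify_dio_result(ssize_t ret)
{
    /* Error codes are negative errno values and fit in an int. */
    assert(ret >= -4095);

    if (ret >= 0)
        return TOY_DIO_DONE;
    if (ret == -ENOTBLK)
        return TOY_DIO_FALLBACK_BUFFERED;
    if (ret == -TOY_EIOCBQUEUED)
        return TOY_DIO_QUEUED;
    return TOY_DIO_FAILED;
}
```

   In this model a zero-byte transfer counts as success, and only
   -ENOTBLK triggers the buffered fallback; every other negative
   value except the queued case is treated as a plain failure.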

Direct Reads

   A direct I/O read initiates a read I/O from the storage device to
   the caller's buffer. Dirty parts of the pagecache are flushed to
   storage before initiating the read io. The flags value for
   ->iomap_begin will be IOMAP_DIRECT with any combination of the
   following enhancements:

       * IOMAP_NOWAIT, as defined previously.

   Callers commonly hold i_rwsem in shared mode before calling this
   function.

Direct Writes

   A direct I/O write initiates a write I/O to the storage device
   from the caller's buffer. Dirty parts of the pagecache are flushed
   to storage before initiating the write io. The pagecache is
   invalidated both before and after the write io. The flags value
   for ->iomap_begin will be IOMAP_DIRECT | IOMAP_WRITE with any
   combination of the following enhancements:

       * IOMAP_NOWAIT, as defined previously.
       * IOMAP_OVERWRITE_ONLY: Allocating blocks and zeroing partial
         blocks is not allowed. The entire file range must map to a
         single written or unwritten extent. The file I/O range must
         be aligned to the filesystem block size if the mapping is
         unwritten and the filesystem cannot handle zeroing the
         unaligned regions without exposing stale contents.

   Callers commonly hold i_rwsem in shared or exclusive mode before
   calling this function.

struct iomap_dio_ops

 struct iomap_dio_ops {
     void (*submit_io)(const struct iomap_iter *iter, struct bio *bio,
                       loff_t file_offset);
     int (*end_io)(struct kiocb *iocb, ssize_t size, int error,
                   unsigned flags);
     struct bio_set *bio_set;
 };

   The fields of this structure are as follows:

       * submit_io: iomap calls this function when it has constructed
         a struct bio object for the I/O requested, and wishes to
         submit it to the block device. If no function is provided,
         submit_bio will be called directly. Filesystems that would
         like to perform additional work before submission (e.g.
         data replication for btrfs) should implement this function.
       * end_io: This is called after the struct bio completes. This
         function should perform post-write conversions of unwritten
         extent mappings, handle write failures, etc. The flags
         argument may be set to a combination of the following:
            * IOMAP_DIO_UNWRITTEN: The mapping was unwritten, so the
              ioend should mark the extent as written.
            * IOMAP_DIO_COW: Writing to the space in the mapping
              required a copy on write operation, so the ioend should
              switch mappings.
       * bio_set: This allows the filesystem to provide a custom
         bio_set for allocating direct I/O bios. This enables
         filesystems to stash additional per-bio information for
         private use. If this field is NULL, generic struct bio
         objects will be used.

   Filesystems that want to perform extra work after an I/O
   completion should set a custom ->bi_end_io function via
   ->submit_io. Afterwards, the custom endio function must call
   iomap_dio_bio_end_io to finish the direct I/O.

                                DAX I/O

   Some storage devices can be directly mapped as memory. These
   devices support a new access mode known as "fsdax" that allows
   loads and stores through the CPU and memory controller.

fsdax Reads

   A fsdax read performs a memcpy from the storage device to the
   caller's buffer. The flags value for ->iomap_begin will be
   IOMAP_DAX with any combination of the following enhancements:

       * IOMAP_NOWAIT, as defined previously.

   Callers commonly hold i_rwsem in shared mode before calling this
   function.

fsdax Writes

   A fsdax write initiates a memcpy to the storage device from the
   caller's buffer. The flags value for ->iomap_begin will be
   IOMAP_DAX | IOMAP_WRITE with any combination of the following
   enhancements:

       * IOMAP_NOWAIT, as defined previously.
       * IOMAP_OVERWRITE_ONLY: The caller requires a pure overwrite
         to be performed from this mapping. This requires the
         filesystem extent mapping to already exist as an
         IOMAP_MAPPED type and span the entire range of the write I/O
         request. If the filesystem cannot map this request in a way
         that allows the iomap infrastructure to perform a pure
         overwrite, it must fail the mapping operation with -EAGAIN.

   Callers commonly hold i_rwsem in exclusive mode before calling
   this function.

  fsdax mmap Faults

   The dax_iomap_fault function handles read and write faults to
   fsdax storage. For a read fault, IOMAP_DAX | IOMAP_FAULT will be
   passed as the flags argument to ->iomap_begin. For a write fault,
   IOMAP_DAX | IOMAP_FAULT | IOMAP_WRITE will be passed as the flags
   argument to ->iomap_begin.

   Callers commonly hold the same locks as they do to call their
   iomap pagecache counterparts.

fsdax Truncation, fallocate, and Unsharing

   For fsdax files, the following functions are provided to replace
   their iomap pagecache I/O counterparts. The flags argument to
   ->iomap_begin is the same as for the pagecache counterparts,
   with IOMAP_DAX added.

       * dax_file_unshare
       * dax_zero_range
       * dax_truncate_page

   Callers commonly hold the same locks as they do to call their
   iomap pagecache counterparts.

fsdax Deduplication

   Filesystems implementing the FIDEDUPERANGE ioctl must call the
   dax_remap_file_range_prep function with their own iomap read ops.

                             Seeking Files

   iomap implements the two iterating whence modes of the llseek
   system call.

SEEK_DATA

   The iomap_seek_data function implements the SEEK_DATA "whence"
   value for llseek. IOMAP_REPORT will be passed as the flags
   argument to ->iomap_begin.

   For unwritten mappings, the pagecache will be searched. Regions of
   the pagecache with a folio mapped and uptodate fsblocks within
   those folios will be reported as data areas.

   Callers commonly hold i_rwsem in shared mode before calling this
   function.

SEEK_HOLE

   The iomap_seek_hole function implements the SEEK_HOLE "whence"
   value for llseek. IOMAP_REPORT will be passed as the flags
   argument to ->iomap_begin.

   For unwritten mappings, the pagecache will be searched. Regions of
   the pagecache with no folio mapped, or a !uptodate fsblock within
   a folio will be reported as sparse hole areas.

   Callers commonly hold i_rwsem in shared mode before calling this
   function.
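
   From userspace, both whence values are reached through lseek. The
   sketch below shows what a program sees for a small, fully written
   file; it says nothing about how a given filesystem implements the
   seek. The demo_* helper names and the /tmp path are assumptions
   of this example, and the fallback SEEK_DATA/SEEK_HOLE definitions
   are the Linux values, used only when the headers do not provide
   them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes SEEK_DATA/SEEK_HOLE on glibc */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Linux values, used only if the headers did not expose them. */
#ifndef SEEK_DATA
#define SEEK_DATA 3
#endif
#ifndef SEEK_HOLE
#define SEEK_HOLE 4
#endif

/*
 * Create an unlinked temporary file holding len bytes of data.
 * Returns an open file descriptor, or -1 on failure.
 */
static int demo_make_data_file(size_t len)
{
    char path[] = "/tmp/seekdemoXXXXXX";
    char buf[4096];
    int fd;

    assert(len <= sizeof(buf));
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);               /* the file goes away when fd is closed */
    memset(buf, 'x', len);
    if (write(fd, buf, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Offset of the next data at or after pos, or -1 (errno ENXIO) if none. */
static off_t demo_next_data(int fd, off_t pos)
{
    return lseek(fd, pos, SEEK_DATA);
}

/* Offset of the next hole at or after pos; EOF counts as a hole. */
static off_t demo_next_hole(int fd, off_t pos)
{
    return lseek(fd, pos, SEEK_HOLE);
}
```

   For a file with no holes, the first data is at the starting
   offset and the only hole is the implicit one at end of file.
   Seeking for data at or past end of file fails.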

                          Swap File Activation

   The iomap_swapfile_activate function finds all the base-page
   aligned regions in a file and sets them up as swap space. The file
   will be fsync()'d before activation. IOMAP_REPORT will be passed
   as the flags argument to ->iomap_begin. All mappings must be
   mapped or unwritten; they cannot be dirty or shared, and cannot
   span multiple block devices. Callers must hold i_rwsem in
   exclusive mode; this is already provided by swapon.

                      File Space Mapping Reporting

   iomap implements two of the file space mapping system calls.

FS_IOC_FIEMAP

   The iomap_fiemap function exports file extent mappings to
   userspace in the format specified by the FS_IOC_FIEMAP ioctl.
   IOMAP_REPORT will be passed as the flags argument to
   ->iomap_begin. Callers commonly hold i_rwsem in shared mode before
   calling this function.

FIBMAP (deprecated)

   iomap_bmap implements FIBMAP. The calling conventions are the same
   as for FIEMAP. This function is only provided to maintain
   compatibility for filesystems that implemented FIBMAP prior to
   conversion. This ioctl is deprecated; do not add a FIBMAP
   implementation to filesystems that do not have it. Callers should
   probably hold i_rwsem in shared mode before calling this function,
   but this is unclear.