linux-ext4 - Re: [PATCH] Documentation: document the design of iomap and how to port

Message-ID: <20240610141808.vdsflgcbjmgc37dt@quack3>
Date: Mon, 10 Jun 2024 16:18:08 +0200
From: Jan Kara <jack@...e.cz>
To: "Darrick J. Wong" <djwong@...nel.org>
Cc: Christoph Hellwig <hch@...radead.org>,
	"Ritesh Harjani (IBM)" <ritesh.list@...il.com>,
	linux-ext4@...r.kernel.org, linux-xfs@...r.kernel.org,
	linux-fsdevel@...r.kernel.org, Dave Chinner <david@...morbit.com>,
	Matthew Wilcox <willy@...radead.org>,
	Christian Brauner <brauner@...nel.org>,
	Ojaswin Mujoo <ojaswin@...ux.ibm.com>, Jan Kara <jack@...e.cz>,
	Luis Chamberlain <mcgrof@...nel.org>
Subject: Re: [PATCH] Documentation: document the design of iomap and how to
 port

On Sun 09-06-24 08:55:06, Darrick J. Wong wrote:
>        * invalidate_lock: The pagecache struct address_space
>          rwsemaphore that protects against folio removal.

invalidate_lock is held for read during insertions and for write
during removals. So holding it for read indeed protects against folio
removal, but holding it for write protects against folio insertion (which
some places also use).

>        * validity_cookie is a magic freshness value set by the
>          filesystem that should be used to detect stale mappings. For
>          pagecache operations this is critical for correct operation
>          because page faults can occur, which implies that filesystem
>          locks should not be held between ->iomap_begin and
>          ->iomap_end. Filesystems with completely static mappings
>          need not set this value. Only pagecache operations
>          revalidate mappings.
> 
>          XXX: Should fsdax revalidate as well?

AFAICT no. DAX is more like using direct IO for everything. So there is no
writeback changing mapping state behind your back (and that's the only thing
that is not serialized with i_rwsem or invalidate_lock). Maybe this fact
could be mentioned somewhere around the discussion of iomap_valid(), to
explain how the locking usually works out?

>    iomap implements nearly all the folio and pagecache management
>    that filesystems once had to implement themselves. This means that
>    the filesystem need not know the details of allocating, mapping,
>    managing uptodate and dirty state, or writeback of pagecache
>    folios. Unless the filesystem explicitly opts in to buffer heads,
>    they will not be used, which makes buffered I/O much more
>    efficient, and willy much happier.
		    ^^^ unless we make it a general noun for someone doing
a thankless never-ending conversion job, we should give him a capital W ;).

>    These struct kiocb flags are significant for buffered I/O with
>    iomap:
> 
>        * IOCB_NOWAIT: Only proceed with the I/O if mapping data are
>          already in memory, we do not have to initiate other I/O, and
>          we acquire all filesystem locks without blocking. Neither
>          this flag nor its definition RWF_NOWAIT actually define what
>          this flag means, so this is the best the author could come
>          up with.

RWF_NOWAIT is a performance feature, not a correctness one, hence the
meaning is somewhat vague. It is meant to mean "do the IO only if it
doesn't involve waiting for other IO or other time-consuming operations".
Generally we translate it to "don't wait for i_rwsem, page locks, don't do
block allocation, etc." OTOH we don't bother to special-case internal
filesystem locks (such as EXT4_I(inode)->i_data_sem), and we get away with
it because blocking on them under the constraints in which we generally
perform RWF_NOWAIT IO is exceedingly rare.
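
From the userspace side, the "performance feature" nature of the flag shows
in how it tends to be used: try the cheap path, and fall back to a normal
blocking call if the kernel declines. A minimal sketch, assuming Linux with
a glibc that exposes preadv2() and RWF_NOWAIT; the helper names
read_prefer_nowait and read_head_prefer_nowait are made up here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for preadv2() and RWF_NOWAIT in glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Try a read that must not block (RWF_NOWAIT). If the kernel declines for
 * any reason, retry with an ordinary blocking pread(), which will also
 * report any real error.
 */
static ssize_t read_prefer_nowait(int fd, void *buf, size_t len, off_t off)
{
	struct iovec iov = { .iov_base = buf, .iov_len = len };
	ssize_t ret;

	assert(buf != NULL);
	ret = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
	if (ret >= 0)
		return ret;	/* may be a short read; callers must cope */

	/* Typically EAGAIN (would block) or EOPNOTSUPP (no NOWAIT support). */
	return pread(fd, buf, len, off);
}

/* Usage: read the first len bytes of a file, preferring not to block. */
static ssize_t read_head_prefer_nowait(const char *path, void *buf, size_t len)
{
	ssize_t ret;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	ret = read_prefer_nowait(fd, buf, len, 0);
	close(fd);
	return ret;
}
```

Nothing in this pattern depends on the kernel never blocking at all, only on
blocking being rare on the first attempt, which matches the vague definition
above.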

>       mmap Write Faults
> 
>    The iomap_page_mkwrite function handles a write fault to a folio
>    the pagecache.
     ^^^ to a folio *in* the pagecache? I cannot quite parse the sentence.

>       Truncation
> 
>    Filesystems can call iomap_truncate_page to zero the bytes in the
>    pagecache from EOF to the end of the fsblock during a file
>    truncation operation. truncate_setsize or truncate_pagecache will
>    take care of everything after the EOF block. IOMAP_ZERO will be
>    passed as the flags argument to ->iomap_begin. Callers typically
>    take i_rwsem and invalidate_lock in exclusive mode.

Hum, but i_rwsem and invalidate_lock are usually acquired *before*
iomap_truncate_page() is even called, aren't they? This locking note looks
a bit confusing to me. I'd rather write: "The callers typically hold i_rwsem
and invalidate_lock when calling iomap_truncate_page()." if you want to
mention any locking.

>       Zeroing for File Operations
> 
>    Filesystems can call iomap_zero_range to perform zeroing of the
>    pagecache for non-truncation file operations that are not aligned
>    to the fsblock size. IOMAP_ZERO will be passed as the flags
>    argument to ->iomap_begin. Callers typically take i_rwsem and
>    invalidate_lock in exclusive mode.

Ditto here...

>       Unsharing Reflinked File Data
> 
>    Filesystems can call iomap_file_unshare to force a file sharing
>    storage with another file to preemptively copy the shared data to
>    newly allocate storage. IOMAP_WRITE | IOMAP_UNSHARE will be passed
>    as the flags argument to ->iomap_begin. Callers typically take
>    i_rwsem and invalidate_lock in exclusive mode.

And here.

>   Direct I/O
> 
>    In Linux, direct I/O is defined as file I/O that is issued
>    directly to storage, bypassing the pagecache.
> 
>    The iomap_dio_rw function implements O_DIRECT (direct I/O) reads
>    and writes for files. An optional ops parameter can be passed to
>    change the behavior of direct I/O. The done_before parameter
>    should be set if writes have been initiated prior to the call. The
>    direction of the I/O is determined from the iocb passed in.
> 
>    The flags argument can be any of the following values:
> 
>        * IOMAP_DIO_FORCE_WAIT: Wait for the I/O to complete even if
>          the kiocb is not synchronous.
> 
>        * IOMAP_DIO_OVERWRITE_ONLY: Allocating blocks, zeroing partial
>          blocks, and extensions of the file size are not allowed. The
>          entire file range must to map to a single written or
				  ^^ extra "to"

>          unwritten extent. This flag exists to enable issuing
>          concurrent direct IOs with only the shared i_rwsem held when
>          the file I/O range is not aligned to the filesystem block
>          size. -EAGAIN will be returned if the operation cannot
>          proceed.

<snip>

>     Direct Writes
> 
>    A direct I/O write initiates a write I/O to the storage device to
>    the caller's buffer. Dirty parts of the pagecache are flushed to
>    storage before initiating the write io. The pagecache is
>    invalidated both before and after the write io. The flags value
>    for ->iomap_begin will be IOMAP_DIRECT | IOMAP_WRITE with any
>    combination of the following enhancements:
> 
>        * IOMAP_NOWAIT: Write if mapping data are already in memory.
>          Does not initiate other I/O or block on filesystem locks.
> 
>        * IOMAP_OVERWRITE_ONLY: Allocating blocks and zeroing partial
>          blocks is not allowed. The entire file range must to map to
							     ^^ extra "to"

>          a single written or unwritten extent. The file I/O range
>          must be aligned to the filesystem block size.

This seems to be an XFS-specific thing? At least I don't see anything in
the generic iomap code depending on this.
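
For what it is worth, the alignment condition in the quoted text boils down
to a check of the following shape. This is a userspace sketch with an
invented helper name (range_is_fsblock_aligned), assuming a power-of-two
block size. It is not taken from the generic iomap code; a filesystem that
wanted this restriction would presumably have to do such a check itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if both the start offset and the length of an I/O are
 * multiples of the filesystem block size (assumed to be a power of two).
 */
static bool range_is_fsblock_aligned(uint64_t pos, uint64_t len,
				     uint32_t blocksize)
{
	uint64_t mask = (uint64_t)blocksize - 1;

	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
	return ((pos | len) & mask) == 0;
}
```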

>     fsdax Writes
> 
>    A fsdax write initiates a memcpy to the storage device to the
							    ^^ from

>    caller's buffer. The flags value for ->iomap_begin will be
>    IOMAP_DAX | IOMAP_WRITE with any combination of the following
>    enhancements:
> 
>        * IOMAP_NOWAIT: Write if mapping data are already in memory.
>          Does not initiate other I/O or block on filesystem locks.
> 
>        * IOMAP_OVERWRITE_ONLY: Allocating blocks and zeroing partial
>          blocks is not allowed. The entire file range must to map to
							     ^^ extra "to"

>          a single written or unwritten extent. The file I/O range
>          must be aligned to the filesystem block size.
> 
>    Callers commonly hold i_rwsem in exclusive mode.
> 
>     mmap Faults
> 
>    The dax_iomap_fault function handles read and write faults to
>    fsdax storage. For a read fault, IOMAP_DAX | IOMAP_FAULT will be
>    passed as the flags argument to ->iomap_begin. For a write fault,
>    IOMAP_DAX | IOMAP_FAULT | IOMAP_WRITE will be passed as the flags
>    argument to ->iomap_begin.
> 
>    Callers commonly hold the same locks as they do to call their
>    iomap pagecache counterparts.
> 
>     Truncation, fallocate, and Unsharing
> 
>    For fsdax files, the following functions are provided to replace
>    their iomap pagecache I/O counterparts. The flags argument to
>    ->iomap_begin are the same as the pagecache counterparts, with
>    IOMAP_DIO added.
	  ^^^ IOMAP_DAX?

>        * dax_file_unshare
> 
>        * dax_zero_range
> 
>        * dax_truncate_page
> 
>    Callers commonly hold the same locks as they do to call their
>    iomap pagecache counterparts.

>   How to Convert to iomap?
> 
>    First, add #include <linux/iomap.h> from your source code and add
>    select FS_IOMAP to your filesystem's Kconfig option. Build the
>    kernel, run fstests with the -g all option across a wide variety
>    of your filesystem's supported configurations to build a baseline
>    of which tests pass and which ones fail.
> 
>    The recommended approach is first to implement ->iomap_begin (and
>    ->iomap->end if necessary) to allow iomap to obtain a read-only
       ^^^^ ->iomap_end

<snip>

>    Most likely at this point, the buffered read and write paths will
>    still to be converted. The mapping functions should all work
          ^^ need to be

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR