linux-kernel - Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path

Message-ID: <eldlhdvhc4sdlmfed5omg6huv5rl6m7ummstlygh2bownaejqn@bykrybkyywzp>
Date: Wed, 4 Feb 2026 15:23:46 +0100
From: Jan Kara <jack@...e.cz>
To: Baokun Li <libaokun1@...wei.com>
Cc: Theodore Tso <tytso@....edu>, Zhang Yi <yi.zhang@...weicloud.com>, 
	Christoph Hellwig <hch@...radead.org>, linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org, adilger.kernel@...ger.ca, jack@...e.cz, ojaswin@...ux.ibm.com, 
	ritesh.list@...il.com, djwong@...nel.org, Zhang Yi <yi.zhang@...wei.com>, 
	yizhang089@...il.com, yangerkun@...wei.com, 
	yukuai@...-78bjiv52429oh8qptp.cn-shenzhen.alb.aliyuncs.com, libaokun9@...il.com
Subject: Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's
 buffered I/O path

On Wed 04-02-26 09:59:36, Baokun Li wrote:
> On 2026-02-03 21:14, Theodore Tso wrote:
> > On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
> >> This means that the ordered journal mode is no longer used in ext4
> >> under the iomap infrastructure.  The main reason is that iomap
> >> processes each folio one by one during writeback. It first holds the
> >> folio lock and then starts a transaction to create the block mapping.
> >> If we still use the ordered mode, we need to perform writeback in
> >> the logging process, which may require initiating a new transaction,
> >> potentially leading to deadlock issues. In addition, ordered journal
> >> mode indeed has many synchronization dependencies, which increase
> >> the risk of deadlocks, and I believe this is one of the reasons why
> >> ext4_do_writepages() is implemented in such a complicated manner.
> >> Therefore, I think we need to give up using the ordered data mode.
> >>
> >> Currently, there are three scenarios where the ordered mode is used:
> >> 1) append write,
> >> 2) partial block truncate down, and
> >> 3) online defragmentation.
> >>
> >> For append write, we can always allocate unwritten blocks to avoid
> >> using the ordered journal mode.
> > This is going to be a pretty severe performance regression, since it
> > means that we will be doubling the journal load for append writes.
> > What we really need to do here is to first write out the data blocks,
> > and then only start the transaction handle to modify the data blocks
> > *after* the data blocks have been written (to heretofore unused
> > blocks that were just allocated).  It means inverting the order in
> > which we write data blocks for the append write case, and in fact it
> > will improve fsync() performance since we won't be gating writing the
> > commit block on the data blocks getting written out in the append
> > write case.
> 
> I have some local demo patches doing something similar, and I think this
> work could be decoupled from Yi's patch set.
> 
> Since inode preallocation (PA) maintains physical block occupancy with a
> logical-to-physical mapping, and ensures on-disk data consistency after
> power failure, it is an excellent location for recording temporary
> occupancy. Furthermore, since inode PA often allocates more blocks than
> requested, it can also help reduce file fragmentation.
> 
> The specific approach is as follows:
> 
> 1. Allocate only the PA during block allocation without inserting it into
>    the extent status tree. Return the PA to the caller and increment its
>    refcount to prevent it from being discarded.
> 
> 2. Issue IOs to the blocks within the inode PA. If IO fails, release the
>    refcount and return -EIO. If successful, proceed to the next step.
> 
> 3. Start a handle upon successful IO completion to convert the inode PA to
>    extents. Release the refcount and update the extent tree.
> 
> 4. If a corresponding extent already exists, we’ll need to punch holes to
>    release the old extent before inserting the new one.

Sounds good. Just to check that I understand correctly: case 4 would happen
only if you really try to do something like COW with this? Normally you'd
just use the already present blocks and write contents into them?
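
For illustration, the four quoted steps can be modelled in user space roughly
as below. This is only a sketch under stated assumptions: every name here
(struct prealloc, pa_alloc, pa_convert, delayed_map_write) is hypothetical and
not an ext4 or jbd2 interface, the allocator is a trivial bump counter, the IO
result is passed in as a flag, and step 4 only handles an old extent that
exactly matches the new range.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are ext4 APIs. */
#define MAX_EXTENTS 8

struct extent { unsigned lblk, pblk, len; };

struct inode_model {
	struct extent extents[MAX_EXTENTS]; /* committed logical->physical map */
	size_t nr_extents;
	unsigned next_free_pblk;            /* trivial bump allocator */
};

struct prealloc {                           /* stand-in for an inode PA */
	unsigned lblk, pblk, len;
	int refcount;
};

/* Step 1: reserve physical blocks without touching the extent map. */
static void pa_alloc(struct inode_model *ino, struct prealloc *pa,
		     unsigned lblk, unsigned len)
{
	pa->lblk = lblk;
	pa->pblk = ino->next_free_pblk;
	pa->len = len;
	pa->refcount = 1;                   /* pinned so it cannot be discarded */
	ino->next_free_pblk += len;
}

/* Look up the committed extent covering a logical block, if any. */
static const struct extent *extent_lookup(const struct inode_model *ino,
					  unsigned lblk)
{
	for (size_t i = 0; i < ino->nr_extents; i++) {
		const struct extent *e = &ino->extents[i];

		if (lblk >= e->lblk && lblk < e->lblk + e->len)
			return e;
	}
	return NULL;
}

/* Steps 3-4: the only part that would run under a journal handle. */
static int pa_convert(struct inode_model *ino, const struct prealloc *pa)
{
	/* Step 4 (simplified): drop an old extent that matches exactly. */
	for (size_t i = 0; i < ino->nr_extents; i++) {
		if (ino->extents[i].lblk == pa->lblk &&
		    ino->extents[i].len == pa->len) {
			ino->extents[i] = ino->extents[--ino->nr_extents];
			break;
		}
	}
	if (ino->nr_extents == MAX_EXTENTS)
		return -ENOSPC;
	ino->extents[ino->nr_extents++] =
		(struct extent){ pa->lblk, pa->pblk, pa->len };
	return 0;
}

/* Data first, metadata second; io_ok stands in for the IO completion status. */
static int delayed_map_write(struct inode_model *ino, unsigned lblk,
			     unsigned len, bool io_ok)
{
	struct prealloc pa;
	int ret;

	pa_alloc(ino, &pa, lblk, len);      /* step 1 */
	if (!io_ok) {                       /* step 2: IO failed */
		pa.refcount--;              /* a real allocator would free the PA */
		return -EIO;
	}
	ret = pa_convert(ino, &pa);         /* steps 3-4 */
	pa.refcount--;
	assert(pa.refcount == 0);
	return ret;
}
```

In this model a failed IO leaves the extent map untouched, and a write over an
already mapped range ends up pointing at newly allocated blocks, which is the
COW-like situation that case 4 covers.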

> This ensures data atomicity, while jbd2—being a COW-like implementation
> itself—ensures metadata atomicity. By leveraging this "delay map"
> mechanism, we can achieve several benefits:
> 
>  * Lightweight, high-performance COW.
>  * High-performance software atomic writes (hardware-independent).
>  * Replacing dioread_nolock, which might otherwise read unexpected zeros.
>  * Replacing ordered data and data journal modes.
>  * Reduced handle hold time, as it's only held during extent tree updates.
>  * Paving the way for snapshot support.
> 
> Of course, COW itself can lead to severe file fragmentation, especially
> in small-scale overwrite scenarios.

I agree the feature can provide very interesting benefits. We have been
pondering something like that for a long time, we just never got to
implementing it. I'd say the immediate benefit is that you can completely get
rid of dioread_nolock as well as the legacy dioread_lock modes, so overall
code complexity should not increase much. We could also mostly get rid of
data=ordered mode use (although not completely - see my discussion with
Zhang over patch 3), which would also be a welcome simplification. These
benefits alone are IMO a good enough reason to have the functionality :),
even without COW, atomic writes and other fancy stuff.

I don't see how you want to get rid of data=journal mode - perhaps that's
related to the COW functionality?

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR