linux-kernel - Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path

Message-ID: <9666679c-c9f7-435c-8b67-c67c2f0c19ab@huawei.com>
Date: Wed, 4 Feb 2026 09:59:36 +0800
From: Baokun Li <libaokun1@...wei.com>
To: Theodore Tso <tytso@....edu>, Zhang Yi <yi.zhang@...weicloud.com>
CC: Christoph Hellwig <hch@...radead.org>, <linux-ext4@...r.kernel.org>,
	<linux-fsdevel@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<adilger.kernel@...ger.ca>, <jack@...e.cz>, <ojaswin@...ux.ibm.com>,
	<ritesh.list@...il.com>, <djwong@...nel.org>, Zhang Yi <yi.zhang@...wei.com>,
	<yizhang089@...il.com>, <yangerkun@...wei.com>,
	<yukuai@...-78bjiv52429oh8qptp.cn-shenzhen.alb.aliyuncs.com>,
	<libaokun9@...il.com>, Baokun Li <libaokun1@...wei.com>
Subject: Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's
 buffered I/O path

On 2026-02-03 21:14, Theodore Tso wrote:
> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
>> This means that the ordered journal mode is no longer used in ext4
>> under the iomap infrastructure.  The main reason is that iomap
>> processes each folio one by one during writeback. It first holds the
>> folio lock and then starts a transaction to create the block mapping.
>> If we still use the ordered mode, we need to perform writeback in
>> the logging process, which may require initiating a new transaction,
>> potentially leading to deadlock issues. In addition, ordered journal
>> mode indeed has many synchronization dependencies, which increase
>> the risk of deadlocks, and I believe this is one of the reasons why
>> ext4_do_writepages() is implemented in such a complicated manner.
>> Therefore, I think we need to give up using the ordered data mode.
>>
>> Currently, there are three scenarios where the ordered mode is used:
>> 1) append write,
>> 2) partial block truncate down, and
>> 3) online defragmentation.
>>
>> For append write, we can always allocate unwritten blocks to avoid
>> using the ordered journal mode.
> This is going to be a pretty severe performance regression, since it
> means that we will be doubling the journal load for append writes.
> What we really need to do here is to first write out the data blocks,
> and then only start the transaction handle to modify the data blocks
> *after* the data blocks have been written (to heretofore unused
> blocks that were just allocated).  It means inverting the order in
> which we write data blocks for the append write case, and in fact it
> will improve fsync() performance since we won't be gating writing the
> commit block on the data blocks getting written out in the append
> write case.

I have some local demo patches doing something similar, and I think this
work could be decoupled from Yi's patch set.

Since inode preallocation (PA) maintains physical block occupancy with a
logical-to-physical mapping, and ensures on-disk data consistency after
power failure, it is an excellent location for recording temporary
occupancy. Furthermore, since inode PA often allocates more blocks than
requested, it can also help reduce file fragmentation.

The specific approach is as follows:

1. Allocate only the PA during block allocation without inserting it into
   the extent status tree. Return the PA to the caller and increment its
   refcount to prevent it from being discarded.

2. Issue IOs to the blocks within the inode PA. If IO fails, release the
   refcount and return -EIO. If successful, proceed to the next step.

3. Start a handle upon successful IO completion to convert the inode PA to
   extents. Release the refcount and update the extent tree.

4. If a corresponding extent already exists, we’ll need to punch holes to
   release the old extent before inserting the new one.
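
The four steps above can be modeled roughly as follows. This is a
userspace sketch only, not the demo patches mentioned earlier and not
ext4 code: every demo_* name is hypothetical, the inode holds a single
extent for brevity, the "IO" is just a success flag, and the journal
handle is represented only by a comment.

```c
/* Userspace model of the "delay map" flow; no name here is an ext4 API. */
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for an inode preallocation (PA). */
struct demo_pa {
	unsigned long lblk;	/* first logical block */
	unsigned long pblk;	/* first physical block */
	unsigned int len;	/* number of reserved blocks */
	int refcount;		/* > 0: the PA must not be discarded */
};

/* Hypothetical stand-in for one mapped extent. */
struct demo_extent {
	unsigned long lblk, pblk;
	unsigned int len;
	bool valid;
};

struct demo_inode {
	struct demo_extent extent;	/* single extent, for brevity */
	unsigned long next_free_pblk;	/* trivial bump allocator */
	unsigned long freed_blocks;	/* blocks released by hole punching */
};

/* Step 1: reserve blocks in a PA and pin it; nothing is mapped yet. */
static void demo_alloc_pa(struct demo_inode *inode, struct demo_pa *pa,
			  unsigned long lblk, unsigned int len)
{
	pa->lblk = lblk;
	pa->pblk = inode->next_free_pblk;
	pa->len = len;
	pa->refcount = 1;
	inode->next_free_pblk += len;
}

/* Steps 3 and 4: in the kernel this part would run under a handle. */
static void demo_convert_pa(struct demo_inode *inode, struct demo_pa *pa)
{
	struct demo_extent *old = &inode->extent;

	/*
	 * Step 4: an overlapping extent is released first. Simplification:
	 * the whole old extent is dropped, not just the overlapping range.
	 */
	if (old->valid && old->lblk < pa->lblk + pa->len &&
	    pa->lblk < old->lblk + old->len) {
		inode->freed_blocks += old->len;
		old->valid = false;
	}

	/* Step 3: the PA becomes the visible mapping, then is unpinned. */
	inode->extent = (struct demo_extent){
		.lblk = pa->lblk, .pblk = pa->pblk, .len = pa->len,
		.valid = true,
	};
	pa->refcount--;
}

/* Steps 1-4 for one write; io_ok simulates the outcome of step 2. */
static int demo_delayed_map_write(struct demo_inode *inode,
				  unsigned long lblk, unsigned int len,
				  bool io_ok)
{
	struct demo_pa pa;

	demo_alloc_pa(inode, &pa, lblk, len);

	/* Step 2: data is written into the PA blocks before any mapping. */
	if (!io_ok) {
		pa.refcount--;	/* unpin; the old mapping is untouched */
		assert(pa.refcount == 0);
		return -EIO;
	}

	demo_convert_pa(inode, &pa);
	assert(pa.refcount == 0);
	return 0;
}
```

In this model a failed write leaves the existing mapping exactly as it
was, and a successful overwrite switches the mapping to the new blocks
in one step, which is the property the steps above rely on.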

This ensures data atomicity, while jbd2—being a COW-like implementation
itself—ensures metadata atomicity. By leveraging this "delay map"
mechanism, we can achieve several benefits:

 * Lightweight, high-performance COW.
 * High-performance software atomic writes (hardware-independent).
 * Replacing dioread_nolock, which might otherwise read unexpected zeros.
 * Replacing ordered data and data journal modes.
 * Reduced handle hold time, as it's only held during extent tree updates.
 * Paving the way for snapshot support.

Of course, COW itself can lead to severe file fragmentation, especially
in small-scale overwrite scenarios.

Perhaps I’ve overlooked something. What are your thoughts?

Regards,
Baokun