linux-kernel - Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's buffered I/O path

Message-ID: <4a210be6-eced-4a47-a54b-3f2bc3f3bfbf@huawei.com>
Date: Thu, 5 Feb 2026 10:55:59 +0800
From: Baokun Li <libaokun1@...wei.com>
To: Jan Kara <jack@...e.cz>
CC: Theodore Tso <tytso@....edu>, Zhang Yi <yi.zhang@...weicloud.com>,
	Christoph Hellwig <hch@...radead.org>, <linux-ext4@...r.kernel.org>,
	<linux-fsdevel@...r.kernel.org>, <linux-kernel@...r.kernel.org>,
	<adilger.kernel@...ger.ca>, <ojaswin@...ux.ibm.com>, <ritesh.list@...il.com>,
	<djwong@...nel.org>, Zhang Yi <yi.zhang@...wei.com>, <yizhang089@...il.com>,
	<yangerkun@...wei.com>,
	<yukuai@...-78bjiv52429oh8qptp.cn-shenzhen.alb.aliyuncs.com>,
	<libaokun9@...il.com>, Baokun Li <libaokun1@...wei.com>
Subject: Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's
 buffered I/O path

On 2026-02-04 22:23, Jan Kara wrote:
> On Wed 04-02-26 09:59:36, Baokun Li wrote:
>> On 2026-02-03 21:14, Theodore Tso wrote:
>>> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
>>>> This means that the ordered journal mode is no longer used in ext4
>>>> under the iomap infrastructure.  The main reason is that iomap
>>>> processes each folio one by one during writeback. It first holds the
>>>> folio lock and then starts a transaction to create the block mapping.
>>>> If we still use the ordered mode, we need to perform writeback in
>>>> the logging process, which may require initiating a new transaction,
>>>> potentially leading to deadlock issues. In addition, ordered journal
>>>> mode indeed has many synchronization dependencies, which increase
>>>> the risk of deadlocks, and I believe this is one of the reasons why
>>>> ext4_do_writepages() is implemented in such a complicated manner.
>>>> Therefore, I think we need to give up using the ordered data mode.
>>>>
>>>> Currently, there are three scenarios where the ordered mode is used:
>>>> 1) append write,
>>>> 2) partial block truncate down, and
>>>> 3) online defragmentation.
>>>>
>>>> For append write, we can always allocate unwritten blocks to avoid
>>>> using the ordered journal mode.
>>> This is going to be a pretty severe performance regression, since it
>>> means that we will be doubling the journal load for append writes.
>>> What we really need to do here is to first write out the data blocks,
>>> and then only start the transaction handle to modify the data blocks
>>> *after* the data blocks have been written (to heretofore unused
>>> blocks that were just allocated).  It means inverting the order in
>>> which we write data blocks for the append write case, and in fact it
>>> will improve fsync() performance since we won't be gating writing the
>>> commit block on the data blocks getting written out in the append
>>> write case.
>> I have some local demo patches doing something similar, and I think this
>> work could be decoupled from Yi's patch set.
>>
>> Since inode preallocation (PA) maintains physical block occupancy with a
>> logical-to-physical mapping, and ensures on-disk data consistency after
>> power failure, it is an excellent location for recording temporary
>> occupancy. Furthermore, since inode PA often allocates more blocks than
>> requested, it can also help reduce file fragmentation.
>>
>> The specific approach is as follows:
>>
>> 1. Allocate only the PA during block allocation without inserting it into
>>    the extent status tree. Return the PA to the caller and increment its
>>    refcount to prevent it from being discarded.
>>
>> 2. Issue IOs to the blocks within the inode PA. If IO fails, release the
>>    refcount and return -EIO. If successful, proceed to the next step.
>>
>> 3. Start a handle upon successful IO completion to convert the inode PA to
>>    extents. Release the refcount and update the extent tree.
>>
>> 4. If a corresponding extent already exists, we’ll need to punch holes to
>>    release the old extent before inserting the new one.
> Sounds good. Just to check I understand correctly: case 4 would happen only
> if you really try to do something like COW with this? Normally you'd just
> use the already present blocks and write contents into them?

Yes, case 4 only needs to be considered when implementing COW.
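
To make the four steps quoted above concrete, here is a rough user-space
toy model in Python. It is only a sketch under stated assumptions: none of
these names (Prealloc, Inode, delay_map_write, first_free, io_ok) exist in
ext4, the "disk" is a plain dict, and the handle is just a counter. It is
meant to show the ordering (pin the PA, write data, then touch metadata),
not the real allocator or jbd2 interfaces.

```python
# Toy model of the "delay map" ordering; NOT ext4 code.
# Every name below is hypothetical and exists only for illustration.
from dataclasses import dataclass, field

EIO = 5


@dataclass
class Prealloc:
    """Stand-in for an inode PA: a logical->physical range plus a refcount."""
    logical: int
    physical: int
    length: int
    refcount: int = 0


@dataclass
class Inode:
    extents: dict = field(default_factory=dict)  # logical block -> physical block
    freed: list = field(default_factory=list)    # old blocks released in step 4
    handles_started: int = 0                     # how many handles were started


def delay_map_write(inode, disk, logical, data, first_free, io_ok=True):
    """Write the blocks in `data` at `logical`; return 0 or -EIO."""
    # Step 1: allocate only the PA and pin it; the extent map is untouched.
    pa = Prealloc(logical, first_free, len(data))
    pa.refcount += 1

    # Step 2: issue IO to the blocks covered by the PA.
    if not io_ok:
        pa.refcount -= 1   # drop the pin so the PA can be discarded
        return -EIO        # no handle was started, no metadata changed
    for i, block in enumerate(data):
        disk[pa.physical + i] = block

    # Step 3: only after the data is written, start a handle and map it.
    inode.handles_started += 1
    for i in range(pa.length):
        old = inode.extents.get(pa.logical + i)
        if old is not None:
            # Step 4 (COW only): release the old block before remapping.
            inode.freed.append(old)
        inode.extents[pa.logical + i] = pa.physical + i
    pa.refcount -= 1
    return 0
```

In this model a failed IO leaves the extent map and the handle counter
untouched, which is the property that lets the handle be held only for the
extent tree update.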

>
>> This ensures data atomicity, while jbd2—being a COW-like implementation
>> itself—ensures metadata atomicity. By leveraging this "delay map"
>> mechanism, we can achieve several benefits:
>>
>>  * Lightweight, high-performance COW.
>>  * High-performance software atomic writes (hardware-independent).
>>  * Replacing dioread_nolock, which might otherwise read unexpected zeros.
>>  * Replacing ordered data and data journal modes.
>>  * Reduced handle hold time, as it's only held during extent tree updates.
>>  * Paving the way for snapshot support.
>>
>> Of course, COW itself can lead to severe file fragmentation, especially
>> in small-scale overwrite scenarios.
> I agree the feature can provide very interesting benefits and we were
> pondering about something like that for a long time, just never got to
> implementing it. I'd say the immediate benefits are you can completely get
> rid of dioread_nolock as well as the legacy dioread_lock modes so overall
> code complexity should not increase much. We could also mostly get rid of
> data=ordered mode use (although not completely - see my discussion with
> Zhang over patch 3) which would also be a welcome simplification. These
> benefits alone are IMO a good enough reason to have the functionality :).
> Even without COW, atomic writes and other fancy stuff.

Glad you liked the 'delay map' concept (naming suggestions are welcome!).

With delay-map in place, implementing COW only requires handling overwrite
scenarios, and software atomic writes can be achieved by enabling atomic
delay-maps across multiple PAs.

I expect to send out a minimal RFC version for discussion in a few weeks.

I will share some additional thoughts regarding EOF blocks and
data=ordered mode in patch 3.

Thanks for your feedback!

>
> I don't see how you want to get rid of data=journal mode - perhaps that's
> related to the COW functionality?
>
> 								Honza

Yes. The only real advantage of data=journal mode over data=ordered is
its guarantee of data atomicity for overwrites.

If we can achieve this through COW-based software atomic writes, we can
move away from the performance-heavy data=journal mode.
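
As a rough illustration of the overwrite atomicity argument, the following
Python toy model stages new data for several blocks in unused space and
switches the mappings in one step. It is a sketch under assumptions, not a
design: atomic_overwrite, first_free and fail_at are hypothetical names, the
dicts stand in for the disk and the extent tree, and the single
`extents.update()` call stands in for one journal transaction.

```python
# Toy model of a COW-style atomic overwrite; NOT ext4 code.
# All names are hypothetical and exist only for illustration.

def atomic_overwrite(extents, disk, updates, first_free, fail_at=None):
    """Overwrite several logical blocks all-or-nothing.

    updates: {logical_block: new_data}
    fail_at: index of the write at which to simulate an IO error or crash.
    Returns True if the new mappings were committed, False otherwise.
    """
    staged = {}          # logical -> new physical block (the pinned "PAs")
    phys = first_free
    for n, (logical, data) in enumerate(sorted(updates.items())):
        if fail_at == n:
            # Failure mid-write: nothing was remapped, so readers still
            # see the old data for every block.
            return False
        disk[phys] = data        # write into a block no extent refers to
        staged[logical] = phys
        phys += 1

    # All data is on "disk"; switch every mapping in one metadata update.
    extents.update(staged)
    return True
```

A failure at any point before the final update leaves every block readable
with its old contents; after the update every block shows the new contents.
There is no mixed state, and the data itself never goes through the journal.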


Cheers,
Baokun