Message-ID: <4efb3689-c7e0-4f7f-b557-aead4c628851@huawei.com>
Date: Thu, 5 Feb 2026 11:04:23 +0800
From: Baokun Li <libaokun1@...wei.com>
To: Zhang Yi <yi.zhang@...weicloud.com>, Jan Kara <jack@...e.cz>
CC: Theodore Tso <tytso@....edu>, Christoph Hellwig <hch@...radead.org>,
	<linux-ext4@...r.kernel.org>, <linux-fsdevel@...r.kernel.org>,
	<linux-kernel@...r.kernel.org>, <adilger.kernel@...ger.ca>,
	<ojaswin@...ux.ibm.com>, <ritesh.list@...il.com>, <djwong@...nel.org>, Zhang
 Yi <yi.zhang@...wei.com>, <yizhang089@...il.com>, <yangerkun@...wei.com>,
	<yukuai@...-78bjiv52429oh8qptp.cn-shenzhen.alb.aliyuncs.com>,
	<libaokun9@...il.com>, Baokun Li <libaokun1@...wei.com>
Subject: Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's
 buffered I/O path

On 2026-02-05 10:06, Zhang Yi wrote:
> On 2/4/2026 10:23 PM, Jan Kara wrote:
>> On Wed 04-02-26 09:59:36, Baokun Li wrote:
>>> On 2026-02-03 21:14, Theodore Tso wrote:
>>>> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
>>>>> This means that the ordered journal mode is no longer used in ext4
>>>>> under the iomap infrastructure.  The main reason is that iomap
>>>>> processes each folio one by one during writeback. It first holds the
>>>>> folio lock and then starts a transaction to create the block mapping.
>>>>> If we still use the ordered mode, we would need to perform writeback
>>>>> during the journal commit process, which may require starting a new
>>>>> transaction, potentially leading to deadlock issues. In addition, ordered journal
>>>>> mode indeed has many synchronization dependencies, which increase
>>>>> the risk of deadlocks, and I believe this is one of the reasons why
>>>>> ext4_do_writepages() is implemented in such a complicated manner.
>>>>> Therefore, I think we need to give up using the ordered data mode.
>>>>>
>>>>> Currently, there are three scenarios where the ordered mode is used:
>>>>> 1) append write,
>>>>> 2) partial block truncate down, and
>>>>> 3) online defragmentation.
>>>>>
>>>>> For append write, we can always allocate unwritten blocks to avoid
>>>>> using the ordered journal mode.
>>>> This is going to be a pretty severe performance regression, since it
>>>> means that we will be doubling the journal load for append writes.
>>>> What we really need to do here is to first write out the data blocks,
>>>> and then only start the transaction handle to modify the metadata
>>>> *after* the data blocks have been written (to heretofore-unused
>>>> blocks that were just allocated).  It means inverting the order in
>>>> which we write data blocks for the append write case, and in fact it
>>>> will improve fsync() performance since we won't be gating writing the
>>>> commit block on the data blocks getting written out in the append
>>>> write case.
>>> I have some local demo patches doing something similar, and I think this
>>> work could be decoupled from Yi's patch set.
>>>
>>> Since inode preallocation (PA) maintains physical block occupancy with a
>>> logical-to-physical mapping, and ensures on-disk data consistency after
>>> power failure, it is an excellent location for recording temporary
>>> occupancy. Furthermore, since inode PA often allocates more blocks than
>>> requested, it can also help reduce file fragmentation.
>>>
>>> The specific approach is as follows:
>>>
>>> 1. Allocate only the PA during block allocation without inserting it into
>>>    the extent status tree. Return the PA to the caller and increment its
>>>    refcount to prevent it from being discarded.
>>>
>>> 2. Issue IOs to the blocks within the inode PA. If IO fails, release the
>>>    refcount and return -EIO. If successful, proceed to the next step.
>>>
>>> 3. Start a handle upon successful IO completion to convert the inode PA to
>>>    extents. Release the refcount and update the extent tree.
>>>
>>> 4. If a corresponding extent already exists, we’ll need to punch holes to
>>>    release the old extent before inserting the new one.
>> Sounds good. Just if I understand correctly case 4 would happen only if you
>> really try to do something like COW with this? Normally you'd just use the
>> already present blocks and write contents into them?
>>
>>> This ensures data atomicity, while jbd2—being a COW-like implementation
>>> itself—ensures metadata atomicity. By leveraging this "delay map"
>>> mechanism, we can achieve several benefits:
>>>
>>>  * Lightweight, high-performance COW.
>>>  * High-performance software atomic writes (hardware-independent).
>>>  * Replacing dioread_nolock, which might otherwise read unexpected zeros.
>>>  * Replacing ordered data and data journal modes.
>>>  * Reduced handle hold time, as it's only held during extent tree updates.
>>>  * Paving the way for snapshot support.
>>>
>>> Of course, COW itself can lead to severe file fragmentation, especially
>>> in small-scale overwrite scenarios.
>> I agree the feature can provide very interesting benefits and we were
>> pondering about something like that for a long time, just never got to
>> implementing it. I'd say the immediate benefits are you can completely get
>> rid of dioread_nolock as well as the legacy dioread_lock modes so overall
>> code complexity should not increase much. We could also mostly get rid of
>> data=ordered mode use (although not completely - see my discussion with
>> Zhang over patch 3) which would be also welcome simplification. These
> I suppose this feature can also be used to get rid of the data=ordered mode
> use in online defragmentation. With this feature, perhaps we can develop a
> new method of online defragmentation that eliminates the need to pre-allocate
> a donor file.  Instead, we can attempt to allocate as many contiguous blocks
> as possible through PA. If the allocated length is longer than the original
> extent, we can perform the swap and copy the data. Once the copy is complete,
> we can atomically construct a new extent, then release the original blocks
> synchronously or asynchronously, similar to a regular copy-on-write (COW)
> operation. How does this sound?
>
> Regards,
> Yi.

Good idea! This is much more efficient than pre-allocating a donor file
and then swapping extents. While COW can exacerbate fragmentation, it can
also be leveraged for defragmentation.
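For illustration only, the four-step PA write path I quoted above could be
modeled in userspace roughly like this (a toy sketch, not the actual patch
code; struct and function names are all hypothetical):

```c
/* Toy userspace model of the PA-based write path:
 * allocate the PA only, issue IO, and convert the PA to a real
 * extent only after the IO succeeds.  All names are hypothetical. */
#include <assert.h>

struct mock_pa {
	int refcount;	/* pins the PA while IO is in flight */
	int converted;	/* set once the PA becomes an on-disk extent */
};

/* step 1: allocate the PA without inserting it into the extent
 * status tree; the caller's reference prevents it being discarded */
static void pa_alloc(struct mock_pa *pa)
{
	pa->refcount = 1;
	pa->converted = 0;
}

/* steps 2-3: issue IO to the PA's blocks; on failure drop the
 * reference and return -EIO (-5 here), on success convert the PA
 * to an extent and release the reference */
static int pa_write(struct mock_pa *pa, int io_ok)
{
	if (!io_ok) {
		pa->refcount--;	/* failure path of step 2 */
		return -5;	/* -EIO */
	}
	pa->converted = 1;	/* step 3: update the extent tree */
	pa->refcount--;		/* release the caller's reference */
	return 0;
}
```

The point of the model is only the ordering: the extent tree is never
updated before the data IO has completed, so a crash mid-write exposes
either the old mapping or the new one, never a torn state.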

We could monitor the average extent length of files within the kernel and
add those that fall below a certain threshold to a "pending defrag" list.
Defragmentation could then be triggered at an appropriate time. To ensure
the effectiveness of the defrag process, we could also set a minimum
length requirement for inode PAs.
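The threshold check I have in mind is trivial; something like the
following userspace sketch (thresholds and names are made up, just to
show the shape of the heuristic):

```c
/* Toy sketch of the "pending defrag" heuristic: compute a file's
 * average extent length and flag it for defragmentation when the
 * average falls below a minimum.  All names are hypothetical. */
#include <assert.h>

/* average mapped extent length in blocks; 0 if the file has no extents */
static unsigned int avg_extent_len(unsigned int total_blocks,
				   unsigned int nr_extents)
{
	return nr_extents ? total_blocks / nr_extents : 0;
}

/* queue the inode for defrag when the average extent is shorter
 * than min_len blocks */
static int needs_defrag(unsigned int total_blocks,
			unsigned int nr_extents,
			unsigned int min_len)
{
	return avg_extent_len(total_blocks, nr_extents) < min_len;
}
```

A minimum inode PA length would then guarantee that any extent produced
by the defrag pass is itself at least min_len blocks, so a file cannot
re-enter the pending list immediately after being defragmented.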


Cheers,
Baokun

>> benefits alone are IMO a good enough reason to have the functionality :).
>> Even without COW, atomic writes and other fancy stuff.
>>
>> I don't see how you want to get rid of data=journal mode - perhaps that's
>> related to the COW functionality?
>>
>> 								Honza


