Message-ID: <8729b45a-8052-41e3-b6eb-3d884097c670@huaweicloud.com>
Date: Fri, 6 Feb 2026 10:15:50 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Jan Kara <jack@...e.cz>
Cc: Baokun Li <libaokun1@...wei.com>, Theodore Tso <tytso@....edu>,
 Christoph Hellwig <hch@...radead.org>, linux-ext4@...r.kernel.org,
 linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org,
 adilger.kernel@...ger.ca, ojaswin@...ux.ibm.com, ritesh.list@...il.com,
 djwong@...nel.org, Zhang Yi <yi.zhang@...wei.com>, yizhang089@...il.com,
 yangerkun@...wei.com,
 yukuai@...-78bjiv52429oh8qptp.cn-shenzhen.alb.aliyuncs.com,
 libaokun9@...il.com
Subject: Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's
 buffered I/O path

On 2/5/2026 8:58 PM, Jan Kara wrote:
> On Thu 05-02-26 10:06:11, Zhang Yi wrote:
>> On 2/4/2026 10:23 PM, Jan Kara wrote:
>>> On Wed 04-02-26 09:59:36, Baokun Li wrote:
>>>> On 2026-02-03 21:14, Theodore Tso wrote:
>>>>> On Tue, Feb 03, 2026 at 05:18:10PM +0800, Zhang Yi wrote:
>>>>>> This means that the ordered journal mode is no longer used in ext4
>>>>>> under the iomap infrastructure.  The main reason is that iomap
>>>>>> processes each folio one by one during writeback. It first holds the
>>>>>> folio lock and then starts a transaction to create the block mapping.
>>>>>> If we still used the ordered mode, we would need to perform writeback
>>>>>> during the journal commit process, which may require starting a new
>>>>>> transaction, potentially leading to deadlocks. In addition, ordered journal
>>>>>> mode indeed has many synchronization dependencies, which increase
>>>>>> the risk of deadlocks, and I believe this is one of the reasons why
>>>>>> ext4_do_writepages() is implemented in such a complicated manner.
>>>>>> Therefore, I think we need to give up using the ordered data mode.
>>>>>>
>>>>>> Currently, there are three scenarios where the ordered mode is used:
>>>>>> 1) append write,
>>>>>> 2) partial block truncate down, and
>>>>>> 3) online defragmentation.
>>>>>>
>>>>>> For append writes, we can always allocate unwritten blocks to avoid
>>>>>> using the ordered journal mode.
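
For reference, allocating unwritten blocks costs two journal transactions per
append write, which is the extra journal load pointed out below: one
transaction to allocate the extent as unwritten, and a second one after I/O
completion to convert it. A rough sketch of that flow (ext4_map_blocks() and
ext4_convert_unwritten_extents() are the real helpers; the wrapper function,
credits, and error handling are simplified):

static int append_write_unwritten(struct inode *inode,
                                  struct ext4_map_blocks *map,
                                  loff_t offset, ssize_t size)
{
        handle_t *handle;
        int ret, credits = ext4_writepage_trans_blocks(inode);

        /* Transaction 1: allocate the blocks as unwritten so stale data
         * is never exposed if we crash before the data lands. */
        handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
        ret = ext4_map_blocks(handle, inode, map,
                              EXT4_GET_BLOCKS_IO_CREATE_EXT);
        ext4_journal_stop(handle);
        if (ret < 0)
                return ret;

        /* ... submit the data I/O to the newly allocated blocks ... */

        /* Transaction 2: after I/O completion, convert the extent from
         * unwritten to written -- this second transaction is the
         * doubled journal load. */
        handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
        ret = ext4_convert_unwritten_extents(handle, inode, offset, size);
        ext4_journal_stop(handle);
        return ret;
}
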
>>>>> This is going to be a pretty severe performance regression, since it
>>>>> means that we will be doubling the journal load for append writes.
>>>>> What we really need to do here is to first write out the data blocks,
>>>>> and then only start the transaction handle to modify the metadata
>>>>> *after* the data blocks have been written (to heretofore-unused
>>>>> blocks that were just allocated).  It means inverting the order in
>>>>> which we write data blocks for the append write case, and in fact it
>>>>> will improve fsync() performance since we won't be gating writing the
>>>>> commit block on the data blocks getting written out in the append
>>>>> write case.
>>>>
>>>> I have some local demo patches doing something similar, and I think this
>>>> work could be decoupled from Yi's patch set.
>>>>
>>>> Since inode preallocation (PA) maintains physical block occupancy with a
>>>> logical-to-physical mapping and ensures on-disk data consistency after a
>>>> power failure, it is an excellent place for recording temporary
>>>> occupancy. Furthermore, since inode PA often allocates more blocks than
>>>> requested, it can also help reduce file fragmentation.
>>>>
>>>> The specific approach is as follows:
>>>>
>>>> 1. Allocate only the PA during block allocation without inserting it into
>>>>    the extent status tree. Return the PA to the caller and increment its
>>>>    refcount to prevent it from being discarded.
>>>>
>>>> 2. Issue IOs to the blocks within the inode PA. If IO fails, release the
>>>>    refcount and return -EIO. If successful, proceed to the next step.
>>>>
>>>> 3. Start a handle upon successful IO completion to convert the inode PA to
>>>>    extents. Release the refcount and update the extent tree.
>>>>
>>>> 4. If a corresponding extent already exists, we’ll need to punch holes to
>>>>    release the old extent before inserting the new one.
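
If I understand the four steps above correctly, the writeback path would look
roughly like the sketch below. The PA helpers (ext4_mb_get_pa(),
ext4_submit_folio_to_pa(), ext4_pa_replace_extents(), ext4_pa_put()) are
hypothetical names I made up to make the flow concrete; this is not proposed
code:

static int ext4_writeback_folio_with_pa(struct inode *inode,
                                        struct folio *folio)
{
        struct ext4_prealloc_space *pa;
        handle_t *handle;
        int err;

        /* 1. Allocate a PA only, without inserting anything into the
         *    extent status tree; hold a reference so the PA cannot be
         *    discarded underneath us. */
        pa = ext4_mb_get_pa(inode, folio_pos(folio), folio_size(folio));
        if (IS_ERR(pa))
                return PTR_ERR(pa);

        /* 2. Write the folio into the blocks backing the PA; no
         *    transaction handle is held across the I/O. */
        err = ext4_submit_folio_to_pa(inode, folio, pa);
        if (err)
                goto out_put;           /* drop the ref, return -EIO */

        /* 3. Only after the data is on disk, start a short-lived handle
         *    and convert the PA into on-disk extents. */
        handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
                                    ext4_writepage_trans_blocks(inode));
        if (IS_ERR(handle)) {
                err = PTR_ERR(handle);
                goto out_put;
        }

        /* 4. If an extent already covers this range (the COW case),
         *    punch it out before inserting the new one. */
        err = ext4_pa_replace_extents(handle, inode, pa);
        ext4_journal_stop(handle);
out_put:
        ext4_pa_put(pa);
        return err;
}
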
>>>
>>> Sounds good. Just, if I understand correctly, case 4 would happen only if you
>>> really try to do something like COW with this? Normally you'd just use the
>>> already present blocks and write contents into them?
>>>
>>>> This ensures data atomicity, while jbd2—being a COW-like implementation
>>>> itself—ensures metadata atomicity. By leveraging this "delay map"
>>>> mechanism, we can achieve several benefits:
>>>>
>>>>  * Lightweight, high-performance COW.
>>>>  * High-performance software atomic writes (hardware-independent).
>>>>  * Replacing dioread_nolock, which might otherwise read unexpected zeros.
>>>>  * Replacing ordered data and data journal modes.
>>>>  * Reduced handle hold time, as it's only held during extent tree updates.
>>>>  * Paving the way for snapshot support.
>>>>
>>>> Of course, COW itself can lead to severe file fragmentation, especially
>>>> in small-scale overwrite scenarios.
>>>
>>> I agree the feature can provide very interesting benefits and we were
>>> pondering something like that for a long time, just never got to
>>> implementing it. I'd say the immediate benefit is that you can completely get
>>> rid of dioread_nolock as well as the legacy dioread_lock modes so overall
>>> code complexity should not increase much. We could also mostly get rid of
>>> data=ordered mode use (although not completely - see my discussion with
>>> Zhang over patch 3), which would also be a welcome simplification. These
>>
>> I suppose this feature can also be used to get rid of the data=ordered mode
>> use in online defragmentation. With this feature, perhaps we can develop a
>> new method of online defragmentation that eliminates the need to pre-allocate
>> a donor file.  Instead, we can attempt to allocate as many contiguous blocks
>> as possible through PA. If the allocated length is longer than the original
>> extent, we can copy the data into the new blocks. Once the copy is complete,
>> we can atomically construct a new extent, then release the original blocks
>> synchronously or asynchronously, similar to a regular copy-on-write (COW)
>> operation. How does this sound?
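
To make that concrete, the per-extent loop might look roughly like this; as
with the writeback sketch above, the PA/defrag helpers are invented names,
only a sketch of the idea rather than proposed code:

static int ext4_defrag_one_range(struct inode *inode,
                                 ext4_lblk_t lblk, unsigned int len)
{
        struct ext4_prealloc_space *pa;
        handle_t *handle;
        int err;

        /* Try to get one contiguous run at least as long as the
         * original extent; keep the old layout if we cannot. */
        pa = ext4_mb_get_contiguous_pa(inode, len);
        if (IS_ERR(pa))
                return PTR_ERR(pa);

        /* Copy the live data into the new blocks while user I/O to
         * this range is fenced off, as in a regular COW write. */
        err = ext4_copy_range_to_pa(inode, lblk, len, pa);
        if (err)
                goto out_put;

        /* Atomically swap the mapping: insert the new extent, then
         * free the old blocks synchronously or asynchronously. */
        handle = ext4_journal_start(inode, EXT4_HT_MOVE_EXTENTS,
                                    ext4_chunk_trans_blocks(inode, len));
        if (IS_ERR(handle)) {
                err = PTR_ERR(handle);
                goto out_put;
        }
        err = ext4_pa_swap_extent(handle, inode, lblk, len, pa);
        ext4_journal_stop(handle);
out_put:
        ext4_pa_put(pa);
        return err;
}
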
> 
> Well, the reason why defragmentation uses the donor file is that there can
> be a lot of policy in where and how the file is exactly placed (e.g. you
> might want to place multiple files together). It was decided it is too
> complex to implement these policies in the kernel so we've offloaded the
> decision of where the file is placed to userspace. Back then we were
> also considering adding an interface to guide allocation of blocks for a file
> so the userspace defragmenter could prepare a donor file with the desired blocks.

Indeed, it is easier to implement different strategies through donor files.
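
For reference, the donor-file interface is the EXT4_IOC_MOVE_EXT ioctl that
e4defrag drives from userspace. A minimal sketch of how a defragmenter uses
it; like e4defrag, it declares struct move_extent locally to mirror the
kernel's definition in fs/ext4/ext4.h, and error handling is abbreviated:

#include <fcntl.h>
#include <linux/types.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct move_extent {
        __u32 reserved;         /* must be zero */
        __u32 donor_fd;         /* donor file descriptor */
        __u64 orig_start;       /* logical start of original, in blocks */
        __u64 donor_start;      /* logical start of donor, in blocks */
        __u64 len;              /* number of blocks to move */
        __u64 moved_len;        /* out: blocks actually moved */
};
#define EXT4_IOC_MOVE_EXT       _IOWR('f', 15, struct move_extent)

static int defrag_with_donor(int orig_fd, const char *donor_path,
                             __u64 blocks)
{
        struct move_extent me = { .len = blocks };
        /* The donor must live on the same ext4 filesystem and be
         * opened for writing. */
        int donor_fd = open(donor_path, O_WRONLY | O_CREAT, 0600);

        if (donor_fd < 0)
                return -1;
        /* Userspace policy lives here: e.g. fallocate() the donor so
         * the allocator hands back contiguous blocks, or place it
         * near related files. */
        me.donor_fd = donor_fd;
        if (ioctl(orig_fd, EXT4_IOC_MOVE_EXT, &me) < 0) {
                perror("EXT4_IOC_MOVE_EXT");
                close(donor_fd);
                return -1;
        }
        printf("moved %llu blocks\n",
               (unsigned long long)me.moved_len);
        close(donor_fd);
        return 0;
}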

> But then the interest in defragmentation dropped (particularly due to
> advances in flash storage) and so these ideas never materialized.

As I understand it, defragmentation offers two primary benefits:

1. It improves the contiguity of file blocks, thereby enhancing read/write
   performance;
2. It reduces the overhead of the block allocator and the cost of managing
   metadata.

As for the first point, indeed, this role has gradually diminished with the
development of flash memory devices. However, I believe the second point is
still very useful. For example, some of our customers have scenarios
involving large-capacity storage, where data is continuously written in a
cyclic manner. This results in the disk space usage remaining at a high level
for a long time, with a large number of both big and small files. Over time,
as fragmentation increases, the CPU usage of the multi-block allocator
(mballoc) rises significantly. Although this issue can be alleviated to some
extent through optimizations of the mballoc algorithm and the use of other
pre-allocation techniques, we still find online defragmentation to be very
necessary.

> 
> We might rethink the online defragmentation interface but at this point
> I'm not sure we are ready to completely replace the idea of guiding the
> block placement using a donor file...
> 
> 								Honza

Yeah, we can rethink it when we add online defragmentation support for the
iomap path.

Cheers,
Yi.

