Message-ID: <77c14b3e-33f9-4a00-83a4-0467f73a7625@huaweicloud.com>
Date: Tue, 3 Feb 2026 17:18:10 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: Christoph Hellwig <hch@...radead.org>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
jack@...e.cz, ojaswin@...ux.ibm.com, ritesh.list@...il.com,
djwong@...nel.org, Zhang Yi <yi.zhang@...wei.com>, yi.zhang@...weicloud.com,
yizhang089@...il.com, libaokun1@...wei.com, yangerkun@...wei.com,
yukuai@...as.com
Subject: Re: [PATCH -next v2 00/22] ext4: use iomap for regular file's
buffered I/O path
Hi, Christoph!
On 2/3/2026 2:43 PM, Christoph Hellwig wrote:
>> Original Cover (Updated):
>
> This really should always be first. The updates are rather minor
> compared to the overview that the cover letter provides.
>
>> Key notes on the iomap implementations in this series.
>> - Don't use ordered data mode to prevent exposing stale data when
>> performing append write and truncating down.
>
> I can't parse this.
Thank you for looking into this series, and sorry for the lack of
clarity. The reasoning behind these key notes is described in
detail in patches 12-13.
This means that the ordered journal mode is no longer used in ext4
under the iomap infrastructure. The main reason is that iomap
processes each folio one by one during writeback: it first holds the
folio lock and then starts a transaction to create the block mapping.
If we kept using the ordered mode, we would need to perform data
writeback during the journal commit process, which may require
starting a new transaction and can lead to deadlocks. In addition,
the ordered journal mode has many synchronization dependencies,
which increase the risk of deadlocks, and I believe this is one of
the reasons why ext4_do_writepages() is implemented in such a
complicated manner. Therefore, I think we need to give up using the
ordered data mode.
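To illustrate the lock-ordering conflict (simplified pseudocode, not
the actual kernel code; all names here are illustrative):

```
/* iomap writeback: folio lock first, then a transaction */
iomap_writeback_folio(folio):
    folio_lock(folio)
    handle = start_transaction()      /* create the block mapping */
    ...

/* ordered-mode commit: must write data before committing metadata,
 * so it takes folio locks (and may need a new transaction) while
 * the running transaction is committing -> inverted lock ordering,
 * potential deadlock against the writeback path above */
commit_transaction(t):
    for each inode on t's ordered list:
        writeback_data(inode)         /* takes folio locks */
    commit_metadata(t)
```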
Currently, there are three scenarios where the ordered mode is used:
1) append write,
2) partial block truncate down, and
3) online defragmentation.
For append writes, we can always allocate unwritten blocks to avoid
using the ordered journal mode. For a partial-block truncate down,
we can explicitly perform a writeback. The third case is the only
one that is somewhat more complex: it needs the ordered mode to
guarantee the atomicity of data copying and extent exchange when
swapping extents and copying data between two files, preventing
data loss. For performance reasons, we cannot explicitly perform a
writeback for each extent exchange. I have not yet thought of a
simple way to handle this; it will require considering other
solutions when online defragmentation is supported in the future.
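For the first two cases, the handling sketched above would look
roughly like this (illustrative pseudocode, not the actual patches):

```
/* append write: always allocate unwritten extents; they read back
 * as zeros until I/O completion converts them to written, so a
 * crash cannot expose stale data even without ordered mode */
append_write(inode, pos, len):
    map_blocks(inode, pos, len, GET_BLOCKS_UNWRIT)

/* partial-block truncate down: flush the block straddling the new
 * EOF before zeroing its tail and shrinking i_size */
truncate_down(inode, newsize):
    if newsize is not block-aligned:
        write_and_wait_range(inode, block containing newsize)
    zero_partial_block(inode, newsize)
    shrink(inode, newsize)
```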
>
>> - Override dioread_nolock mount option, always allocate unwritten
>> extents for new blocks.
>
> Why do you override it?
There are two reasons:
The first is the previously mentioned one of not using the ordered
journal mode: to prevent exposing stale data after a power failure
that occurs while performing append writes, unwritten extents are
always requested for newly allocated blocks.
The second is writeback performance. When doing writeback, we should
allocate as long an extent as possible on the first call to
->writeback_range(), based on the total writeback length, rather
than mapping each folio individually. Therefore, to avoid a
situation where more blocks are allocated than are actually written
(which could make fsck complain), we cannot directly allocate
written blocks before performing the writeback.
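The intended mapping strategy during writeback can be sketched as
follows (illustrative pseudocode only):

```
writeback_range(wpc, inode, offset, dirty_len):
    if offset is already covered by wpc->cached_map:
        return wpc->cached_map        /* no new transaction needed */
    /* map as much of the remaining writeback length as possible in
     * one call, always as unwritten; extents are converted to
     * written only after the I/O completes, so allocating more than
     * is actually written is harmless */
    wpc->cached_map = map_blocks(inode, offset, wpc->remaining_len,
                                 GET_BLOCKS_UNWRIT)
    return wpc->cached_map
```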
>
>> - When performing write back, don't use reserved journal handle and
>> postponing updating i_disksize until I/O is done.
>
> Again missing the why and the implications.
The reserved journal handle was used to solve deadlock issues in
transaction dependencies when writeback occurred in the ordered
journal mode. This mechanism is no longer necessary once the
ordered mode is not used.
>
>> buffered write
>> ==============
>>
>> buffer_head:
>> bs write cache uncached write
>> 1k 423 MiB/s 36.3 MiB/s
>> 4k 1067 MiB/s 58.4 MiB/s
>> 64k 4321 MiB/s 869 MiB/s
>> 1M 4640 MiB/s 3158 MiB/s
>>
>> iomap:
>> bs write cache uncached write
>> 1k 403 MiB/s 57 MiB/s
>> 4k 1093 MiB/s 61 MiB/s
>> 64k 6488 MiB/s 1206 MiB/s
>> 1M 7378 MiB/s 4818 MiB/s
>
> This would read better if you actually compated buffered_head
> vs iomap side by side.
>
> What is the bs? The read unit size? I guess not the file system
> block size as some of the values are too large for that.
The 'bs' is the read/write unit size; the file system block size is
the default 4KB.
>
> Looks like iomap is faster, often much faster except for the
> 1k cached case, where it is slightly slower. Do you have
> any idea why?
I examined the on-CPU flame graph. I think the main reason is how
the buffer_head write path checks the folio and buffer_head state:
since the uptodate flag is saved in the buffer_head structure on
the first 1KB write to each 4KB folio, it does not need to get
blocks for the remaining three writes. However, the iomap
infrastructure always calls ->iomap_begin() to acquire the mapping
information for each 1KB write. Although the first call to
->iomap_begin() has already allocated the block extent, subsequent
calls still incur some overhead from synchronization operations
such as locking. The smaller the unit size, the greater the impact,
and the impact is larger for pure cache writes than for uncached
writes.
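The difference shows up in the shape of the two write loops
(simplified pseudocode, not the actual kernel code):

```
/* buffer_head: mapping state is cached per buffer, so a 1k write
 * into an already-mapped 4k folio skips the get-block step */
for each 1k write:
    if not buffer_mapped(bh):
        get_block(...)                /* only the first write per folio */
    copy_from_user(...)

/* iomap: every iteration calls ->iomap_begin(), which takes the
 * mapping locks even when the extent is already allocated */
for each 1k write:
    ops->iomap_begin(...)             /* locking overhead every time */
    copy_from_user(...)
    ops->iomap_end(...)
```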
>
>> buffered read
>> =============
>>
>> buffer_head:
>> bs read hole read cache read data
>> 1k 635 MiB/s 661 MiB/s 605 MiB/s
>> 4k 1987 MiB/s 2128 MiB/s 1761 MiB/s
>> 64k 6068 MiB/s 9472 MiB/s 4475 MiB/s
>> 1M 5471 MiB/s 8657 MiB/s 4405 MiB/s
>>
>> iomap:
>> bs read hole read cache read data
>> 1k 643 MiB/s 653 MiB/s 602 MiB/s
>> 4k 2075 MiB/s 2159 MiB/s 1716 MiB/s
>> 64k 6267 MiB/s 9545MiB/s 4451 MiB/s
>> 1M 6072 MiB/s 9191MiB/s 4467 MiB/s
>
> What is read cache vs read data here?
>
The 'read cache' case means that pre_read is set to 1 during the
fio tests, so fio reads cached data. In contrast, for 'read data'
pre_read is set to 0, so it always reads data directly from the
disk.
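For reference, the two cases could be reproduced with fio
invocations along these lines (illustrative only; these are not the
exact job options behind the numbers above):

```
# 'read cache': pre_read=1 loads the file into the page cache
# before the measured run, so reads are served from cache
fio --name=cached --rw=read --bs=1k --size=1G --pre_read=1

# 'read data': pre_read=0, and invalidate=1 (fio's default) drops
# the file's cached pages first, so reads come from the disk
fio --name=uncached --rw=read --bs=1k --size=1G --pre_read=0 --invalidate=1
```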
Thanks,
Yi.
> Otherwise same comments as for the write case.
>