Message-ID: <3c01efe6-007a-4422-ad79-0bad3af281b1@huaweicloud.com>
Date: Thu, 24 Oct 2024 15:44:03 +0800
From: Zhang Yi <yi.zhang@...weicloud.com>
To: sedat.dilek@...il.com
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
jack@...e.cz, ritesh.list@...il.com, hch@...radead.org, djwong@...nel.org,
david@...morbit.com, zokeefe@...gle.com, yi.zhang@...wei.com,
chengzhihao1@...wei.com, yukuai3@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH 00/27] ext4: use iomap for regular file's buffered I/O
path and enable large folio

On 2024/10/23 20:13, Sedat Dilek wrote:
> On Tue, Oct 22, 2024 at 11:22 AM Zhang Yi <yi.zhang@...weicloud.com> wrote:
>>
>> On 2024/10/22 14:59, Sedat Dilek wrote:
>>> On Tue, Oct 22, 2024 at 5:13 AM Zhang Yi <yi.zhang@...weicloud.com> wrote:
>>>>
>>>> From: Zhang Yi <yi.zhang@...wei.com>
>>>>
>>>> Hello!
>>>>
>>>> This patch series is the latest version based on my previous RFC
>>>> series[1], which converts the buffered I/O path of ext4 regular files to
>>>> iomap and enables large folios. After several months of work, almost all
>>>> preparatory changes have been upstreamed, thanks a lot for the review
>>>> and comments from Jan, Dave, Christoph, Darrick and Ritesh. Now it is
>>>> time for the main implementation of this conversion.
>>>>
>>>> This series is the main part of the buffered I/O iomap conversion. It is
>>>> based on 6.12-rc4, and its code context also depends on my other
>>>> cleanup series[1] (I've put that in this series so we can merge it
>>>> directly). It fixes all minor bugs found in my previous RFC v4 series.
>>>> Additionally, I've updated the change logs in each patch and included
>>>> some code modifications following Dave's suggestions. This series
>>>> implements the core iomap APIs on ext4 and introduces a mount option
>>>> called "buffered_iomap" to enable the iomap buffered I/O path. We
>>>> already support the default features, default mount options and the
>>>> bigalloc feature. However, we do not yet support online defragmentation,
>>>> inline data, fs-verity, fscrypt, ext3, and data=journal mode; ext4 will
>>>> fall back to the buffer_head I/O path automatically if you use those
>>>> features and options. Some of these features should be supported
>>>> gradually in the near future.
>>>>
>>>> Most of the implementation resembles the original buffer_head path;
>>>> however, there are four key differences.
>>>>
>>>> 1. The first difference is the block allocation in the writeback path. The
>>>> iomap framework will invoke ->map_blocks() at least once for each dirty
>>>> folio. To ensure optimal writeback performance, we aim to allocate a
>>>> range of delalloc blocks that is as long as possible within the
>>>> writeback length for each invocation. In certain situations, we may
>>>> allocate a range of blocks that exceeds the amount we will actually
>>>> write back. Therefore,
>>>> 1) we cannot allocate a written extent for those blocks because it may
>>>> expose stale data in such short write cases. Instead, we should
>>>> allocate an unwritten extent, which means we must always enable the
>>>> dioread_nolock option. This change could also bring many other
>>>> benefits.
>>>> 2) We should postpone updating the 'i_disksize' until the end of the I/O
>>>> process, based on the actual written length. This approach can also
>>>> prevent the exposure of zero data, which may occur if there is a
>>>> power failure during an append write.
>>>> 3) We do not need to pre-split extents during writeback; we can
>>>> postpone this task until the end of the I/O process, while converting
>>>> unwritten extents.
>>>>
>>>> 2. The second difference is that since we always allocate unwritten space
>>>> for new blocks, there is no risk of exposing stale data. As a result,
>>>> we do not need to order the data, which allows us to disable the
>>>> data=ordered mode. Consequently, we also do not require the reserved
>>>> handle when converting the unwritten extent in the final I/O worker,
>>>> we can directly start with the normal handle.
>>>>
>>>> Series details:
>>>>
>>>> Patch 1-10 is just another series of mine that refactors the fallocate
>>>> functions[1]. This series relies on the code context of that but has no
>>>> logical dependencies. I put this here just for easy access and merge.
>>>>
>>>> Patch 11-21 implement the iomap buffered read/write path, dirty folio
>>>> write back path and mmap path for ext4 regular file.
>>>>
>>>> Patch 22-23 disable the unsupported online-defragmentation function and
>>>> disable the changing of the inode journal flag to data=journal mode.
>>>> Please look at the following patch for details.
>>>>
>>>> Patch 24-27 introduce the "buffered_iomap" mount option (not enabled by
>>>> default now) to partially enable the iomap buffered I/O path and also
>>>> enable large folio.
>>>>
>>>>
>>>> About performance:
>>>>
>>>> Fio tests with psync on my machine with Intel Xeon Gold 6240 CPU with
>>>> 400GB system ram, 200GB ramdisk and 4TB nvme ssd disk.
>>>>
>>>> fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
>>>> -numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
>>>> -runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
>>>> -group_reporting -name=$name --output=/tmp/test_log
>>>>
>>>
>>> Hi Zhang Yi,
>>>
>>> can you clarify about the FIO values for the diverse parameters?
>>>
>>
>> Hi Sedat,
>>
>> Sure, the test I present here is a simple single-thread and single-I/O
>> depth case with psync ioengine. Most of the FIO parameters are shown
>> in the tables below.
>>
>
> Hi Zhang Yi,
>
> Thanks for your reply.
>
> Can you share a FIO config file with all (relevant) settings?
> Maybe it is in the below link?
>
> Link: https://packages.debian.org/sid/all/fio-examples/filelist

No, I don't have such a configuration file. I simply wrote two straightforward
scripts to run this test. They serve as a reference, primarily for analyzing
the performance of basic read/write operations on different backends. More
complex cases should be adjusted to the actual circumstances.

I have attached the scripts; feel free to use them. I suggest adjusting the
parameters according to your machine configuration and service I/O model.
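
As a rough illustration of how the fio command line from the cover letter
maps onto concrete parameters, here is a minimal sketch. It is not the
attached scripts: the function name build_fio_cmd, the MNT and SIZE
variables and the loop values are assumptions made for this example, and it
only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch only: prints fio command lines, does not run fio.
# MNT and SIZE are illustrative defaults; adjust them to your machine.
MNT=${MNT:-/mnt}
SIZE=${SIZE:-40G}

# build_fio_cmd RW BS FSYNC OVERWRITE
# iodepth and numjobs are fixed to 1, as in the single-thread test.
build_fio_cmd() {
    rw=$1; bs=$2; sync=$3; overwrite=$4
    echo "fio --directory=$MNT --direct=0 --iodepth=1 --fsync=$sync" \
         "--rw=$rw --numjobs=1 --bs=$bs --ioengine=psync --size=$SIZE" \
         "--runtime=60 --norandommap=0 --fallocate=none" \
         "--overwrite=$overwrite --group_reporting" \
         "--name=${rw}_${bs} --output=/tmp/test_log"
}

# One command per block size used in the tables.
for bs in 4k 64k 1M; do
    build_fio_cmd write "$bs" 0 0
done
```

Printing the command first makes it easy to review the parameters before
starting a run that writes tens of gigabytes.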
>
>> For the rest, the 'iodepth' and 'numjobs' are always set to 1 and the
>> 'size' is 40GB. During the write cache test, I also disable the write
>> back process through:
>>
>> echo 0 > /proc/sys/vm/dirty_writeback_centisecs
>> echo 100 > /proc/sys/vm/dirty_background_ratio
>> echo 100 > /proc/sys/vm/dirty_ratio
>>
>
> ^^ Is this info in one of the patches? If not, can you add this info
> to the next version's cover letter?
>
> Are the patchset and its improvements relevant only for powerful servers,
> or does a notebook user also benefit from this?

The performance improvement mainly comes from the reduced overhead of the
kernel software stack when handling large I/O. The gain therefore shows up
when the CPU is the bottleneck, i.e. the faster the disk, the more pronounced
the benefit, regardless of whether the system is a server or a notebook.
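
To compare your own runs against the tables, the percentage in the last
column can be reproduced from the bandwidth figures. This is a minimal
sketch, assuming that column is the relative change in BW(MiB/s) rounded to
the nearest integer; the helper name pct_change is made up for this example.

```shell
#!/bin/sh
# pct_change BASE NEW: relative change of NEW over BASE, as a signed percentage.
pct_change() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.0f%%\n", (b - a) / a * 100 }'
}

pct_change 2253 2975   # hole, 4K read: prints +32%
pct_change 1323 1425   # nvme, 4K read: prints +8%
```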
Thanks,
Yi.
> If you have benchmark data, please share this.
>
> I can NOT promise if I will give that patchset a try.
>
> Best thanks.
>
> Best regards,
> -Sedat-
>
>> Thanks,
>> Yi.
>>
>>>
>>>> == buffer read ==
>>>>
>>>>                buffer_head        iomap + large folio
>>>> type     bs    IOPS    BW(MiB/s)  IOPS    BW(MiB/s)  change
>>>> -----------------------------------------------------------
>>>> hole     4K    576k    2253       762k    2975       +32%
>>>> hole     64K   48.7k   3043       77.8k   4860       +60%
>>>> hole     1M    2960    2960       4942    4942       +67%
>>>> ramdisk  4K    443k    1732       530k    2069       +19%
>>>> ramdisk  64K   34.5k   2156       45.6k   2850       +32%
>>>> ramdisk  1M    2093    2093       2841    2841       +36%
>>>> nvme     4K    339k    1323       364k    1425       +8%
>>>> nvme     64K   23.6k   1471       25.2k   1574       +7%
>>>> nvme     1M    2012    2012       2153    2153       +7%
>>>>
>>>>
>>>> == buffer write ==
>>>>
>>>>                                            buffer_head        iomap + large folio
>>>> type     Overwrite  Sync  Writeback  bs    IOPS    BW(MiB/s)  IOPS    BW(MiB/s)  change
>>>> ---------------------------------------------------------------------------------------
>>>> cache    N          N     N          4K    417k    1631       440k    1719       +5%
>>>> cache    N          N     N          64K   33.4k   2088       81.5k   5092       +144%
>>>> cache    N          N     N          1M    2143    2143       5716    5716       +167%
>>>> cache    Y          N     N          4K    449k    1755       469k    1834       +5%
>>>> cache    Y          N     N          64K   36.6k   2290       82.3k   5142       +125%
>>>> cache    Y          N     N          1M    2352    2352       5577    5577       +137%
>>>> ramdisk  N          N     Y          4K    365k    1424       354k    1384       -3%
>>>> ramdisk  N          N     Y          64K   31.2k   1950       74.2k   4640       +138%
>>>> ramdisk  N          N     Y          1M    1968    1968       5201    5201       +164%
>>>> ramdisk  N          Y     N          4K    9984    39         12.9k   51         +29%
>>>> ramdisk  N          Y     N          64K   5936    371        8960    560        +51%
>>>> ramdisk  N          Y     N          1M    1050    1050       1835    1835       +75%
>>>> ramdisk  Y          N     Y          4K    411k    1609       443k    1731       +8%
>>>> ramdisk  Y          N     Y          64K   34.1k   2134       77.5k   4844       +127%
>>>> ramdisk  Y          N     Y          1M    2248    2248       5372    5372       +139%
>>>> ramdisk  Y          Y     N          4K    182k    711        186k    730        +3%
>>>> ramdisk  Y          Y     N          64K   18.7k   1170       34.7k   2171       +86%
>>>> ramdisk  Y          Y     N          1M    1229    1229       2269    2269       +85%
>>>> nvme     N          N     Y          4K    373k    1458       387k    1512       +4%
>>>> nvme     N          N     Y          64K   29.2k   1827       70.9k   4431       +143%
>>>> nvme     N          N     Y          1M    1835    1835       4919    4919       +168%
>>>> nvme     N          Y     N          4K    11.7k   46         11.7k   46         0%
>>>> nvme     N          Y     N          64K   6453    403        8661    541        +34%
>>>> nvme     N          Y     N          1M    649     649        1351    1351       +108%
>>>> nvme     Y          N     Y          4K    372k    1456       433k    1693       +16%
>>>> nvme     Y          N     Y          64K   33.0k   2064       74.7k   4669       +126%
>>>> nvme     Y          N     Y          1M    2131    2131       5273    5273       +147%
>>>> nvme     Y          Y     N          4K    56.7k   222        56.4k   220        -1%
>>>> nvme     Y          Y     N          64K   13.4k   840        19.4k   1214       +45%
>>>> nvme     Y          Y     N          1M    714     714        1504    1504       +111%
>>>>
>>>> Thanks,
>>>> Yi.
>>>>
>>>> Major changes since RFC v4:
>>>> - Disable unsupported online defragmentation, do not fall back to
>>>> buffer_head path.
>>>> - Write back data and wait for it while doing a partial block truncate
>>>> down, to fix a stale data problem.
>>>> - Disable the online changing of the inode journal flag to data=journal
>>>> mode.
>>>> - Since iomap can zero out dirty pages with unwritten extent, do not
>>>> write data before zeroing out in ext4_zero_range(), and also do not
>>>> zero partial blocks under a started journal handle.
>>>>
>>>> [1] https://lore.kernel.org/linux-ext4/20241010133333.146793-1-yi.zhang@huawei.com/
>>>>
>>>> ---
>>>> RFC v4: https://lore.kernel.org/linux-ext4/20240410142948.2817554-1-yi.zhang@huaweicloud.com/
>>>> RFC v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
>>>> RFC v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
>>>> RFC v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/
>>>>
>>>>
>>>> Zhang Yi (27):
>>>> ext4: remove writable userspace mappings before truncating page cache
>>>> ext4: don't explicit update times in ext4_fallocate()
>>>> ext4: don't write back data before punch hole in nojournal mode
>>>> ext4: refactor ext4_punch_hole()
>>>> ext4: refactor ext4_zero_range()
>>>> ext4: refactor ext4_collapse_range()
>>>> ext4: refactor ext4_insert_range()
>>>> ext4: factor out ext4_do_fallocate()
>>>> ext4: move out inode_lock into ext4_fallocate()
>>>> ext4: move out common parts into ext4_fallocate()
>>>> ext4: use reserved metadata blocks when splitting extent on endio
>>>> ext4: introduce seq counter for the extent status entry
>>>> ext4: add a new iomap aops for regular file's buffered IO path
>>>> ext4: implement buffered read iomap path
>>>> ext4: implement buffered write iomap path
>>>> ext4: don't order data for inode with EXT4_STATE_BUFFERED_IOMAP
>>>> ext4: implement writeback iomap path
>>>> ext4: implement mmap iomap path
>>>> ext4: do not always order data when partial zeroing out a block
>>>> ext4: do not start handle if unnecessary while partial zeroing out a
>>>> block
>>>> ext4: implement zero_range iomap path
>>>> ext4: disable online defrag when inode using iomap buffered I/O path
>>>> ext4: disable inode journal mode when using iomap buffered I/O path
>>>> ext4: partially enable iomap for the buffered I/O path of regular
>>>> files
>>>> ext4: enable large folio for regular file with iomap buffered I/O path
>>>> ext4: change mount options code style
>>>> ext4: introduce a mount option for iomap buffered I/O path
>>>>
>>>> fs/ext4/ext4.h | 17 +-
>>>> fs/ext4/ext4_jbd2.c | 3 +-
>>>> fs/ext4/ext4_jbd2.h | 8 +
>>>> fs/ext4/extents.c | 568 +++++++++++----------------
>>>> fs/ext4/extents_status.c | 13 +-
>>>> fs/ext4/file.c | 19 +-
>>>> fs/ext4/ialloc.c | 5 +
>>>> fs/ext4/inode.c | 755 ++++++++++++++++++++++++++++++------
>>>> fs/ext4/move_extent.c | 7 +
>>>> fs/ext4/page-io.c | 105 +++++
>>>> fs/ext4/super.c | 185 ++++-----
>>>> include/trace/events/ext4.h | 57 +--
>>>> 12 files changed, 1153 insertions(+), 589 deletions(-)
>>>>
>>>> --
>>>> 2.46.1
>>>>
>>>>
>>
View attachment "ext4_iomap_test_read.sh" of type "text/plain" (2451 bytes)
View attachment "ext4_iomap_test_write.sh" of type "text/plain" (3203 bytes)