Message-ID: <CA+icZUWKGBKOxEaSUJv4up46b0i8=R-RgbnpHEV20HC_210syw@mail.gmail.com>
Date: Tue, 22 Oct 2024 08:59:48 +0200
From: Sedat Dilek <sedat.dilek@...il.com>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org, 
	linux-kernel@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca, 
	jack@...e.cz, ritesh.list@...il.com, hch@...radead.org, djwong@...nel.org, 
	david@...morbit.com, zokeefe@...gle.com, yi.zhang@...wei.com, 
	chengzhihao1@...wei.com, yukuai3@...wei.com, yangerkun@...wei.com
Subject: Re: [PATCH 00/27] ext4: use iomap for regular file's buffered I/O
 path and enable large folio

On Tue, Oct 22, 2024 at 5:13 AM Zhang Yi <yi.zhang@...weicloud.com> wrote:
>
> From: Zhang Yi <yi.zhang@...wei.com>
>
> Hello!
>
> This patch series is the latest version based on my previous RFC
> series[1], which converts the buffered I/O path of ext4 regular files to
> iomap and enables large folios. After several months of work, almost all
> preparatory changes have been upstreamed, thanks a lot for the review
> and comments from Jan, Dave, Christoph, Darrick and Ritesh. Now it is
> time for the main implementation of this conversion.
>
> This series is the main part of the iomap buffered I/O conversion. It is
> based on 6.12-rc4, and the code context also depends on my other
> cleanup series[1] (I've included it in this series so we can merge it
> directly). It fixes all the minor bugs found in my previous RFC v4 series.
> Additionally, I've updated the change logs in each patch and included
> some code modifications following Dave's suggestions. This series implements
> the core iomap APIs on ext4 and introduces a mount option called
> "buffered_iomap" to enable the iomap buffered I/O path. We already
> support the default features, the default mount options and the bigalloc
> feature. However, we do not yet support online defragmentation, inline
> data, fsverity, fscrypt, ext3, or data=journal mode; ext4 will fall
> back to the buffer_head I/O path automatically if you use those features
> or options. Some of these features should be supported gradually in the
> near future.
>
> Most of the implementation resembles the original buffer_head path;
> however, there are four key differences.
>
> 1. The first difference is the block allocation in the writeback path. The
>    iomap framework will invoke ->map_blocks() at least once for each dirty
>    folio. To ensure optimal writeback performance, we aim to allocate a
>    range of delalloc blocks that is as long as possible within the
>    writeback length for each invocation. In certain situations, we may
>    allocate a range of blocks that exceeds the amount we will actually
>    write back. Therefore,
> 1) we cannot allocate a written extent for those blocks because it may
>    expose stale data in such short write cases. Instead, we should
>    allocate an unwritten extent, which means we must always enable the
>    dioread_nolock option. This change could also bring many other
>    benefits.
> 2) We should postpone updating the 'i_disksize' until the end of the I/O
>    process, based on the actual written length. This approach can also
>    prevent the exposure of zero data, which may occur if there is a
>    power failure during an append write.
> 3) We do not need to pre-split extents during writeback; we can
>    postpone this task until the end I/O process, when converting
>    unwritten extents.
>
> 2. The second difference is that, since we always allocate unwritten space
>    for new blocks, there is no risk of exposing stale data. As a result,
>    we do not need to order the data, which allows us to disable the
>    data=ordered mode. Consequently, we also do not require the reserved
>    handle when converting the unwritten extent in the final I/O worker;
>    we can directly start with the normal handle.
>
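The two differences described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyInode`, `map_blocks`, `end_io` and `read_block` are hypothetical names invented for the illustration and do not correspond to ext4's real data structures or to the code in this series. It shows the idea that blocks allocated as unwritten read back as zeros until the I/O completion converts them, and that the on-disk size only grows by what was actually written.

```python
from dataclasses import dataclass, field

BLOCK = 4096  # assumed block size for this toy model


@dataclass
class ToyInode:
    """Toy stand-in for an inode; not ext4's real data structures."""
    disksize: int = 0                          # analogue of i_disksize
    state: dict = field(default_factory=dict)  # block -> "unwritten" | "written"
    disk: dict = field(default_factory=dict)   # block -> bytes on the device

    def map_blocks(self, first, count):
        # Writeback allocation: always allocate as unwritten, possibly
        # covering more blocks than will actually be written back.
        for blk in range(first, first + count):
            self.state.setdefault(blk, "unwritten")

    def end_io(self, first, written_blocks):
        # I/O completion: convert only the blocks whose data reached the
        # device, then extend the on-disk size by the written length.
        for blk in range(first, first + written_blocks):
            self.state[blk] = "written"
        self.disksize = max(self.disksize, (first + written_blocks) * BLOCK)

    def read_block(self, blk):
        # Unwritten or unallocated blocks read as zeros, never as stale data.
        if self.state.get(blk) != "written":
            return bytes(BLOCK)
        return self.disk[blk]


STALE = b"\xde" * BLOCK
inode = ToyInode(disk={blk: STALE for blk in range(8)})  # device holds old data
inode.map_blocks(0, 8)             # one allocation covering 8 blocks
for blk in range(3):               # but only 3 blocks are written back
    inode.disk[blk] = b"A" * BLOCK
inode.end_io(0, 3)
```

In this model, reading block 5 returns zeros rather than the stale device contents, which is why no data ordering is needed to hide stale blocks.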
> Series details:
>
> Patches 1-10 are another series of mine that refactors the fallocate
> functions[1]. This series relies on its code context but has no
> logical dependency on it. I include it here just for easy access and merging.
>
> Patch 11-21 implement the iomap buffered read/write path, dirty folio
> write back path and mmap path for ext4 regular file.
>
> Patch 22-23 disable the unsupported online-defragmentation function and
> disable the changing of the inode journal flag to data=journal mode.
> Please look at the following patch for details.
>
> Patches 24-27 introduce the "buffered_iomap" mount option (not enabled by
> default for now) to partially enable the iomap buffered I/O path and also
> enable large folios.
>
>
> About performance:
>
> fio tests with the psync ioengine on my machine: Intel Xeon Gold 6240 CPU,
> 400GB of system RAM, a 200GB ramdisk and a 4TB NVMe SSD.
>
>  fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync=$sync -rw=$rw \
>      -numjobs=${numjobs} -bs=${bs} -ioengine=psync -size=$size \
>      -runtime=60 -norandommap=0 -fallocate=none -overwrite=$overwrite \
>      -group_reporting -name=$name --output=/tmp/test_log
>

Hi Zhang Yi,

Can you clarify which values you used for the fio parameters
($iodepth, $sync, $rw, $numjobs, $bs, $size, $overwrite)?

Thanks.

BR,
-Sedat-
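
For context on the question above: the result tables in the quoted mail do reveal some of the parameter values, namely the block sizes (4K, 64K, 1M) and the Overwrite and Sync columns. The sketch below enumerates the combinations those columns imply for the write tests. It is an assumption that Y/N map to 1/0 for `-overwrite` and `-fsync`; every parameter the mail does not reveal is deliberately left as an unresolved shell variable.

```python
import itertools

# Values visible in the result tables of the quoted cover letter.
BLOCK_SIZES = ["4K", "64K", "1M"]
YES_NO = {"N": 0, "Y": 1}  # assumption: Y/N in the tables map to 1/0

# Unknown parameters ($iodepth, $rw, $numjobs, $size, $name) are kept as
# literal shell variables because the mail does not state their values.
TEMPLATE = (
    "fio -directory=/mnt -direct=0 -iodepth=$iodepth -fsync={sync} -rw=$rw "
    "-numjobs=$numjobs -bs={bs} -ioengine=psync -size=$size "
    "-runtime=60 -norandommap=0 -fallocate=none -overwrite={overwrite} "
    "-group_reporting -name=$name --output=/tmp/test_log"
)


def write_test_matrix():
    """Yield one fio command line per (bs, overwrite, sync) combination."""
    for bs, overwrite, sync in itertools.product(BLOCK_SIZES, YES_NO, YES_NO):
        yield TEMPLATE.format(
            bs=bs, overwrite=YES_NO[overwrite], sync=YES_NO[sync]
        )


commands = list(write_test_matrix())
for line in commands:
    print(line)
```

This gives twelve command lines per device type, matching the twelve ramdisk and twelve nvme rows of the write table; the remaining variables still need answers from the author.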

>  == buffer read ==
>
>                 buffer_head        iomap + large folio
>  type     bs    IOPS    BW(MiB/s)  IOPS    BW(MiB/s)
>  -------------------------------------------------------
>  hole     4K    576k    2253       762k    2975     +32%
>  hole     64K   48.7k   3043       77.8k   4860     +60%
>  hole     1M    2960    2960       4942    4942     +67%
>  ramdisk  4K    443k    1732       530k    2069     +19%
>  ramdisk  64K   34.5k   2156       45.6k   2850     +32%
>  ramdisk  1M    2093    2093       2841    2841     +36%
>  nvme     4K    339k    1323       364k    1425     +8%
>  nvme     64K   23.6k   1471       25.2k   1574     +7%
>  nvme     1M    2012    2012       2153    2153     +7%
>
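
As a reading aid for the table above, the bandwidth column follows from IOPS multiplied by the block size. The sketch below recomputes it for a few rows, assuming "k" means 1000 operations and that K/M block sizes are binary (KiB/MiB); the helper names are made up for this example. Small deviations from the table are expected because the IOPS values are rounded.

```python
UNITS = {"K": 1024, "M": 1024 ** 2}  # assumption: block sizes are KiB/MiB


def parse_bs(bs):
    """Convert a block size such as '64K' or '1M' into bytes."""
    return int(bs[:-1]) * UNITS[bs[-1]]


def parse_iops(text):
    """Convert an IOPS figure such as '48.7k' or '2960' into a float."""
    if text.endswith("k"):
        return float(text[:-1]) * 1000
    return float(text)


def bandwidth_mib(iops, bs):
    """Bandwidth in MiB/s implied by an IOPS figure and a block size."""
    return parse_iops(iops) * parse_bs(bs) / 2 ** 20


# (type, bs, IOPS, reported BW) taken from the buffer_head columns above.
rows = [
    ("hole", "4K", "576k", 2253),
    ("hole", "64K", "48.7k", 3043),
    ("hole", "1M", "2960", 2960),
    ("nvme", "4K", "339k", 1323),
]
for kind, bs, iops, reported in rows:
    print(kind, bs, round(bandwidth_mib(iops, bs), 1), "vs reported", reported)
```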
>
>  == buffer write ==
>
>                                        buffer_head       iomap + large folio
>  type   Overwrite Sync Writeback  bs   IOPS   BW(MiB/s)  IOPS   BW(MiB/s)
>  ----------------------------------------------------------------------
>  cache      N    N    N    4K     417k    1631    440k    1719   +5%
>  cache      N    N    N    64K    33.4k   2088    81.5k   5092   +144%
>  cache      N    N    N    1M     2143    2143    5716    5716   +167%
>  cache      Y    N    N    4K     449k    1755    469k    1834   +5%
>  cache      Y    N    N    64K    36.6k   2290    82.3k   5142   +125%
>  cache      Y    N    N    1M     2352    2352    5577    5577   +137%
>  ramdisk    N    N    Y    4K     365k    1424    354k    1384   -3%
>  ramdisk    N    N    Y    64K    31.2k   1950    74.2k   4640   +138%
>  ramdisk    N    N    Y    1M     1968    1968    5201    5201   +164%
>  ramdisk    N    Y    N    4K     9984    39      12.9k   51     +29%
>  ramdisk    N    Y    N    64K    5936    371     8960    560    +51%
>  ramdisk    N    Y    N    1M     1050    1050    1835    1835   +75%
>  ramdisk    Y    N    Y    4K     411k    1609    443k    1731   +8%
>  ramdisk    Y    N    Y    64K    34.1k   2134    77.5k   4844   +127%
>  ramdisk    Y    N    Y    1M     2248    2248    5372    5372   +139%
>  ramdisk    Y    Y    N    4K     182k    711     186k    730    +3%
>  ramdisk    Y    Y    N    64K    18.7k   1170    34.7k   2171   +86%
>  ramdisk    Y    Y    N    1M     1229    1229    2269    2269   +85%
>  nvme       N    N    Y    4K     373k    1458    387k    1512   +4%
>  nvme       N    N    Y    64K    29.2k   1827    70.9k   4431   +143%
>  nvme       N    N    Y    1M     1835    1835    4919    4919   +168%
>  nvme       N    Y    N    4K     11.7k   46      11.7k   46      0%
>  nvme       N    Y    N    64K    6453    403     8661    541    +34%
>  nvme       N    Y    N    1M     649     649     1351    1351   +108%
>  nvme       Y    N    Y    4K     372k    1456    433k    1693   +16%
>  nvme       Y    N    Y    64K    33.0k   2064    74.7k   4669   +126%
>  nvme       Y    N    Y    1M     2131    2131    5273    5273   +147%
>  nvme       Y    Y    N    4K     56.7k   222     56.4k   220    -1%
>  nvme       Y    Y    N    64K    13.4k   840     19.4k   1214   +45%
>  nvme       Y    Y    N    1M     714     714     1504    1504   +111%
>
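
The percentage column can be reproduced as the relative change between the buffer_head and iomap figures. The mail does not say whether it was derived from IOPS or from bandwidth; the sketch below uses the bandwidth values, which is an assumption, and rows with very small bandwidth numbers may differ by a point or two because of rounding.

```python
def gain_percent(old, new):
    """Relative change from old to new, rounded to a whole percent."""
    return round((new - old) / old * 100)


# (description, buffer_head BW, iomap BW, reported gain) from the table above.
samples = [
    ("cache   N N N 64K", 2088, 5092, 144),
    ("ramdisk N N Y 4K", 1424, 1384, -3),
    ("nvme    N Y N 4K", 46, 46, 0),
    ("nvme    N N Y 1M", 1835, 4919, 168),
]
for name, old, new, reported in samples:
    print(f"{name}: computed {gain_percent(old, new):+d}%, reported {reported:+d}%")
```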
> Thanks,
> Yi.
>
> Major changes since RFC v4:
>  - Disable unsupported online defragmentation, do not fall back to
>    buffer_head path.
>  - Write back and wait for data while doing a partial block truncate
>    down, to fix a stale data problem.
>  - Disable the online changing of the inode journal flag to data=journal
>    mode.
>  - Since iomap can zero out dirty pages with unwritten extent, do not
>    write data before zeroing out in ext4_zero_range(), and also do not
>    zero partial blocks under a started journal handle.
>
> [1] https://lore.kernel.org/linux-ext4/20241010133333.146793-1-yi.zhang@huawei.com/
>
> ---
> RFC v4: https://lore.kernel.org/linux-ext4/20240410142948.2817554-1-yi.zhang@huaweicloud.com/
> RFC v3: https://lore.kernel.org/linux-ext4/20240127015825.1608160-1-yi.zhang@huaweicloud.com/
> RFC v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
> RFC v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/
>
>
> Zhang Yi (27):
>   ext4: remove writable userspace mappings before truncating page cache
>   ext4: don't explicit update times in ext4_fallocate()
>   ext4: don't write back data before punch hole in nojournal mode
>   ext4: refactor ext4_punch_hole()
>   ext4: refactor ext4_zero_range()
>   ext4: refactor ext4_collapse_range()
>   ext4: refactor ext4_insert_range()
>   ext4: factor out ext4_do_fallocate()
>   ext4: move out inode_lock into ext4_fallocate()
>   ext4: move out common parts into ext4_fallocate()
>   ext4: use reserved metadata blocks when splitting extent on endio
>   ext4: introduce seq counter for the extent status entry
>   ext4: add a new iomap aops for regular file's buffered IO path
>   ext4: implement buffered read iomap path
>   ext4: implement buffered write iomap path
>   ext4: don't order data for inode with EXT4_STATE_BUFFERED_IOMAP
>   ext4: implement writeback iomap path
>   ext4: implement mmap iomap path
>   ext4: do not always order data when partial zeroing out a block
>   ext4: do not start handle if unnecessary while partial zeroing out a
>     block
>   ext4: implement zero_range iomap path
>   ext4: disable online defrag when inode using iomap buffered I/O path
>   ext4: disable inode journal mode when using iomap buffered I/O path
>   ext4: partially enable iomap for the buffered I/O path of regular
>     files
>   ext4: enable large folio for regular file with iomap buffered I/O path
>   ext4: change mount options code style
>   ext4: introduce a mount option for iomap buffered I/O path
>
>  fs/ext4/ext4.h              |  17 +-
>  fs/ext4/ext4_jbd2.c         |   3 +-
>  fs/ext4/ext4_jbd2.h         |   8 +
>  fs/ext4/extents.c           | 568 +++++++++++----------------
>  fs/ext4/extents_status.c    |  13 +-
>  fs/ext4/file.c              |  19 +-
>  fs/ext4/ialloc.c            |   5 +
>  fs/ext4/inode.c             | 755 ++++++++++++++++++++++++++++++------
>  fs/ext4/move_extent.c       |   7 +
>  fs/ext4/page-io.c           | 105 +++++
>  fs/ext4/super.c             | 185 ++++-----
>  include/trace/events/ext4.h |  57 +--
>  12 files changed, 1153 insertions(+), 589 deletions(-)
>
> --
> 2.46.1
>
>
