Message-ID: <20240212061842.GB6180@frogsfrogsfrogs>
Date: Sun, 11 Feb 2024 22:18:42 -0800
From: "Darrick J. Wong" <djwong@...nel.org>
To: Zhang Yi <yi.zhang@...weicloud.com>
Cc: linux-ext4@...r.kernel.org, linux-fsdevel@...r.kernel.org,
	linux-mm@...ck.org, linux-kernel@...r.kernel.org, tytso@....edu,
	adilger.kernel@...ger.ca, jack@...e.cz, ritesh.list@...il.com,
	hch@...radead.org, willy@...radead.org, zokeefe@...gle.com,
	yi.zhang@...wei.com, chengzhihao1@...wei.com, yukuai3@...wei.com,
	wangkefeng.wang@...wei.com
Subject: Re: [RFC PATCH v3 00/26] ext4: use iomap for regular file's buffered
 IO path and enable large folio

On Sat, Jan 27, 2024 at 09:57:59AM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@...wei.com>
> 
> Hello,
> 
> This is the third version of the RFC patch series that converts ext4
> regular files' buffered IO path to iomap and enables large folios. It's
> rebased on 6.7 and Christoph's "map multiple blocks per ->map_blocks in
> iomap writeback" series [1]. I've fixed all the issues found in v2 during
> about 3 weeks of stress tests and fault injection tests. I hope I've
> covered most of the corner cases, and any comments are welcome. :)
> 
> Changes since v2:
>  - Update patch 1-6 to v3 [2].
>  - iomap_zero and iomap_unshare don't need to update i_size or call
>    iomap_write_failed(), so introduce a new helper
>    iomap_write_end_simple() to avoid doing that.
>  - Factor out the ext4_[ext|ind]_map_blocks() parts from
>    ext4_map_blocks() and introduce a new helper
>    ext4_iomap_map_one_extent() to allocate delalloc blocks in
>    writeback, always under i_data_sem in write mode. This prevents the
>    delalloc extents being written back from becoming stale if
>    writeback races with truncate.
>  - Add a lock detection in mapping_clear_large_folios().
> Changes since v1:
>  - Introduce a seq count for iomap buffered write and writeback to
>    protect against races with extent changes, e.g. truncate, mwrite.
>  - Always allocate unwritten extents for new blocks, drop dioread_lock
>    mode, and make no distinctions between dioread_lock and
>    dioread_nolock.
>  - Don't add dirty data range to jinode, drop data=ordered mode, and
>    make no distinctions between data=ordered and data=writeback mode.
>  - Postpone updating i_disksize to endio.
>  - Allow splitting extents and use reserved space in endio.
>  - Instead of reimplementing a new delayed mapping helper
>    ext4_iomap_da_map_blocks() for buffered write, try to reuse
>    ext4_da_map_blocks().
>  - Add support for disabling large folio on active inodes.
>  - Support online defragmentation, make file fall back to buffer_head
>    and disable large folio in ext4_move_extents().
>  - Move ext4_nonda_switch() in advance to prevent deadlock in mwrite.
>  - Add dirty_len and pos trace info to trace_iomap_writepage_map().
>  - Update patch 1-6 to v2.
> 
> This series only supports ext4 with the default features and mount
> options. It doesn't support inline_data, bigalloc, dax, fs_verity,
> fs_crypt or data=journal mode; ext4 would fall back to the buffer_head path

Do you plan to add bigalloc or !extents support as a part 2 patchset?

An ext2 port to iomap has been (vaguely) in the works for a while,
though iirc willy never got the performance to match because iomap
didn't have a mechanism for the caller to tell it "run the IO now even
though you don't have a complete page, because the indirect block is the
next block after the 11th block".
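
For readers unfamiliar with the layout being described, here is a minimal
sketch of the arithmetic behind it. ext2 keeps 12 direct block pointers in
the inode (logical blocks 0..11); everything from logical block 12 onward is
reached through indirect blocks of 32-bit pointers. The helper names and the
4096-byte default block size are assumptions made for illustration, and
nothing here is iomap or ext2 kernel code.

```python
EXT2_NDIR_BLOCKS = 12  # direct pointers in an ext2 inode: logical blocks 0..11


def indirection_depth(lblk, block_size=4096):
    """Return how many indirect blocks must be read to map logical block lblk."""
    ptrs = block_size // 4  # 32-bit block pointers held by one indirect block
    if lblk < EXT2_NDIR_BLOCKS:
        return 0  # direct pointer, no extra read
    lblk -= EXT2_NDIR_BLOCKS
    if lblk < ptrs:
        return 1  # single indirect
    lblk -= ptrs
    if lblk < ptrs * ptrs:
        return 2  # double indirect
    lblk -= ptrs * ptrs
    if lblk < ptrs ** 3:
        return 3  # triple indirect
    raise ValueError("logical block beyond the ext2 mapping limit")


def split_at_depth_change(start, count, block_size=4096):
    """Split [start, start + count) into runs that share one indirection depth.

    A run boundary marks a point where a mapping block has to be read before
    the next data block can be located (hypothetical helper, illustration only).
    """
    if count <= 0:
        return []
    runs = []
    run_start = start
    for lblk in range(start + 1, start + count):
        if indirection_depth(lblk, block_size) != indirection_depth(run_start, block_size):
            runs.append((run_start, lblk - run_start))
            run_start = lblk
    runs.append((run_start, start + count - run_start))
    return runs
```

With 4 KiB blocks, a 16-block read starting at block 0 splits into a
12-block run and a 4-block run, which is the point where the caller would
want the first IO submitted before the indirect block is read.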

--D

> automatically if you enabled these features/options. Although it has
> many limitations now, it can satisfy the requirements of common cases
> and bring a great performance benefit.
> 
> Patch 1-6: this is a preparation series that changes ext4_map_blocks()
> and ext4_set_iomap() to recognize delayed-only extents. I've sent it out
> separately [2].
> 
> Patch 7-8: these are two minor iomap changes. The first one stops
> updating i_size and calling iomap_write_failed() in zero_range; the
> second one is for debugging the iomap writeback path, which I've
> discussed with Christoph [3].
> 
> Patch 9-15: this is another preparation series, including some changes
> for delayed extents. Firstly, it factors buffer_head out of
> ext4_da_map_blocks() and makes it support adding multiple blocks at a
> time. Then it makes the unwritten-to-written extent conversion in endio
> use reserved space, reducing the risk of potential data loss. Finally,
> it introduces a sequence counter for the extent status tree, which is
> useful for iomap buffered write and writeback.
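
As a rough illustration of the sequence-counter idea mentioned above, the
sketch below shows the general pattern only: every change to the extent
tree bumps a counter, a mapping is handed out together with the counter
value seen at lookup time, and the mapping is revalidated against the
current counter before it is used. The class and function names are
hypothetical and do not come from the patches.

```python
class ExtentTree:
    """Toy stand-in for an extent status tree guarded by a sequence counter."""

    def __init__(self):
        self.seq = 0
        self.extents = {}  # logical block -> physical block

    def modify(self, lblk, pblk):
        """Insert or remove (pblk=None) a mapping; every change bumps seq."""
        if pblk is None:
            self.extents.pop(lblk, None)
        else:
            self.extents[lblk] = pblk
        self.seq += 1

    def lookup(self, lblk):
        """Return the mapping together with the seq value it was read under."""
        return self.extents.get(lblk), self.seq


def mapping_is_valid(tree, cached_seq):
    """A cached mapping may only be used if no change happened since lookup."""
    return tree.seq == cached_seq
```

A writer that finds the cached mapping stale (for example after a
concurrent truncate) would look the extent up again instead of using the
old result.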
> 
> Patch 16-22: Implement the buffered IO iomap path for read, write, mmap,
> zero range, truncate and writeback, replacing the current buffer_head
> path. Please look at the following patches for details.
> 
> Patch 23-26: Convert regular files' buffered IO path to iomap, except
> for inline_data, bigalloc, dax, fs_verity, fs_crypt, and data=journal
> mode, and enable large folios. Note that the buffered iomap path
> doesn't support online defrag yet, so we need to fall back to
> buffer_head and disable large folios automatically if the user calls
> EXT4_IOC_MOVE_EXT.
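
The fallback rule described in the cover letter can be summarized as a
small predicate. This is only a sketch of the stated policy: the feature
strings are taken from the text, while the function name, its parameters
and the set-based representation are assumptions made for illustration.

```python
# Features and mount options the cover letter lists as not yet supported.
UNSUPPORTED = frozenset(
    {"inline_data", "bigalloc", "dax", "fs_verity", "fs_crypt", "data=journal"}
)


def uses_iomap_buffered_path(is_regular_file, enabled, defrag_requested=False):
    """Return True if a file would take the iomap path under the stated policy.

    enabled: iterable of feature / mount option names active for the file.
    defrag_requested: True once online defrag (EXT4_IOC_MOVE_EXT) was called,
    which forces the buffer_head path and disables large folios.
    """
    if not is_regular_file:
        return False
    if defrag_requested:
        return False
    return UNSUPPORTED.isdisjoint(enabled)
```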
> 
> About Tests:
>  - kvm-xfstests in auto mode, and about 3 weeks of stress tests and
>    fault injection tests.
>  - Performance tests, shown below.
> 
>    Fio tests with psync on my machine with Intel Xeon Gold 6240 CPU
>    with 400GB system ram, 200GB ramdisk and 1TB nvme ssd disk.
> 
>    == buffer read ==
> 
>                   buffer head        iomap with large folio
>    type     bs    IOPS    BW(MiB/s)  IOPS    BW(MiB/s)
>    ----------------------------------------------------
>    hole     4K    565k    2206       811k    3167
>    hole     64K   45.1k   2820       78.1k   4879
>    hole     1M    2744    2744       4890    4891
>    ramdisk  4K    436k    1703       554k    2163
>    ramdisk  64K   29.6k   1848       44.0k   2747
>    ramdisk  1M    1994    1995       2809    2809
>    nvme     4K    306k    1196       324k    1267
>    nvme     64K   19.3k   1208       24.3k   1517
>    nvme     1M    1694    1694       2256    2256
> 
>    == buffer write ==
> 
>                                        buffer head    ext4_iomap    
>    type   Overwrite Sync Writeback bs  IOPS   BW      IOPS   BW
>    -------------------------------------------------------------
>    cache    N       N    N         4K   395k   1544   415k   1621
>    cache    N       N    N         64K  30.8k  1928   80.1k  5005
>    cache    N       N    N         1M   1963   1963   5641   5642
>    cache    Y       N    N         4K   423k   1652   443k   1730
>    cache    Y       N    N         64K  33.0k  2063   80.8k  5051
>    cache    Y       N    N         1M   2103   2103   5588   5589
>    ramdisk  N       N    Y         4K   362k   1416   307k   1198
>    ramdisk  N       N    Y         64K  22.4k  1399   64.8k  4050
>    ramdisk  N       N    Y         1M   1670   1670   4559   4560
>    ramdisk  N       Y    N         4K   9830   38.4   13.5k  52.8
>    ramdisk  N       Y    N         64K  5834   365    10.1k  629
>    ramdisk  N       Y    N         1M   1011   1011   2064   2064
>    ramdisk  Y       N    Y         4K   397k   1550   409k   1598
>    ramdisk  Y       N    Y         64K  29.2k  1827   73.6k  4597
>    ramdisk  Y       N    Y         1M   1837   1837   4985   4985
>    ramdisk  Y       Y    N         4K   173k   675    182k   710
>    ramdisk  Y       Y    N         64K  17.7k  1109   33.7k  2105
>    ramdisk  Y       Y    N         1M   1128   1129   1790   1791
>    nvme     N       N    Y         4K   298k   1164   290k   1134
>    nvme     N       N    Y         64K  21.5k  1343   57.4k  3590
>    nvme     N       N    Y         1M   1308   1308   3664   3664
>    nvme     N       Y    N         4K   10.7k  41.8   12.0k  46.9
>    nvme     N       Y    N         64K  5962   373    8598   537
>    nvme     N       Y    N         1M   676    677    1417   1418
>    nvme     Y       N    Y         4K   366k   1430   373k   1456
>    nvme     Y       N    Y         64K  26.7k  1670   56.8k  3547
>    nvme     Y       N    Y         1M   1745   1746   3586   3586
>    nvme     Y       Y    N         4K   59.0k  230    61.2k  239
>    nvme     Y       Y    N         64K  13.0k  813    21.0k  1311
>    nvme     Y       Y    N         1M   683    683    1368   1369
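
To help read the tables, here is a short sketch of the arithmetic that
relates the columns: bandwidth in MiB/s is roughly IOPS times the block
size, and the gain of one path over the other is the ratio of the two
bandwidth figures. The helper names are made up for illustration; the
numbers plugged in are taken from the rows above.

```python
def bandwidth_mib_s(iops, block_size_kib):
    """Bandwidth implied by an IOPS figure at a given block size (KiB)."""
    return iops * block_size_kib / 1024


def speedup(before, after):
    """Ratio of the iomap figure to the buffer_head figure."""
    return after / before


# Buffered read, hole, 4K: 565k IOPS should be close to the reported 2206 MiB/s.
read_hole_4k = bandwidth_mib_s(565_000, 4)

# Buffered write, cache, no overwrite, 64K: 1928 -> 5005 MiB/s, about 2.6x.
write_cache_64k = speedup(1928, 5005)

# Buffered write, ramdisk with writeback, no overwrite, 4K: 1416 -> 1198 MiB/s,
# a ratio below 1, i.e. one of the few rows where the iomap path is slower.
write_ramdisk_4k = speedup(1416, 1198)
```

The IOPS values in the tables are rounded, so the recomputed bandwidth
only matches the reported one to within a few MiB/s.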
>  
> TODO
>  - Keep on doing stress tests and fixing.
>  - I will rebase and resend another patch set of mine, "ext4: more
>    accurate metadata reservation for delalloc mount option" [4], later;
>    it's useful for the iomap conversion. After this series, I suppose we
>    could drop ext4_nonda_switch() entirely and prevent the risk of data
>    loss caused by extent splitting.
>  - Support for more features and mount options in the future.
> 
> [1] https://lore.kernel.org/linux-fsdevel/20231207072710.176093-1-hch@lst.de/
> [2] https://lore.kernel.org/linux-ext4/20240105033018.1665752-1-yi.zhang@huaweicloud.com/
> [3] https://lore.kernel.org/linux-fsdevel/20231207150311.GA18830@lst.de/
> [4] https://lore.kernel.org/linux-ext4/20230824092619.1327976-1-yi.zhang@huaweicloud.com/
> 
> Thanks,
> Yi.
> 
> ---
> v2: https://lore.kernel.org/linux-ext4/20240102123918.799062-1-yi.zhang@huaweicloud.com/
> v1: https://lore.kernel.org/linux-ext4/20231123125121.4064694-1-yi.zhang@huaweicloud.com/
> 
> Zhang Yi (26):
>   ext4: refactor ext4_da_map_blocks()
>   ext4: convert to exclusive lock while inserting delalloc extents
>   ext4: correct the hole length returned by ext4_map_blocks()
>   ext4: add a hole extent entry in cache after punch
>   ext4: make ext4_map_blocks() distinguish delalloc only extent
>   ext4: make ext4_set_iomap() recognize IOMAP_DELALLOC map type
>   iomap: don't increase i_size if it's not a write operation
>   iomap: add pos and dirty_len into trace_iomap_writepage_map
>   ext4: allow inserting delalloc extents with multi-blocks
>   ext4: correct delalloc extent length
>   ext4: also mark extent as delalloc if it's been unwritten
>   ext4: factor out bh handles to ext4_da_get_block_prep()
>   ext4: use reserved metadata blocks when splitting extent in endio
>   ext4: factor out ext4_map_{create|query}_blocks()
>   ext4: introduce seq counter for extent entry
>   ext4: add a new iomap aops for regular file's buffered IO path
>   ext4: implement buffered read iomap path
>   ext4: implement buffered write iomap path
>   ext4: implement writeback iomap path
>   ext4: implement mmap iomap path
>   ext4: implement zero_range iomap path
>   ext4: writeback partial blocks before zero range
>   ext4: fall back to buffer_head path for defrag
>   ext4: partially enable iomap for regular file's buffered IO path
>   filemap: support disable large folios on active inode
>   ext4: enable large folio for regular file with iomap buffered IO path
> 
>  fs/ext4/ext4.h              |  14 +-
>  fs/ext4/ext4_jbd2.c         |   6 +
>  fs/ext4/ext4_jbd2.h         |   7 +
>  fs/ext4/extents.c           | 149 +++---
>  fs/ext4/extents_status.c    |  39 +-
>  fs/ext4/extents_status.h    |   4 +-
>  fs/ext4/file.c              |  19 +-
>  fs/ext4/ialloc.c            |   5 +
>  fs/ext4/inode.c             | 891 +++++++++++++++++++++++++++---------
>  fs/ext4/move_extent.c       |  35 ++
>  fs/ext4/page-io.c           | 107 +++++
>  fs/ext4/super.c             |   3 +
>  fs/iomap/buffered-io.c      |  30 +-
>  fs/iomap/trace.h            |  43 +-
>  include/linux/pagemap.h     |  14 +
>  include/trace/events/ext4.h |  31 +-
>  mm/readahead.c              |   6 +-
>  17 files changed, 1109 insertions(+), 294 deletions(-)
> 
> -- 
> 2.39.2
> 
> 
