linux-kernel - [GIT PULL] vfs blocksize

Message-ID: <20240913-vfs-blocksize-ab40822b2366@brauner>
Date: Thu, 19 Sep 2024 15:49:53 +0200
From: Christian Brauner <brauner@...nel.org>
To: Linus Torvalds <torvalds@...ux-foundation.org>
Cc: Christian Brauner <brauner@...nel.org>,
	linux-fsdevel@...r.kernel.org,
	linux-kernel@...r.kernel.org
Subject: [GIT PULL] vfs blocksize

Hey Linus,

Now that the large folio/xarray bug is sorted this one seems ready.

/* Summary */
This contains the vfs infrastructure as well as the xfs bits to enable
support for block sizes (bs) larger than page sizes (ps) plus a few
fixes to related infrastructure.

There have been efforts over the last 16 years to enable Large Block
Sizes (LBS), that is, block sizes in filesystems where bs > ps. Through
these efforts we have learned that one of the main blockers to
supporting bs > ps in filesystems has been the lack of a way to allocate
pages that are at least the filesystem block size on the page cache
where bs > ps.

Thanks to various previous efforts it is possible to support bs > ps in
XFS with only a few changes in XFS itself. Most changes are to the page
cache, which gains minimum folio order support for the target block size
on the filesystem.
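
As a rough illustration of the arithmetic behind a minimum folio order,
here is a small userspace sketch. It is not kernel code: the function
names are hypothetical, and it assumes power-of-two sizes and a 4k base
page unless told otherwise.

```python
def min_folio_order(block_size: int, page_size: int = 4096) -> int:
    """Smallest folio order whose size covers one filesystem block.

    Hypothetical helper for illustration; assumes power-of-two sizes.
    """
    for size in (block_size, page_size):
        if size <= 0 or size & (size - 1):
            raise ValueError("sizes must be positive powers of two")
    if block_size <= page_size:
        return 0  # bs <= ps: single-page folios are already enough
    # order = log2(bs / ps), computed on integers
    return (block_size // page_size).bit_length() - 1


def align_index(index: int, order: int) -> int:
    """Round a page cache index down to a multiple of 2**order pages."""
    return index & ~((1 << order) - 1)


if __name__ == "__main__":
    for bs in (4096, 8192, 16384, 32768, 65536):
        print(f"bs={bs:>6} ps=4096 -> min order {min_folio_order(bs)}")
```

With a 4k base page, a 16k block size needs folios of at least order 2
(four pages), and a folio covering page index 7 would start at index 4.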

A motivation for Large Block Sizes today is to support high-capacity
(many terabytes) QLC SSDs where the internal Indirection Unit (IU) is
typically greater than 4k, which helps reduce the amount of DRAM needed
and so in turn cost and space. In practice this then allows different
architectures to use a base page size of 4k while still enabling support
for block sizes aligned to the larger IUs by relying on high order
folios on the page cache when needed.

It also makes it possible to take advantage of the drive's support for
atomics larger than 4k with buffered IO support in Linux. As described
this year at LSFMM, supporting large atomics greater than 4k enables
databases to remove the need to rely on their own journaling, so they
can disable double buffered writes, which is a feature different cloud
providers are already enabling through custom storage solutions.

/* Testing */

gcc version 14.2.0 (Debian 14.2.0-3)
Debian clang version 16.0.6 (27+b1)

All patches are based on v6.11-rc1 and have been sitting in linux-next.
No build failures or warnings were observed.

A lot of emphasis has been put on testing using kdevops, starting with
an XFS baseline [1]. The testing has been split into regression and
progression.

The whole test suite was run to check for regressions on existing
profiles due to the page cache changes.

The split_huge_page_test selftest was also run on an XFS filesystem to
check that huge page splits in min order chunks are done correctly.

No regressions were found with these patches added on top.

8k, 16k, 32k and 64k block sizes were used during feature testing. To
compare with existing support, an ARM VM with a 64k base page size and
without the patches was used as a reference to check for actual failures
due to LBS support in a 4k base page size system.

No new failures were found with the LBS support.

Some preliminary performance tests were done with fio on XFS with a 4k
block size against pmem and NVMe, using buffered IO and Direct IO, on
vanilla vs these patches applied. There were no regressions detected.

sysbench was run against postgres and mysql for several hours on LBS XFS
without any issues.

There's also an eBPF tool called blkalgn [2] to see if IO sent to the
device is aligned and at least the filesystem block size in length.
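
The property being checked can be sketched in a few lines. This is only
an illustration of the condition described above, not how blkalgn is
implemented; the function name and its byte-based parameters are
assumptions.

```python
def io_matches_block_size(offset: int, length: int, block_size: int) -> bool:
    """True if an IO starts on a block boundary and spans at least one block.

    Hypothetical helper; offset and length are in bytes.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    aligned = offset % block_size == 0
    long_enough = length >= block_size
    return aligned and long_enough


if __name__ == "__main__":
    bs = 16384
    # (offset, length) pairs in bytes, made up for illustration
    for offset, length in ((32768, 16384), (4096, 16384), (0, 4096)):
        print(offset, length, io_matches_block_size(offset, length, bs))
```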

[1] https://github.com/linux-kdevops/kdevops/blob/master/docs/xfs-bugs.md
[2] https://github.com/iovisor/bcc/pull/4813

/* Conflicts */

Merge conflicts with mainline
=============================

No known conflicts.

Merge conflicts with other trees
================================

No known conflicts.

The following changes since commit 8400291e289ee6b2bf9779ff1c83a291501f017b:

  Linux 6.11-rc1 (2024-07-28 14:19:55 -0700)

are available in the Git repository at:

  git@...olite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs tags/vfs-6.12.blocksize

for you to fetch changes up to 71fdfcdd0dc8344ce6a7887b4675c7700efeffa6:

  Documentation: iomap: fix a typo (2024-09-12 14:07:17 +0200)

Please consider pulling these changes from the signed vfs-6.12.blocksize tag.

Thanks!
Christian

----------------------------------------------------------------
vfs-6.12.blocksize

----------------------------------------------------------------
Brian Foster (2):
      iomap: fix handling of dirty folios over unwritten extents
      iomap: make zero range flush conditional on unwritten mappings

Christian Brauner (2):
      Merge patch series "enable bs > ps in XFS"
      Merge patch series "iomap: flush dirty cache over unwritten mappings on zero range"

Christoph Hellwig (5):
      iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release
      iomap: improve shared block detection in iomap_unshare_iter
      iomap: pass flags to iomap_file_buffered_write_punch_delalloc
      iomap: pass the iomap to the punch callback
      iomap: remove the iomap_file_buffered_write_punch_delalloc return value

Dave Chinner (1):
      xfs: use kvmalloc for xattr buffers

Dennis Lam (1):
      docs:filesystems: fix spelling and grammar mistakes in iomap design page

Josef Bacik (1):
      iomap: add a private argument for iomap_file_buffered_write

Luis Chamberlain (2):
      mm: split a folio in minimum folio order chunks
      iomap: remove set_memor_ro() on zero page

Matthew Wilcox (Oracle) (1):
      fs: Allow fine-grained control of folio sizes

Pankaj Raghav (9):
      filemap: allocate mapping_min_order folios in the page cache
      readahead: allocate folios with mapping_min_order in readahead
      filemap: cap PTE range to be created to allowed zero fill in folio_map_range()
      iomap: fix iomap_dio_zero() for fs bs > system page size
      xfs: expose block size in stat
      xfs: make the calculation generic in xfs_sb_validate_fsb_count()
      xfs: enable block size larger than page size support
      filemap: fix htmldoc warning for mapping_align_index()
      Documentation: iomap: fix a typo

 Documentation/filesystems/iomap/design.rst |   8 +-
 block/fops.c                               |   2 +-
 fs/gfs2/file.c                             |   2 +-
 fs/iomap/buffered-io.c                     | 199 ++++++++++++++++++-----------
 fs/iomap/direct-io.c                       |  42 +++++-
 fs/xfs/libxfs/xfs_attr_leaf.c              |  15 +--
 fs/xfs/libxfs/xfs_ialloc.c                 |   5 +
 fs/xfs/libxfs/xfs_shared.h                 |   3 +
 fs/xfs/xfs_file.c                          |   2 +-
 fs/xfs/xfs_icache.c                        |   6 +-
 fs/xfs/xfs_iomap.c                         |  19 +--
 fs/xfs/xfs_iops.c                          |  12 +-
 fs/xfs/xfs_mount.c                         |   8 +-
 fs/xfs/xfs_super.c                         |  28 ++--
 fs/zonefs/file.c                           |   2 +-
 include/linux/huge_mm.h                    |  28 +++-
 include/linux/iomap.h                      |  13 +-
 include/linux/pagemap.h                    | 124 ++++++++++++++++--
 mm/filemap.c                               |  36 ++++--
 mm/huge_memory.c                           |  65 +++++++++-
 mm/readahead.c                             |  83 +++++++++---
 21 files changed, 506 insertions(+), 196 deletions(-)