Message-ID: <20250618111544.22602-2-bagasdotme@gmail.com>
Date: Wed, 18 Jun 2025 18:15:34 +0700
From: Bagas Sanjaya <bagasdotme@...il.com>
To: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Linux Documentation <linux-doc@...r.kernel.org>,
Linux ext4 <linux-ext4@...r.kernel.org>
Cc: "Theodore Ts'o" <tytso@....edu>,
Andreas Dilger <adilger.kernel@...ger.ca>,
Jonathan Corbet <corbet@....net>,
"Darrick J. Wong" <djwong@...nel.org>,
"Ritesh Harjani (IBM)" <ritesh.list@...il.com>,
Bagas Sanjaya <bagasdotme@...il.com>
Subject: [PATCH 1/4] Documentation: ext4: Slurp included subdocs in high-level overview docs
Slurp the subdocuments of the high-level ext4 design overview
(overview.rst) by replacing each reST include:: directive with the
contents of the file it includes.
Signed-off-by: Bagas Sanjaya <bagasdotme@...il.com>
---
Documentation/filesystems/ext4/allocators.rst | 56 --
.../filesystems/ext4/atomic_writes.rst | 225 -----
Documentation/filesystems/ext4/bigalloc.rst | 34 -
Documentation/filesystems/ext4/blockgroup.rst | 135 ---
Documentation/filesystems/ext4/blocks.rst | 144 ---
Documentation/filesystems/ext4/checksums.rst | 73 --
Documentation/filesystems/ext4/eainode.rst | 18 -
Documentation/filesystems/ext4/inlinedata.rst | 37 -
Documentation/filesystems/ext4/overview.rst | 819 +++++++++++++++++-
.../filesystems/ext4/special_inodes.rst | 55 --
Documentation/filesystems/ext4/verity.rst | 44 -
11 files changed, 809 insertions(+), 831 deletions(-)
delete mode 100644 Documentation/filesystems/ext4/allocators.rst
delete mode 100644 Documentation/filesystems/ext4/atomic_writes.rst
delete mode 100644 Documentation/filesystems/ext4/bigalloc.rst
delete mode 100644 Documentation/filesystems/ext4/blockgroup.rst
delete mode 100644 Documentation/filesystems/ext4/blocks.rst
delete mode 100644 Documentation/filesystems/ext4/checksums.rst
delete mode 100644 Documentation/filesystems/ext4/eainode.rst
delete mode 100644 Documentation/filesystems/ext4/inlinedata.rst
delete mode 100644 Documentation/filesystems/ext4/special_inodes.rst
delete mode 100644 Documentation/filesystems/ext4/verity.rst
diff --git a/Documentation/filesystems/ext4/allocators.rst b/Documentation/filesystems/ext4/allocators.rst
deleted file mode 100644
index 7aa85152ace3d0..00000000000000
--- a/Documentation/filesystems/ext4/allocators.rst
+++ /dev/null
@@ -1,56 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Block and Inode Allocation Policy
----------------------------------
-
-ext4 recognizes (better than ext3, anyway) that data locality is
-generally a desirably quality of a filesystem. On a spinning disk,
-keeping related blocks near each other reduces the amount of movement
-that the head actuator and disk must perform to access a data block,
-thus speeding up disk IO. On an SSD there of course are no moving parts,
-but locality can increase the size of each transfer request while
-reducing the total number of requests. This locality may also have the
-effect of concentrating writes on a single erase block, which can speed
-up file rewrites significantly. Therefore, it is useful to reduce
-fragmentation whenever possible.
-
-The first tool that ext4 uses to combat fragmentation is the multi-block
-allocator. When a file is first created, the block allocator
-speculatively allocates 8KiB of disk space to the file on the assumption
-that the space will get written soon. When the file is closed, the
-unused speculative allocations are of course freed, but if the
-speculation is correct (typically the case for full writes of small
-files) then the file data gets written out in a single multi-block
-extent. A second related trick that ext4 uses is delayed allocation.
-Under this scheme, when a file needs more blocks to absorb file writes,
-the filesystem defers deciding the exact placement on the disk until all
-the dirty buffers are being written out to disk. By not committing to a
-particular placement until it's absolutely necessary (the commit timeout
-is hit, or sync() is called, or the kernel runs out of memory), the hope
-is that the filesystem can make better location decisions.
-
-The third trick that ext4 (and ext3) uses is that it tries to keep a
-file's data blocks in the same block group as its inode. This cuts down
-on the seek penalty when the filesystem first has to read a file's inode
-to learn where the file's data blocks live and then seek over to the
-file's data blocks to begin I/O operations.
-
-The fourth trick is that all the inodes in a directory are placed in the
-same block group as the directory, when feasible. The working assumption
-here is that all the files in a directory might be related, therefore it
-is useful to try to keep them all together.
-
-The fifth trick is that the disk volume is cut up into 128MB block
-groups; these mini-containers are used as outlined above to try to
-maintain data locality. However, there is a deliberate quirk -- when a
-directory is created in the root directory, the inode allocator scans
-the block groups and puts that directory into the least heavily loaded
-block group that it can find. This encourages directories to spread out
-over a disk; as the top-level directory/file blobs fill up one block
-group, the allocators simply move on to the next block group. Allegedly
-this scheme evens out the loading on the block groups, though the author
-suspects that the directories which are so unlucky as to land towards
-the end of a spinning drive get a raw deal performance-wise.
-
-Of course if all of these mechanisms fail, one can always use e4defrag
-to defragment files.
diff --git a/Documentation/filesystems/ext4/atomic_writes.rst b/Documentation/filesystems/ext4/atomic_writes.rst
deleted file mode 100644
index f65767df3620d5..00000000000000
--- a/Documentation/filesystems/ext4/atomic_writes.rst
+++ /dev/null
@@ -1,225 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-.. _atomic_writes:
-
-Atomic Block Writes
--------------------------
-
-Introduction
-~~~~~~~~~~~~
-
-Atomic (untorn) block writes ensure that either the entire write is committed
-to disk or none of it is. This prevents "torn writes" during power loss or
-system crashes. The ext4 filesystem supports atomic writes (only with Direct
-I/O) on regular files with extents, provided the underlying storage device
-supports hardware atomic writes. This is supported in the following two ways:
-
-1. **Single-fsblock Atomic Writes**:
- EXT4's supports atomic write operations with a single filesystem block since
- v6.13. In this the atomic write unit minimum and maximum sizes are both set
- to filesystem blocksize.
- e.g. doing atomic write of 16KB with 16KB filesystem blocksize on 64KB
- pagesize system is possible.
-
-2. **Multi-fsblock Atomic Writes with Bigalloc**:
- EXT4 now also supports atomic writes spanning multiple filesystem blocks
- using a feature known as bigalloc. The atomic write unit's minimum and
- maximum sizes are determined by the filesystem block size and cluster size,
- based on the underlying device’s supported atomic write unit limits.
-
-Requirements
-~~~~~~~~~~~~
-
-Basic requirements for atomic writes in ext4:
-
- 1. The extents feature must be enabled (default for ext4)
- 2. The underlying block device must support atomic writes
- 3. For single-fsblock atomic writes:
-
- 1. A filesystem with appropriate block size (up to the page size)
- 4. For multi-fsblock atomic writes:
-
- 1. The bigalloc feature must be enabled
- 2. The cluster size must be appropriately configured
-
-NOTE: EXT4 does not support software or COW based atomic write, which means
-atomic writes on ext4 are only supported if underlying storage device supports
-it.
-
-Multi-fsblock Implementation Details
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The bigalloc feature changes ext4 to allocate in units of multiple filesystem
-blocks, also known as clusters. With bigalloc each bit within block bitmap
-represents cluster (power of 2 number of blocks) rather than individual
-filesystem blocks.
-EXT4 supports multi-fsblock atomic writes with bigalloc, subject to the
-following constraints. The minimum atomic write size is the larger of the fs
-block size and the minimum hardware atomic write unit; and the maximum atomic
-write size is smaller of the bigalloc cluster size and the maximum hardware
-atomic write unit. Bigalloc ensures that all allocations are aligned to the
-cluster size, which satisfies the LBA alignment requirements of the hardware
-device if the start of the partition/logical volume is itself aligned correctly.
-
-Here is the block allocation strategy in bigalloc for atomic writes:
-
- * For regions with fully mapped extents, no additional work is needed
- * For append writes, a new mapped extent is allocated
- * For regions that are entirely holes, unwritten extent is created
- * For large unwritten extents, the extent gets split into two unwritten
- extents of appropriate requested size
- * For mixed mapping regions (combinations of holes, unwritten extents, or
- mapped extents), ext4_map_blocks() is called in a loop with
- EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
- mapped extent by writing zeroes to it and converting any unwritten extents to
- written, if found within the range.
-
-Note: Writing on a single contiguous underlying extent, whether mapped or
-unwritten, is not inherently problematic. However, writing to a mixed mapping
-region (i.e. one containing a combination of mapped and unwritten extents)
-must be avoided when performing atomic writes.
-
-The reason is that, atomic writes when issued via pwritev2() with the RWF_ATOMIC
-flag, requires that either all data is written or none at all. In the event of
-a system crash or unexpected power loss during the write operation, the affected
-region (when later read) must reflect either the complete old data or the
-complete new data, but never a mix of both.
-
-To enforce this guarantee, we ensure that the write target is backed by
-a single, contiguous extent before any data is written. This is critical because
-ext4 defers the conversion of unwritten extents to written extents until the I/O
-completion path (typically in ->end_io()). If a write is allowed to proceed over
-a mixed mapping region (with mapped and unwritten extents) and a failure occurs
-mid-write, the system could observe partially updated regions after reboot, i.e.
-new data over mapped areas, and stale (old) data over unwritten extents that
-were never marked written. This violates the atomicity and/or torn write
-prevention guarantee.
-
-To prevent such torn writes, ext4 proactively allocates a single contiguous
-extent for the entire requested region in ``ext4_iomap_alloc`` via
-``ext4_map_blocks_atomic()``. EXT4 also force commits the current journalling
-transaction in case if allocation is done over mixed mapping. This ensures any
-pending metadata updates (like unwritten to written extents conversion) in this
-range are in consistent state with the file data blocks, before performing the
-actual write I/O. If the commit fails, the whole I/O must be aborted to prevent
-from any possible torn writes.
-Only after this step, the actual data write operation is performed by the iomap.
-
-Handling Split Extents Across Leaf Blocks
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-There can be a special edge case where we have logically and physically
-contiguous extents stored in separate leaf nodes of the on-disk extent tree.
-This occurs because on-disk extent tree merges only happens within the leaf
-blocks except for a case where we have 2-level tree which can get merged and
-collapsed entirely into the inode.
-If such a layout exists and, in the worst case, the extent status cache entries
-are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
-a single contiguous extent for these split leaf extents.
-
-To address this edge case, a new get block flag
-``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS flag`` is added to enhance the
-``ext4_map_query_blocks()`` lookup behavior.
-
-This new get block flag allows ``ext4_map_blocks()`` to first check if there is
-an entry in the extent status cache for the full range.
-If not present, it consults the on-disk extent tree using
-``ext4_map_query_blocks()``.
-If the located extent is at the end of a leaf node, it probes the next logical
-block (lblk) to detect a contiguous extent in the adjacent leaf.
-
-For now only one additional leaf block is queried to maintain efficiency, as
-atomic writes are typically constrained to small sizes
-(e.g. [blocksize, clustersize]).
-
-
-Handling Journal transactions
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-To support multi-fsblock atomic writes, we ensure enough journal credits are
-reserved during:
-
- 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
- could be a mixed mapping for the underlying requested range. If yes, then we
- reserve credits of up to ``m_len``, assuming every alternate block can be
- an unwritten extent followed by a hole.
-
- 2. During ``->end_io()`` call, we make sure a single transaction is started for
- doing unwritten-to-written conversion. The loop for conversion is mainly
- only required to handle a split extent across leaf blocks.
-
-How to
-------
-
-Creating Filesystems with Atomic Write Support
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-First check the atomic write units supported by block device.
-See :ref:`atomic_write_bdev_support` for more details.
-
-For single-fsblock atomic writes with a larger block size
-(on systems with block size < page size):
-
-.. code-block:: bash
-
- # Create an ext4 filesystem with a 16KB block size
- # (requires page size >= 16KB)
- mkfs.ext4 -b 16384 /dev/device
-
-For multi-fsblock atomic writes with bigalloc:
-
-.. code-block:: bash
-
- # Create an ext4 filesystem with bigalloc and 64KB cluster size
- mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device
-
-Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
-and ``-O bigalloc`` enables the bigalloc feature.
-
-Application Interface
-~~~~~~~~~~~~~~~~~~~~~
-
-Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
-to perform atomic writes:
-
-.. code-block:: c
-
- pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);
-
-The write must be aligned to the filesystem's block size and not exceed the
-filesystem's maximum atomic write unit size.
-See ``generic_atomic_write_valid()`` for more details.
-
-``statx()`` system call with ``STATX_WRITE_ATOMIC`` flag can provides following
-details:
-
- * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
- * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
- * ``stx_atomic_write_segments_max``: Upper limit for segments. The number of
- separate memory buffers that can be gathered into a write operation
- (e.g., the iovcnt parameter for IOV_ITER). Currently, this is always set to one.
-
-The STATX_ATTR_WRITE_ATOMIC flag in ``statx->attributes`` is set if atomic
-writes are supported.
-
-.. _atomic_write_bdev_support:
-
-Hardware Support
-----------------
-
-The underlying storage device must support atomic write operations.
-Modern NVMe and SCSI devices often provide this capability.
-The Linux kernel exposes this information through sysfs:
-
-* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
-* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size
-
-Nonzero values for these attributes indicate that the device supports
-atomic writes.
-
-See Also
---------
-
-* :doc:`bigalloc` - Documentation on the bigalloc feature
-* :doc:`allocators` - Documentation on block allocation in ext4
-* Support for atomic block writes in 6.13:
- https://lwn.net/Articles/1009298/
diff --git a/Documentation/filesystems/ext4/bigalloc.rst b/Documentation/filesystems/ext4/bigalloc.rst
deleted file mode 100644
index 976a180b209c2a..00000000000000
--- a/Documentation/filesystems/ext4/bigalloc.rst
+++ /dev/null
@@ -1,34 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Bigalloc
---------
-
-At the moment, the default size of a block is 4KiB, which is a commonly
-supported page size on most MMU-capable hardware. This is fortunate, as
-ext4 code is not prepared to handle the case where the block size
-exceeds the page size. However, for a filesystem of mostly huge files,
-it is desirable to be able to allocate disk blocks in units of multiple
-blocks to reduce both fragmentation and metadata overhead. The
-bigalloc feature provides exactly this ability.
-
-The bigalloc feature (EXT4_FEATURE_RO_COMPAT_BIGALLOC) changes ext4 to
-use clustered allocation, so that each bit in the ext4 block allocation
-bitmap addresses a power of two number of blocks. For example, if the
-file system is mainly going to be storing large files in the 4-32
-megabyte range, it might make sense to set a cluster size of 1 megabyte.
-This means that each bit in the block allocation bitmap now addresses
-256 4k blocks. This shrinks the total size of the block allocation
-bitmaps for a 2T file system from 64 megabytes to 256 kilobytes. It also
-means that a block group addresses 32 gigabytes instead of 128 megabytes,
-also shrinking the amount of file system overhead for metadata.
-
-The administrator can set a block cluster size at mkfs time (which is
-stored in the s_log_cluster_size field in the superblock); from then
-on, the block bitmaps track clusters, not individual blocks. This means
-that block groups can be several gigabytes in size (instead of just
-128MiB); however, the minimum allocation unit becomes a cluster, not a
-block, even for directories. TaoBao had a patchset to extend the “use
-units of clusters instead of blocks” to the extent tree, though it is
-not clear where those patches went-- they eventually morphed into
-“extent tree v2” but that code has not landed as of May 2015.
-
diff --git a/Documentation/filesystems/ext4/blockgroup.rst b/Documentation/filesystems/ext4/blockgroup.rst
deleted file mode 100644
index ed5a5cac6d40e1..00000000000000
--- a/Documentation/filesystems/ext4/blockgroup.rst
+++ /dev/null
@@ -1,135 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Layout
-------
-
-The layout of a standard block group is approximately as follows (each
-of these fields is discussed in a separate section below):
-
-.. list-table::
- :widths: 1 1 1 1 1 1 1 1
- :header-rows: 1
-
- * - Group 0 Padding
- - ext4 Super Block
- - Group Descriptors
- - Reserved GDT Blocks
- - Data Block Bitmap
- - inode Bitmap
- - inode Table
- - Data Blocks
- * - 1024 bytes
- - 1 block
- - many blocks
- - many blocks
- - 1 block
- - 1 block
- - many blocks
- - many more blocks
-
-For the special case of block group 0, the first 1024 bytes are unused,
-to allow for the installation of x86 boot sectors and other oddities.
-The superblock will start at offset 1024 bytes, whichever block that
-happens to be (usually 0). However, if for some reason the block size =
-1024, then block 0 is marked in use and the superblock goes in block 1.
-For all other block groups, there is no padding.
-
-The ext4 driver primarily works with the superblock and the group
-descriptors that are found in block group 0. Redundant copies of the
-superblock and group descriptors are written to some of the block groups
-across the disk in case the beginning of the disk gets trashed, though
-not all block groups necessarily host a redundant copy (see following
-paragraph for more details). If the group does not have a redundant
-copy, the block group begins with the data block bitmap. Note also that
-when the filesystem is freshly formatted, mkfs will allocate “reserve
-GDT block” space after the block group descriptors and before the start
-of the block bitmaps to allow for future expansion of the filesystem. By
-default, a filesystem is allowed to increase in size by a factor of
-1024x over the original filesystem size.
-
-The location of the inode table is given by ``grp.bg_inode_table_*``. It
-is continuous range of blocks large enough to contain
-``sb.s_inodes_per_group * sb.s_inode_size`` bytes.
-
-As for the ordering of items in a block group, it is generally
-established that the super block and the group descriptor table, if
-present, will be at the beginning of the block group. The bitmaps and
-the inode table can be anywhere, and it is quite possible for the
-bitmaps to come after the inode table, or for both to be in different
-groups (flex_bg). Leftover space is used for file data blocks, indirect
-block maps, extent tree blocks, and extended attributes.
-
-Flexible Block Groups
----------------------
-
-Starting in ext4, there is a new feature called flexible block groups
-(flex_bg). In a flex_bg, several block groups are tied together as one
-logical block group; the bitmap spaces and the inode table space in the
-first block group of the flex_bg are expanded to include the bitmaps
-and inode tables of all other block groups in the flex_bg. For example,
-if the flex_bg size is 4, then group 0 will contain (in order) the
-superblock, group descriptors, data block bitmaps for groups 0-3, inode
-bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining
-space in group 0 is for file data. The effect of this is to group the
-block group metadata close together for faster loading, and to enable
-large files to be continuous on disk. Backup copies of the superblock
-and group descriptors are always at the beginning of block groups, even
-if flex_bg is enabled. The number of block groups that make up a
-flex_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.
-
-Meta Block Groups
------------------
-
-Without the option META_BG, for safety concerns, all block group
-descriptors copies are kept in the first block group. Given the default
-128MiB(2^27 bytes) block group size and 64-byte group descriptors, ext4
-can have at most 2^27/64 = 2^21 block groups. This limits the entire
-filesystem size to 2^21 * 2^27 = 2^48bytes or 256TiB.
-
-The solution to this problem is to use the metablock group feature
-(META_BG), which is already in ext3 for all 2.6 releases. With the
-META_BG feature, ext4 filesystems are partitioned into many metablock
-groups. Each metablock group is a cluster of block groups whose group
-descriptor structures can be stored in a single disk block. For ext4
-filesystems with 4 KB block size, a single metablock group partition
-includes 64 block groups, or 8 GiB of disk space. The metablock group
-feature moves the location of the group descriptors from the congested
-first block group of the whole filesystem into the first group of each
-metablock group itself. The backups are in the second and last group of
-each metablock group. This increases the 2^21 maximum block groups limit
-to the hard limit 2^32, allowing support for a 512PiB filesystem.
-
-The change in the filesystem format replaces the current scheme where
-the superblock is followed by a variable-length set of block group
-descriptors. Instead, the superblock and a single block group descriptor
-block is placed at the beginning of the first, second, and last block
-groups in a meta-block group. A meta-block group is a collection of
-block groups which can be described by a single block group descriptor
-block. Since the size of the block group descriptor structure is 64
-bytes, a meta-block group contains 16 block groups for filesystems with
-a 1KB block size, and 64 block groups for filesystems with a 4KB
-blocksize. Filesystems can either be created using this new block group
-descriptor layout, or existing filesystems can be resized on-line, and
-the field s_first_meta_bg in the superblock will indicate the first
-block group using this new layout.
-
-Please see an important note about ``BLOCK_UNINIT`` in the section about
-block and inode bitmaps.
-
-Lazy Block Group Initialization
--------------------------------
-
-A new feature for ext4 are three block group descriptor flags that
-enable mkfs to skip initializing other parts of the block group
-metadata. Specifically, the INODE_UNINIT and BLOCK_UNINIT flags mean
-that the inode and block bitmaps for that group can be calculated and
-therefore the on-disk bitmap blocks are not initialized. This is
-generally the case for an empty block group or a block group containing
-only fixed-location block group metadata. The INODE_ZEROED flag means
-that the inode table has been initialized; mkfs will unset this flag and
-rely on the kernel to initialize the inode tables in the background.
-
-By not writing zeroes to the bitmaps and inode table, mkfs time is
-reduced considerably. Note the feature flag is RO_COMPAT_GDT_CSUM,
-but the dumpe2fs output prints this as “uninit_bg”. They are the same
-thing.
diff --git a/Documentation/filesystems/ext4/blocks.rst b/Documentation/filesystems/ext4/blocks.rst
deleted file mode 100644
index b0f80ea87c90e1..00000000000000
--- a/Documentation/filesystems/ext4/blocks.rst
+++ /dev/null
@@ -1,144 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Blocks
-------
-
-ext4 allocates storage space in units of “blocks”. A block is a group of
-sectors between 1KiB and 64KiB, and the number of sectors must be an
-integral power of 2. Blocks are in turn grouped into larger units called
-block groups. Block size is specified at mkfs time and typically is
-4KiB. You may experience mounting problems if block size is greater than
-page size (i.e. 64KiB blocks on a i386 which only has 4KiB memory
-pages). By default a filesystem can contain 2^32 blocks; if the '64bit'
-feature is enabled, then a filesystem can have 2^64 blocks. The location
-of structures is stored in terms of the block number the structure lives
-in and not the absolute offset on disk.
-
-For 32-bit filesystems, limits are as follows:
-
-.. list-table::
- :widths: 1 1 1 1 1
- :header-rows: 1
-
- * - Item
- - 1KiB
- - 2KiB
- - 4KiB
- - 64KiB
- * - Blocks
- - 2^32
- - 2^32
- - 2^32
- - 2^32
- * - Inodes
- - 2^32
- - 2^32
- - 2^32
- - 2^32
- * - File System Size
- - 4TiB
- - 8TiB
- - 16TiB
- - 256TiB
- * - Blocks Per Block Group
- - 8,192
- - 16,384
- - 32,768
- - 524,288
- * - Inodes Per Block Group
- - 8,192
- - 16,384
- - 32,768
- - 524,288
- * - Block Group Size
- - 8MiB
- - 32MiB
- - 128MiB
- - 32GiB
- * - Blocks Per File, Extents
- - 2^32
- - 2^32
- - 2^32
- - 2^32
- * - Blocks Per File, Block Maps
- - 16,843,020
- - 134,480,396
- - 1,074,791,436
- - 4,398,314,962,956 (really 2^32 due to field size limitations)
- * - File Size, Extents
- - 4TiB
- - 8TiB
- - 16TiB
- - 256TiB
- * - File Size, Block Maps
- - 16GiB
- - 256GiB
- - 4TiB
- - 256TiB
-
-For 64-bit filesystems, limits are as follows:
-
-.. list-table::
- :widths: 1 1 1 1 1
- :header-rows: 1
-
- * - Item
- - 1KiB
- - 2KiB
- - 4KiB
- - 64KiB
- * - Blocks
- - 2^64
- - 2^64
- - 2^64
- - 2^64
- * - Inodes
- - 2^32
- - 2^32
- - 2^32
- - 2^32
- * - File System Size
- - 16ZiB
- - 32ZiB
- - 64ZiB
- - 1YiB
- * - Blocks Per Block Group
- - 8,192
- - 16,384
- - 32,768
- - 524,288
- * - Inodes Per Block Group
- - 8,192
- - 16,384
- - 32,768
- - 524,288
- * - Block Group Size
- - 8MiB
- - 32MiB
- - 128MiB
- - 32GiB
- * - Blocks Per File, Extents
- - 2^32
- - 2^32
- - 2^32
- - 2^32
- * - Blocks Per File, Block Maps
- - 16,843,020
- - 134,480,396
- - 1,074,791,436
- - 4,398,314,962,956 (really 2^32 due to field size limitations)
- * - File Size, Extents
- - 4TiB
- - 8TiB
- - 16TiB
- - 256TiB
- * - File Size, Block Maps
- - 16GiB
- - 256GiB
- - 4TiB
- - 256TiB
-
-Note: Files not using extents (i.e. files using block maps) must be
-placed within the first 2^32 blocks of a filesystem. Files with extents
-must be placed within the first 2^48 blocks of a filesystem. It's not
-clear what happens with larger filesystems.
diff --git a/Documentation/filesystems/ext4/checksums.rst b/Documentation/filesystems/ext4/checksums.rst
deleted file mode 100644
index e232749daf5f30..00000000000000
--- a/Documentation/filesystems/ext4/checksums.rst
+++ /dev/null
@@ -1,73 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Checksums
----------
-
-Starting in early 2012, metadata checksums were added to all major ext4
-and jbd2 data structures. The associated feature flag is metadata_csum.
-The desired checksum algorithm is indicated in the superblock, though as
-of October 2012 the only supported algorithm is crc32c. Some data
-structures did not have space to fit a full 32-bit checksum, so only the
-lower 16 bits are stored. Enabling the 64bit feature increases the data
-structure size so that full 32-bit checksums can be stored for many data
-structures. However, existing 32-bit filesystems cannot be extended to
-enable 64bit mode, at least not without the experimental resize2fs
-patches to do so.
-
-Existing filesystems can have checksumming added by running
-``tune2fs -O metadata_csum`` against the underlying device. If tune2fs
-encounters directory blocks that lack sufficient empty space to add a
-checksum, it will request that you run ``e2fsck -D`` to have the
-directories rebuilt with checksums. This has the added benefit of
-removing slack space from the directory files and rebalancing the htree
-indexes. If you _ignore_ this step, your directories will not be
-protected by a checksum!
-
-The following table describes the data elements that go into each type
-of checksum. The checksum function is whatever the superblock describes
-(crc32c as of October 2013) unless noted otherwise.
-
-.. list-table::
- :widths: 20 8 50
- :header-rows: 1
-
- * - Metadata
- - Length
- - Ingredients
- * - Superblock
- - __le32
- - The entire superblock up to the checksum field. The UUID lives inside
- the superblock.
- * - MMP
- - __le32
- - UUID + the entire MMP block up to the checksum field.
- * - Extended Attributes
- - __le32
- - UUID + the entire extended attribute block. The checksum field is set to
- zero.
- * - Directory Entries
- - __le32
- - UUID + inode number + inode generation + the directory block up to the
- fake entry enclosing the checksum field.
- * - HTREE Nodes
- - __le32
- - UUID + inode number + inode generation + all valid extents + HTREE tail.
- The checksum field is set to zero.
- * - Extents
- - __le32
- - UUID + inode number + inode generation + the entire extent block up to
- the checksum field.
- * - Bitmaps
- - __le32 or __le16
- - UUID + the entire bitmap. Checksums are stored in the group descriptor,
- and truncated if the group descriptor size is 32 bytes (i.e. ^64bit)
- * - Inodes
- - __le32
- - UUID + inode number + inode generation + the entire inode. The checksum
- field is set to zero. Each inode has its own checksum.
- * - Group Descriptors
- - __le16
- - If metadata_csum, then UUID + group number + the entire descriptor;
- else if gdt_csum, then crc16(UUID + group number + the entire
- descriptor). In all cases, only the lower 16 bits are stored.
-
diff --git a/Documentation/filesystems/ext4/eainode.rst b/Documentation/filesystems/ext4/eainode.rst
deleted file mode 100644
index 7a2ef26b064ac0..00000000000000
--- a/Documentation/filesystems/ext4/eainode.rst
+++ /dev/null
@@ -1,18 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Large Extended Attribute Values
--------------------------------
-
-To enable ext4 to store extended attribute values that do not fit in the
-inode or in the single extended attribute block attached to an inode,
-the EA_INODE feature allows us to store the value in the data blocks of
-a regular file inode. This “EA inode” is linked only from the extended
-attribute name index and must not appear in a directory entry. The
-inode's i_atime field is used to store a checksum of the xattr value;
-and i_ctime/i_version store a 64-bit reference count, which enables
-sharing of large xattr values between multiple owning inodes. For
-backward compatibility with older versions of this feature, the
-i_mtime/i_generation *may* store a back-reference to the inode number
-and i_generation of the **one** owning inode (in cases where the EA
-inode is not referenced by multiple inodes) to verify that the EA inode
-is the correct one being accessed.
diff --git a/Documentation/filesystems/ext4/inlinedata.rst b/Documentation/filesystems/ext4/inlinedata.rst
deleted file mode 100644
index a728af0d2fd0c5..00000000000000
--- a/Documentation/filesystems/ext4/inlinedata.rst
+++ /dev/null
@@ -1,37 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Inline Data
------------
-
-The inline data feature was designed to handle the case that a file's
-data is so tiny that it readily fits inside the inode, which
-(theoretically) reduces disk block consumption and reduces seeks. If the
-file is smaller than 60 bytes, then the data are stored inline in
-``inode.i_block``. If the rest of the file would fit inside the extended
-attribute space, then it might be found as an extended attribute
-“system.data” within the inode body (“ibody EA”). This of course
-constrains the amount of extended attributes one can attach to an inode.
-If the data size increases beyond i_block + ibody EA, a regular block
-is allocated and the contents moved to that block.
-
-Pending a change to compact the extended attribute key used to store
-inline data, one ought to be able to store 160 bytes of data in a
-256-byte inode (as of June 2015, when i_extra_isize is 28). Prior to
-that, the limit was 156 bytes due to inefficient use of inode space.
-
-The inline data feature requires the presence of an extended attribute
-for “system.data”, even if the attribute value is zero length.
-
-Inline Directories
-~~~~~~~~~~~~~~~~~~
-
-The first four bytes of i_block are the inode number of the parent
-directory. Following that is a 56-byte space for an array of directory
-entries; see ``struct ext4_dir_entry``. If there is a “system.data”
-attribute in the inode body, the EA value is an array of
-``struct ext4_dir_entry`` as well. Note that for inline directories, the
-i_block and EA space are treated as separate dirent blocks; directory
-entries cannot span the two.
-
-Inline directory entries are not checksummed, as the inode checksum
-should protect all inline data contents.
diff --git a/Documentation/filesystems/ext4/overview.rst b/Documentation/filesystems/ext4/overview.rst
index 9d4054c17ecb7b..f402ba37179f02 100644
--- a/Documentation/filesystems/ext4/overview.rst
+++ b/Documentation/filesystems/ext4/overview.rst
@@ -16,13 +16,812 @@ All fields in ext4 are written to disk in little-endian order. HOWEVER,
all fields in jbd2 (the journal) are written to disk in big-endian
order.
-.. include:: blocks.rst
-.. include:: blockgroup.rst
-.. include:: special_inodes.rst
-.. include:: allocators.rst
-.. include:: checksums.rst
-.. include:: bigalloc.rst
-.. include:: inlinedata.rst
-.. include:: eainode.rst
-.. include:: verity.rst
-.. include:: atomic_writes.rst
+Blocks
+------
+
+ext4 allocates storage space in units of “blocks”. A block is a group of
+sectors between 1KiB and 64KiB, and the number of sectors must be an
+integral power of 2. Blocks are in turn grouped into larger units called
+block groups. Block size is specified at mkfs time and typically is
+4KiB. You may experience mounting problems if block size is greater than
+page size (e.g. 64KiB blocks on an i386, which only has 4KiB memory
+pages). By default a filesystem can contain 2^32 blocks; if the '64bit'
+feature is enabled, then a filesystem can have 2^64 blocks. The location
+of structures is stored in terms of the block number the structure lives
+in and not the absolute offset on disk.
+
+For 32-bit filesystems, limits are as follows:
+
+.. list-table::
+ :widths: 1 1 1 1 1
+ :header-rows: 1
+
+ * - Item
+ - 1KiB
+ - 2KiB
+ - 4KiB
+ - 64KiB
+ * - Blocks
+ - 2^32
+ - 2^32
+ - 2^32
+ - 2^32
+ * - Inodes
+ - 2^32
+ - 2^32
+ - 2^32
+ - 2^32
+ * - File System Size
+ - 4TiB
+ - 8TiB
+ - 16TiB
+ - 256TiB
+ * - Blocks Per Block Group
+ - 8,192
+ - 16,384
+ - 32,768
+ - 524,288
+ * - Inodes Per Block Group
+ - 8,192
+ - 16,384
+ - 32,768
+ - 524,288
+ * - Block Group Size
+ - 8MiB
+ - 32MiB
+ - 128MiB
+ - 32GiB
+ * - Blocks Per File, Extents
+ - 2^32
+ - 2^32
+ - 2^32
+ - 2^32
+ * - Blocks Per File, Block Maps
+ - 16,843,020
+ - 134,480,396
+ - 1,074,791,436
+ - 4,398,314,962,956 (really 2^32 due to field size limitations)
+ * - File Size, Extents
+ - 4TiB
+ - 8TiB
+ - 16TiB
+ - 256TiB
+ * - File Size, Block Maps
+ - 16GiB
+ - 256GiB
+ - 4TiB
+ - 256TiB
+
+For 64-bit filesystems, limits are as follows:
+
+.. list-table::
+ :widths: 1 1 1 1 1
+ :header-rows: 1
+
+ * - Item
+ - 1KiB
+ - 2KiB
+ - 4KiB
+ - 64KiB
+ * - Blocks
+ - 2^64
+ - 2^64
+ - 2^64
+ - 2^64
+ * - Inodes
+ - 2^32
+ - 2^32
+ - 2^32
+ - 2^32
+ * - File System Size
+ - 16ZiB
+ - 32ZiB
+ - 64ZiB
+ - 1YiB
+ * - Blocks Per Block Group
+ - 8,192
+ - 16,384
+ - 32,768
+ - 524,288
+ * - Inodes Per Block Group
+ - 8,192
+ - 16,384
+ - 32,768
+ - 524,288
+ * - Block Group Size
+ - 8MiB
+ - 32MiB
+ - 128MiB
+ - 32GiB
+ * - Blocks Per File, Extents
+ - 2^32
+ - 2^32
+ - 2^32
+ - 2^32
+ * - Blocks Per File, Block Maps
+ - 16,843,020
+ - 134,480,396
+ - 1,074,791,436
+ - 4,398,314,962,956 (really 2^32 due to field size limitations)
+ * - File Size, Extents
+ - 4TiB
+ - 8TiB
+ - 16TiB
+ - 256TiB
+ * - File Size, Block Maps
+ - 16GiB
+ - 256GiB
+ - 4TiB
+ - 256TiB
+
+.. note::
+ Files not using extents (i.e. files using block maps) must be
+ placed within the first 2^32 blocks of a filesystem. Files with extents
+ must be placed within the first 2^48 blocks of a filesystem. It's not
+ clear what happens with larger filesystems.
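+
+The figures in the two tables above follow from a few simple rules. The
+sketch below is illustrative only: the function names are hypothetical, and
+it assumes one bitmap block per block group (one bit per block) and 4-byte
+block numbers in indirect blocks.
+
```python
KIB = 1024


def block_limits(block_size, addr_bits=32):
    """Return (blocks per group, group size in bytes, filesystem size in bytes)."""
    # A one-block bitmap holds 8 bits per byte, one bit per block.
    blocks_per_group = 8 * block_size
    group_bytes = blocks_per_group * block_size
    # 2^32 addressable blocks, or 2^64 with the 64bit feature.
    fs_bytes = (1 << addr_bits) * block_size
    return blocks_per_group, group_bytes, fs_bytes


def block_map_blocks(block_size):
    """Blocks reachable via 12 direct slots plus single, double and triple indirection."""
    n = block_size // 4  # block numbers stored per indirect block
    return 12 + n + n**2 + n**3


if __name__ == "__main__":
    for bs in (1 * KIB, 2 * KIB, 4 * KIB, 64 * KIB):
        print(bs, block_limits(bs), block_map_blocks(bs))
```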
+
+Layout
+------
+
+The layout of a standard block group is approximately as follows (each
+of these fields is discussed in a separate section below):
+
+.. list-table::
+ :widths: 1 1 1 1 1 1 1 1
+ :header-rows: 1
+
+ * - Group 0 Padding
+ - ext4 Super Block
+ - Group Descriptors
+ - Reserved GDT Blocks
+ - Data Block Bitmap
+ - inode Bitmap
+ - inode Table
+ - Data Blocks
+ * - 1024 bytes
+ - 1 block
+ - many blocks
+ - many blocks
+ - 1 block
+ - 1 block
+ - many blocks
+ - many more blocks
+
+For the special case of block group 0, the first 1024 bytes are unused,
+to allow for the installation of x86 boot sectors and other oddities.
+The superblock will start at offset 1024 bytes, whichever block that
+happens to be (usually 0). However, if for some reason the block size =
+1024, then block 0 is marked in use and the superblock goes in block 1.
+For all other block groups, there is no padding.
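+
+The placement rule for the primary superblock can be sketched as follows.
+This is a minimal illustration; ``superblock_location`` is a hypothetical
+helper name, not a kernel or e2fsprogs function.
+
```python
SUPERBLOCK_OFFSET = 1024  # bytes from the start of the device


def superblock_location(block_size):
    """Return (block number, byte offset inside that block) of the primary superblock."""
    return divmod(SUPERBLOCK_OFFSET, block_size)
```
+
+With a 1024-byte block size this yields block 1 at offset 0; with any larger
+block size it yields block 0 at offset 1024.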
+
+The ext4 driver primarily works with the superblock and the group
+descriptors that are found in block group 0. Redundant copies of the
+superblock and group descriptors are written to some of the block groups
+across the disk in case the beginning of the disk gets trashed, though
+not all block groups necessarily host a redundant copy (see following
+paragraph for more details). If the group does not have a redundant
+copy, the block group begins with the data block bitmap. Note also that
+when the filesystem is freshly formatted, mkfs will allocate “reserve
+GDT block” space after the block group descriptors and before the start
+of the block bitmaps to allow for future expansion of the filesystem. By
+default, a filesystem is allowed to increase in size by a factor of
+1024x over the original filesystem size.
+
+The location of the inode table is given by ``grp.bg_inode_table_*``. It
+is a contiguous range of blocks large enough to contain
+``sb.s_inodes_per_group * sb.s_inode_size`` bytes.
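+
+As a rough sketch of that size calculation (the function name and the sample
+values are assumptions for illustration), the number of blocks occupied by
+one group's inode table is the byte count rounded up to whole blocks:
+
```python
def inode_table_blocks(inodes_per_group, inode_size, block_size):
    """Blocks needed to hold s_inodes_per_group * s_inode_size bytes."""
    table_bytes = inodes_per_group * inode_size
    return -(-table_bytes // block_size)  # ceiling division
```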
+
+As for the ordering of items in a block group, it is generally
+established that the super block and the group descriptor table, if
+present, will be at the beginning of the block group. The bitmaps and
+the inode table can be anywhere, and it is quite possible for the
+bitmaps to come after the inode table, or for both to be in different
+groups (flex_bg). Leftover space is used for file data blocks, indirect
+block maps, extent tree blocks, and extended attributes.
+
+Flexible Block Groups
+---------------------
+
+Starting in ext4, there is a new feature called flexible block groups
+(flex_bg). In a flex_bg, several block groups are tied together as one
+logical block group; the bitmap spaces and the inode table space in the
+first block group of the flex_bg are expanded to include the bitmaps
+and inode tables of all other block groups in the flex_bg. For example,
+if the flex_bg size is 4, then group 0 will contain (in order) the
+superblock, group descriptors, data block bitmaps for groups 0-3, inode
+bitmaps for groups 0-3, inode tables for groups 0-3, and the remaining
+space in group 0 is for file data. The effect of this is to group the
+block group metadata close together for faster loading, and to enable
+large files to be contiguous on disk. Backup copies of the superblock
+and group descriptors are always at the beginning of block groups, even
+if flex_bg is enabled. The number of block groups that make up a
+flex_bg is given by 2 ^ ``sb.s_log_groups_per_flex``.
+
+Meta Block Groups
+-----------------
+
+Without the META_BG option, for safety reasons, all block group
+descriptor copies are kept in the first block group. Given the default
+128MiB (2^27 bytes) block group size and 64-byte group descriptors, ext4
+can have at most 2^27/64 = 2^21 block groups. This limits the entire
+filesystem size to 2^21 * 2^27 = 2^48 bytes or 256TiB.
+
+The solution to this problem is to use the metablock group feature
+(META_BG), which is already in ext3 for all 2.6 releases. With the
+META_BG feature, ext4 filesystems are partitioned into many metablock
+groups. Each metablock group is a cluster of block groups whose group
+descriptor structures can be stored in a single disk block. For ext4
+filesystems with 4 KB block size, a single metablock group partition
+includes 64 block groups, or 8 GiB of disk space. The metablock group
+feature moves the location of the group descriptors from the congested
+first block group of the whole filesystem into the first group of each
+metablock group itself. The backups are in the second and last group of
+each metablock group. This increases the 2^21 maximum block groups limit
+to the hard limit 2^32, allowing support for a 512PiB filesystem.
+
+The change in the filesystem format replaces the current scheme where
+the superblock is followed by a variable-length set of block group
+descriptors. Instead, the superblock and a single block group descriptor
+block are placed at the beginning of the first, second, and last block
+groups in a meta-block group. A meta-block group is a collection of
+block groups which can be described by a single block group descriptor
+block. Since the size of the block group descriptor structure is 64
+bytes, a meta-block group contains 16 block groups for filesystems with
+a 1KB block size, and 64 block groups for filesystems with a 4KB
+blocksize. Filesystems can either be created using this new block group
+descriptor layout, or existing filesystems can be resized on-line, and
+the field s_first_meta_bg in the superblock will indicate the first
+block group using this new layout.
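+
+The limits quoted in this section can be re-derived with a short sketch. It
+assumes the default 128MiB block group and 64-byte descriptors mentioned
+above; the helper names are hypothetical.
+
```python
DESC_SIZE = 64          # bytes per group descriptor
GROUP_BYTES = 1 << 27   # default 128MiB block group


def max_groups_without_meta_bg():
    """All descriptors must fit inside the first block group."""
    return GROUP_BYTES // DESC_SIZE


def groups_per_meta_bg(block_size):
    """One descriptor block describes exactly one meta block group."""
    return block_size // DESC_SIZE


if __name__ == "__main__":
    groups = max_groups_without_meta_bg()
    print(groups, groups * GROUP_BYTES)            # group count, max bytes without META_BG
    print(groups_per_meta_bg(4096) * GROUP_BYTES)  # bytes covered by one meta block group
    print((1 << 32) * GROUP_BYTES)                 # bytes addressable with 2^32 groups
```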
+
+Please see an important note about ``BLOCK_UNINIT`` in the section about
+block and inode bitmaps.
+
+Lazy Block Group Initialization
+-------------------------------
+
+A new feature for ext4 is a set of three block group descriptor flags that
+enable mkfs to skip initializing other parts of the block group
+metadata. Specifically, the INODE_UNINIT and BLOCK_UNINIT flags mean
+that the inode and block bitmaps for that group can be calculated and
+therefore the on-disk bitmap blocks are not initialized. This is
+generally the case for an empty block group or a block group containing
+only fixed-location block group metadata. The INODE_ZEROED flag means
+that the inode table has been initialized; mkfs will unset this flag and
+rely on the kernel to initialize the inode tables in the background.
+
+By not writing zeroes to the bitmaps and inode table, mkfs time is
+reduced considerably. Note the feature flag is RO_COMPAT_GDT_CSUM,
+but the dumpe2fs output prints this as “uninit_bg”. They are the same
+thing.
+
+Special inodes
+--------------
+
+ext4 reserves some inodes for special features, as follows:
+
+.. list-table::
+ :widths: 6 70
+ :header-rows: 1
+
+ * - inode Number
+ - Purpose
+ * - 0
+ - Doesn't exist; there is no inode 0.
+ * - 1
+ - List of defective blocks.
+ * - 2
+ - Root directory.
+ * - 3
+ - User quota.
+ * - 4
+ - Group quota.
+ * - 5
+ - Boot loader.
+ * - 6
+ - Undelete directory.
+ * - 7
+ - Reserved group descriptors inode. (“resize inode”)
+ * - 8
+ - Journal inode.
+ * - 9
+ - The “exclude” inode, for snapshots(?)
+ * - 10
+ - Replica inode, used for some non-upstream feature?
+ * - 11
+ - Traditional first non-reserved inode. Usually this is the lost+found directory. See s_first_ino in the superblock.
+
+Note that there are also some inodes allocated from non-reserved inode numbers
+for other filesystem features which are not referenced from the standard
+directory hierarchy. These are generally referenced from the superblock. They are:
+
+.. list-table::
+ :widths: 20 50
+ :header-rows: 1
+
+ * - Superblock field
+ - Description
+
+ * - s_lpf_ino
+ - Inode number of lost+found directory.
+ * - s_prj_quota_inum
+ - Inode number of quota file tracking project quotas.
+ * - s_orphan_file_inum
+ - Inode number of file tracking orphan inodes.
+
+Block and Inode Allocation Policy
+---------------------------------
+
+ext4 recognizes (better than ext3, anyway) that data locality is
+generally a desirable quality of a filesystem. On a spinning disk,
+keeping related blocks near each other reduces the amount of movement
+that the head actuator and disk must perform to access a data block,
+thus speeding up disk IO. On an SSD there of course are no moving parts,
+but locality can increase the size of each transfer request while
+reducing the total number of requests. This locality may also have the
+effect of concentrating writes on a single erase block, which can speed
+up file rewrites significantly. Therefore, it is useful to reduce
+fragmentation whenever possible.
+
+The first tool that ext4 uses to combat fragmentation is the multi-block
+allocator. When a file is first created, the block allocator
+speculatively allocates 8KiB of disk space to the file on the assumption
+that the space will get written soon. When the file is closed, the
+unused speculative allocations are of course freed, but if the
+speculation is correct (typically the case for full writes of small
+files) then the file data gets written out in a single multi-block
+extent. A second related trick that ext4 uses is delayed allocation.
+Under this scheme, when a file needs more blocks to absorb file writes,
+the filesystem defers deciding the exact placement on the disk until all
+the dirty buffers are being written out to disk. By not committing to a
+particular placement until it's absolutely necessary (the commit timeout
+is hit, or sync() is called, or the kernel runs out of memory), the hope
+is that the filesystem can make better location decisions.
+
+The third trick that ext4 (and ext3) uses is that it tries to keep a
+file's data blocks in the same block group as its inode. This cuts down
+on the seek penalty when the filesystem first has to read a file's inode
+to learn where the file's data blocks live and then seek over to the
+file's data blocks to begin I/O operations.
+
+The fourth trick is that all the inodes in a directory are placed in the
+same block group as the directory, when feasible. The working assumption
+here is that all the files in a directory might be related, therefore it
+is useful to try to keep them all together.
+
+The fifth trick is that the disk volume is cut up into 128MB block
+groups; these mini-containers are used as outlined above to try to
+maintain data locality. However, there is a deliberate quirk -- when a
+directory is created in the root directory, the inode allocator scans
+the block groups and puts that directory into the least heavily loaded
+block group that it can find. This encourages directories to spread out
+over a disk; as the top-level directory/file blobs fill up one block
+group, the allocators simply move on to the next block group. Allegedly
+this scheme evens out the loading on the block groups, though the author
+suspects that the directories which are so unlucky as to land towards
+the end of a spinning drive get a raw deal performance-wise.
+
+Of course if all of these mechanisms fail, one can always use e4defrag
+to defragment files.
+
+Checksums
+---------
+
+Starting in early 2012, metadata checksums were added to all major ext4
+and jbd2 data structures. The associated feature flag is metadata_csum.
+The desired checksum algorithm is indicated in the superblock, though as
+of October 2012 the only supported algorithm is crc32c. Some data
+structures did not have space to fit a full 32-bit checksum, so only the
+lower 16 bits are stored. Enabling the 64bit feature increases the data
+structure size so that full 32-bit checksums can be stored for many data
+structures. However, existing 32-bit filesystems cannot be extended to
+enable 64bit mode, at least not without the experimental resize2fs
+patches to do so.
+
+Existing filesystems can have checksumming added by running
+``tune2fs -O metadata_csum`` against the underlying device. If tune2fs
+encounters directory blocks that lack sufficient empty space to add a
+checksum, it will request that you run ``e2fsck -D`` to have the
+directories rebuilt with checksums. This has the added benefit of
+removing slack space from the directory files and rebalancing the htree
+indexes. If you *ignore* this step, your directories will not be
+protected by a checksum!
+
+The following table describes the data elements that go into each type
+of checksum. The checksum function is whatever the superblock describes
+(crc32c as of October 2013) unless noted otherwise.
+
+.. list-table::
+ :widths: 20 8 50
+ :header-rows: 1
+
+ * - Metadata
+ - Length
+ - Ingredients
+ * - Superblock
+ - __le32
+ - The entire superblock up to the checksum field. The UUID lives inside
+ the superblock.
+ * - MMP
+ - __le32
+ - UUID + the entire MMP block up to the checksum field.
+ * - Extended Attributes
+ - __le32
+ - UUID + the entire extended attribute block. The checksum field is set to
+ zero.
+ * - Directory Entries
+ - __le32
+ - UUID + inode number + inode generation + the directory block up to the
+ fake entry enclosing the checksum field.
+ * - HTREE Nodes
+ - __le32
+ - UUID + inode number + inode generation + all valid extents + HTREE tail.
+ The checksum field is set to zero.
+ * - Extents
+ - __le32
+ - UUID + inode number + inode generation + the entire extent block up to
+ the checksum field.
+ * - Bitmaps
+ - __le32 or __le16
+ - UUID + the entire bitmap. Checksums are stored in the group descriptor,
+ and truncated if the group descriptor size is 32 bytes (i.e. ^64bit)
+ * - Inodes
+ - __le32
+ - UUID + inode number + inode generation + the entire inode. The checksum
+ field is set to zero. Each inode has its own checksum.
+ * - Group Descriptors
+ - __le16
+ - If metadata_csum, then UUID + group number + the entire descriptor;
+ else if gdt_csum, then crc16(UUID + group number + the entire
+ descriptor). In all cases, only the lower 16 bits are stored.
+
+Bigalloc
+--------
+
+At the moment, the default size of a block is 4KiB, which is a commonly
+supported page size on most MMU-capable hardware. This is fortunate, as
+ext4 code is not prepared to handle the case where the block size
+exceeds the page size. However, for a filesystem of mostly huge files,
+it is desirable to be able to allocate disk blocks in units of multiple
+blocks to reduce both fragmentation and metadata overhead. The
+bigalloc feature provides exactly this ability.
+
+The bigalloc feature (EXT4_FEATURE_RO_COMPAT_BIGALLOC) changes ext4 to
+use clustered allocation, so that each bit in the ext4 block allocation
+bitmap addresses a power of two number of blocks. For example, if the
+file system is mainly going to be storing large files in the 4-32
+megabyte range, it might make sense to set a cluster size of 1 megabyte.
+This means that each bit in the block allocation bitmap now addresses
+256 4k blocks. This shrinks the total size of the block allocation
+bitmaps for a 2T file system from 64 megabytes to 256 kilobytes. It also
+means that a block group addresses 32 gigabytes instead of 128 megabytes,
+also shrinking the amount of file system overhead for metadata.
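+
+The numbers in this example can be checked with a small sketch. It assumes
+one bitmap bit per allocation unit and a one-block bitmap per group; the
+function names are hypothetical.
+
```python
def bitmap_bytes(fs_bytes, alloc_unit_bytes):
    """Total size of all block bitmaps: one bit per allocation unit."""
    return fs_bytes // alloc_unit_bytes // 8


def group_bytes(block_size, alloc_unit_bytes):
    """Space covered by one group whose bitmap fills a single block."""
    return 8 * block_size * alloc_unit_bytes


if __name__ == "__main__":
    two_tib = 2 * 2**40
    print(bitmap_bytes(two_tib, 4096), bitmap_bytes(two_tib, 2**20))
    print(group_bytes(4096, 4096), group_bytes(4096, 2**20))
```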
+
+The administrator can set a block cluster size at mkfs time (which is
+stored in the s_log_cluster_size field in the superblock); from then
+on, the block bitmaps track clusters, not individual blocks. This means
+that block groups can be several gigabytes in size (instead of just
+128MiB); however, the minimum allocation unit becomes a cluster, not a
+block, even for directories. TaoBao had a patchset to extend the “use
+units of clusters instead of blocks” to the extent tree, though it is
+not clear where those patches went; they eventually morphed into
+“extent tree v2” but that code has not landed as of May 2015.
+
+Inline Data
+-----------
+
+The inline data feature was designed to handle the case that a file's
+data is so tiny that it readily fits inside the inode, which
+(theoretically) reduces disk block consumption and reduces seeks. If the
+file is smaller than 60 bytes, then the data are stored inline in
+``inode.i_block``. If the rest of the file would fit inside the extended
+attribute space, then it might be found as an extended attribute
+“system.data” within the inode body (“ibody EA”). This of course
+constrains the amount of extended attributes one can attach to an inode.
+If the data size increases beyond i_block + ibody EA, a regular block
+is allocated and the contents moved to that block.
+
+Pending a change to compact the extended attribute key used to store
+inline data, one ought to be able to store 160 bytes of data in a
+256-byte inode (as of June 2015, when i_extra_isize is 28). Prior to
+that, the limit was 156 bytes due to inefficient use of inode space.
+
+The inline data feature requires the presence of an extended attribute
+for “system.data”, even if the attribute value is zero length.
+
+Inline Directories
+~~~~~~~~~~~~~~~~~~
+
+The first four bytes of i_block are the inode number of the parent
+directory. Following that is a 56-byte space for an array of directory
+entries; see ``struct ext4_dir_entry``. If there is a “system.data”
+attribute in the inode body, the EA value is an array of
+``struct ext4_dir_entry`` as well. Note that for inline directories, the
+i_block and EA space are treated as separate dirent blocks; directory
+entries cannot span the two.
+
+Inline directory entries are not checksummed, as the inode checksum
+should protect all inline data contents.
+
+Large Extended Attribute Values
+-------------------------------
+
+To enable ext4 to store extended attribute values that do not fit in the
+inode or in the single extended attribute block attached to an inode,
+the EA_INODE feature allows us to store the value in the data blocks of
+a regular file inode. This “EA inode” is linked only from the extended
+attribute name index and must not appear in a directory entry. The
+inode's i_atime field is used to store a checksum of the xattr value;
+and i_ctime/i_version store a 64-bit reference count, which enables
+sharing of large xattr values between multiple owning inodes. For
+backward compatibility with older versions of this feature, the
+i_mtime/i_generation *may* store a back-reference to the inode number
+and i_generation of the **one** owning inode (in cases where the EA
+inode is not referenced by multiple inodes) to verify that the EA inode
+is the correct one being accessed.
+
+Verity files
+------------
+
+ext4 supports fs-verity, which is a filesystem feature that provides
+Merkle tree based hashing for individual readonly files. Most of
+fs-verity is common to all filesystems that support it; see
+:ref:`Documentation/filesystems/fsverity.rst <fsverity>` for the
+fs-verity documentation. However, the on-disk layout of the verity
+metadata is filesystem-specific. On ext4, the verity metadata is
+stored after the end of the file data itself, in the following format:
+
+- Zero-padding to the next 65536-byte boundary. This padding need not
+ actually be allocated on-disk, i.e. it may be a hole.
+
+- The Merkle tree, as documented in
+ :ref:`Documentation/filesystems/fsverity.rst
+ <fsverity_merkle_tree>`, with the tree levels stored in order from
+ root to leaf, and the tree blocks within each level stored in their
+ natural order.
+
+- Zero-padding to the next filesystem block boundary.
+
+- The verity descriptor, as documented in
+ :ref:`Documentation/filesystems/fsverity.rst <fsverity_descriptor>`,
+ with optionally appended signature blob.
+
+- Zero-padding to the next offset that is 4 bytes before a filesystem
+ block boundary.
+
+- The size of the verity descriptor in bytes, as a 4-byte little
+ endian integer.
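+
+The layout above can be turned into offsets with a short sketch. The Merkle
+tree size and the descriptor size (including any appended signature) are
+taken as inputs rather than computed, and all names here are hypothetical.
+
```python
def align_up(value, alignment):
    """Round value up to the next multiple of alignment."""
    return (value + alignment - 1) // alignment * alignment


def verity_layout(file_size, tree_size, desc_size, block_size=4096):
    """Return byte offsets of (Merkle tree, descriptor, 4-byte descriptor size)."""
    tree_off = align_up(file_size, 65536)
    desc_off = align_up(tree_off + tree_size, block_size)
    # The size field sits in the last 4 bytes of a filesystem block.
    size_off = align_up(desc_off + desc_size + 4, block_size) - 4
    return tree_off, desc_off, size_off
```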
+
+Verity inodes have EXT4_VERITY_FL set, and they must use extents, i.e.
+EXT4_EXTENTS_FL must be set and EXT4_INLINE_DATA_FL must be clear.
+They can have EXT4_ENCRYPT_FL set, in which case the verity metadata
+is encrypted as well as the data itself.
+
+Verity files cannot have blocks allocated past the end of the verity
+metadata.
+
+Verity and DAX are not compatible and attempts to set both of these flags
+on a file will fail.
+
+Atomic Block Writes
+-------------------
+
+Introduction
+~~~~~~~~~~~~
+
+Atomic (untorn) block writes ensure that either the entire write is committed
+to disk or none of it is. This prevents "torn writes" during power loss or
+system crashes. The ext4 filesystem supports atomic writes (only with Direct
+I/O) on regular files with extents, provided the underlying storage device
+supports hardware atomic writes. This is supported in the following two ways:
+
+1. **Single-fsblock Atomic Writes**:
+ ext4 supports atomic write operations with a single filesystem block since
+ v6.13. In this case, the atomic write unit minimum and maximum sizes are
+ both set to the filesystem block size. For example, an atomic write of
+ 16KB with a 16KB filesystem block size is possible on a system with a
+ 64KB page size.
+
+2. **Multi-fsblock Atomic Writes with Bigalloc**:
+ ext4 now also supports atomic writes spanning multiple filesystem blocks
+ using a feature known as bigalloc. The atomic write unit's minimum and
+ maximum sizes are determined by the filesystem block size and cluster size,
+ based on the underlying device’s supported atomic write unit limits.
+
+Requirements
+~~~~~~~~~~~~
+
+Basic requirements for atomic writes in ext4:
+
+ 1. The extents feature must be enabled (default for ext4)
+ 2. The underlying block device must support atomic writes
+ 3. For single-fsblock atomic writes:
+
+ 1. A filesystem with appropriate block size (up to the page size)
+ 4. For multi-fsblock atomic writes:
+
+ 1. The bigalloc feature must be enabled
+ 2. The cluster size must be appropriately configured
+
+.. note::
+ ext4 does not support software or COW based atomic writes, which means
+ atomic writes on ext4 are only supported if the underlying storage device
+ supports them.
+
+Multi-fsblock Implementation Details
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The bigalloc feature changes ext4 to allocate in units of multiple filesystem
+blocks, also known as clusters. With bigalloc, each bit within the block bitmap
+represents a cluster (a power of 2 number of blocks) rather than an individual
+filesystem block.
+
+ext4 supports multi-fsblock atomic writes with bigalloc, subject to the
+following constraints. The minimum atomic write size is the larger of the fs
+block size and the minimum hardware atomic write unit; and the maximum atomic
+write size is the smaller of the bigalloc cluster size and the maximum hardware
+atomic write unit. Bigalloc ensures that all allocations are aligned to the
+cluster size, which satisfies the LBA alignment requirements of the hardware
+device if the start of the partition/logical volume is itself aligned correctly.
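+
+The min/max rule in the paragraph above can be sketched as follows. This is
+not the kernel's code; the function name is hypothetical, and returning
+``None`` when the resulting range is empty is an assumption made only for
+this illustration.
+
```python
def atomic_write_limits(fs_block, cluster, hw_min, hw_max):
    """Return (min, max) atomic write size in bytes, or None if the range is empty."""
    lo = max(fs_block, hw_min)   # larger of fs block size and hardware minimum
    hi = min(cluster, hw_max)    # smaller of cluster size and hardware maximum
    return (lo, hi) if lo <= hi else None
```
+
+For a 4KiB block size, a 64KiB cluster size and a device advertising 512-byte
+to 16KiB atomic write units, this gives a usable range of 4KiB to 16KiB.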
+
+Here is the block allocation strategy in bigalloc for atomic writes:
+
+ * For regions with fully mapped extents, no additional work is needed
+ * For append writes, a new mapped extent is allocated
+ * For regions that are entirely holes, an unwritten extent is created
+ * For large unwritten extents, the extent gets split into two unwritten
+ extents of appropriate requested size
+ * For mixed mapping regions (combinations of holes, unwritten extents, or
+ mapped extents), ext4_map_blocks() is called in a loop with
+ EXT4_GET_BLOCKS_ZERO flag to convert the region into a single contiguous
+ mapped extent by writing zeroes to it and converting any unwritten extents to
+ written, if found within the range.
+
+.. note::
+ Writing on a single contiguous underlying extent, whether mapped or
+ unwritten, is not inherently problematic. However, writing to a mixed mapping
+ region (i.e. one containing a combination of mapped and unwritten extents)
+ must be avoided when performing atomic writes.
+
+The reason is that atomic writes, when issued via pwritev2() with the RWF_ATOMIC
+flag, require that either all data is written or none at all. In the event of
+a system crash or unexpected power loss during the write operation, the affected
+region (when later read) must reflect either the complete old data or the
+complete new data, but never a mix of both.
+
+To enforce this guarantee, we ensure that the write target is backed by
+a single, contiguous extent before any data is written. This is critical because
+ext4 defers the conversion of unwritten extents to written extents until the I/O
+completion path (typically in ->end_io()). If a write is allowed to proceed over
+a mixed mapping region (with mapped and unwritten extents) and a failure occurs
+mid-write, the system could observe partially updated regions after reboot, i.e.
+new data over mapped areas, and stale (old) data over unwritten extents that
+were never marked written. This violates the atomicity and/or torn write
+prevention guarantee.
+
+To prevent such torn writes, ext4 proactively allocates a single contiguous
+extent for the entire requested region in ``ext4_iomap_alloc()`` via
+``ext4_map_blocks_atomic()``. ext4 also forces a commit of the current journal
+transaction if the allocation is done over a mixed mapping. This ensures any
+pending metadata updates (like unwritten to written extents conversion) in this
+range are in a consistent state with the file data blocks, before performing the
+actual write I/O. If the commit fails, the whole I/O must be aborted to prevent
+any possible torn writes.
+Only after this step, the actual data write operation is performed by the iomap.
+
+Handling Split Extents Across Leaf Blocks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+There can be a special edge case where we have logically and physically
+contiguous extents stored in separate leaf nodes of the on-disk extent tree.
+This occurs because on-disk extent tree merges only happen within the leaf
+blocks except for a case where we have 2-level tree which can get merged and
+collapsed entirely into the inode.
+If such a layout exists and, in the worst case, the extent status cache entries
+are reclaimed due to memory pressure, ``ext4_map_blocks()`` may never return
+a single contiguous extent for these split leaf extents.
+
+To address this edge case, a new get block flag
+``EXT4_GET_BLOCKS_QUERY_LEAF_BLOCKS`` is added to enhance the
+``ext4_map_query_blocks()`` lookup behavior.
+
+This new get block flag allows ``ext4_map_blocks()`` to first check if there is
+an entry in the extent status cache for the full range.
+If not present, it consults the on-disk extent tree using
+``ext4_map_query_blocks()``.
+If the located extent is at the end of a leaf node, it probes the next logical
+block (lblk) to detect a contiguous extent in the adjacent leaf.
+
+For now only one additional leaf block is queried to maintain efficiency, as
+atomic writes are typically constrained to small sizes
+(e.g. [blocksize, clustersize]).
+
+
+Handling Journal Transactions
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To support multi-fsblock atomic writes, we ensure enough journal credits are
+reserved during:
+
+ 1. Block allocation time in ``ext4_iomap_alloc()``. We first query if there
+ could be a mixed mapping for the underlying requested range. If yes, then we
+ reserve credits of up to ``m_len``, assuming every alternate block can be
+ an unwritten extent followed by a hole.
+
+ 2. The ``->end_io()`` call, where we make sure a single transaction is started
+ for the unwritten-to-written conversion. The conversion loop is mainly needed
+ only to handle a split extent across leaf blocks.
+
+How to
+~~~~~~
+
+Creating Filesystems with Atomic Write Support
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+First check the atomic write units supported by the block device.
+See :ref:`atomic_write_bdev_support` for more details.
+
+For single-fsblock atomic writes with a larger block size
+(on systems with block size < page size):
+
+.. code-block:: bash
+
+ # Create an ext4 filesystem with a 16KB block size
+ # (requires page size >= 16KB)
+ mkfs.ext4 -b 16384 /dev/device
+
+For multi-fsblock atomic writes with bigalloc:
+
+.. code-block:: bash
+
+ # Create an ext4 filesystem with bigalloc and 64KB cluster size
+ mkfs.ext4 -F -O bigalloc -b 4096 -C 65536 /dev/device
+
+Where ``-b`` specifies the block size, ``-C`` specifies the cluster size in bytes,
+and ``-O bigalloc`` enables the bigalloc feature.
+
+Application Interface
+^^^^^^^^^^^^^^^^^^^^^
+
+Applications can use the ``pwritev2()`` system call with the ``RWF_ATOMIC`` flag
+to perform atomic writes:
+
+.. code-block:: c
+
+ pwritev2(fd, iov, iovcnt, offset, RWF_ATOMIC);
+
+The write must be aligned to the filesystem's block size and not exceed the
+filesystem's maximum atomic write unit size.
+See ``generic_atomic_write_valid()`` for more details.
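+
+As a rough illustration, the userspace sketch below pre-checks a request before
+issuing it. The helper names ``atomic_write_args_ok()`` and ``atomic_pwrite()``
+are hypothetical, and the power-of-two length and natural-alignment rules are
+an assumption based on what ``generic_atomic_write_valid()`` is understood to
+enforce. The ``unit_min`` and ``unit_max`` arguments are expected to come from
+``statx()``, and the file descriptor is assumed to be opened with ``O_DIRECT``.
+
+.. code-block:: c
+
+   #define _GNU_SOURCE
+   #include <stdbool.h>
+   #include <stdint.h>
+   #include <sys/types.h>
+   #include <sys/uio.h>
+
+   #ifndef RWF_ATOMIC
+   #define RWF_ATOMIC 0x00000040 /* fallback for older libc headers */
+   #endif
+
+   /*
+    * Hypothetical pre-check (assumed rules): the length is a power of
+    * two within [unit_min, unit_max] and the offset is a multiple of
+    * the length.
+    */
+   bool atomic_write_args_ok(uint64_t off, uint64_t len,
+                             uint64_t unit_min, uint64_t unit_max)
+   {
+           if (len == 0 || (len & (len - 1)) != 0)
+                   return false;
+           if (len < unit_min || len > unit_max)
+                   return false;
+           return (off & (len - 1)) == 0;
+   }
+
+   /* Issue one atomic write from a single buffer (one iovec segment). */
+   ssize_t atomic_pwrite(int fd, void *buf, size_t len, off_t off)
+   {
+           struct iovec iov = { .iov_base = buf, .iov_len = len };
+
+           return pwritev2(fd, &iov, 1, off, RWF_ATOMIC);
+   }
+
+A caller would typically run the pre-check first and fall back to a regular
+write path (without any atomicity guarantee) when it returns false.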
+
+The ``statx()`` system call with the ``STATX_WRITE_ATOMIC`` flag provides the
+following details:
+
+ * ``stx_atomic_write_unit_min``: Minimum size of an atomic write request.
+ * ``stx_atomic_write_unit_max``: Maximum size of an atomic write request.
+ * ``stx_atomic_write_segments_max``: Upper limit for segments, i.e. the number
+ of separate memory buffers that can be gathered into a write operation
+ (e.g., the ``iovcnt`` parameter for ``IOV_ITER``). Currently, this is always
+ set to one.
+
+The ``STATX_ATTR_WRITE_ATOMIC`` flag in the ``stx_attributes`` field of
+``struct statx`` is set if atomic writes are supported.
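+
+A minimal sketch of such a query is shown below. The function name
+``print_atomic_write_limits()`` is hypothetical, and the preprocessor guard is
+only an assumption to let the example build against headers that predate
+``STATX_WRITE_ATOMIC``; in that case the helper simply reports an error.
+
+.. code-block:: c
+
+   #define _GNU_SOURCE
+   #include <fcntl.h>
+   #include <stdio.h>
+   #include <sys/stat.h>
+
+   /*
+    * Hypothetical helper: returns 1 and prints the limits if atomic
+    * writes are supported on path, 0 if they are not, and -1 on error
+    * or when the build headers lack STATX_WRITE_ATOMIC.
+    */
+   int print_atomic_write_limits(const char *path)
+   {
+   #if defined(STATX_WRITE_ATOMIC) && defined(STATX_ATTR_WRITE_ATOMIC)
+           struct statx stx;
+
+           if (statx(AT_FDCWD, path, 0, STATX_WRITE_ATOMIC, &stx) != 0)
+                   return -1;
+           if (!(stx.stx_mask & STATX_WRITE_ATOMIC) ||
+               !(stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC))
+                   return 0;
+           printf("unit_min=%u unit_max=%u segments_max=%u\n",
+                  (unsigned int)stx.stx_atomic_write_unit_min,
+                  (unsigned int)stx.stx_atomic_write_unit_max,
+                  (unsigned int)stx.stx_atomic_write_segments_max);
+           return 1;
+   #else
+           (void)path;
+           return -1;
+   #endif
+   }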
+
+.. _atomic_write_bdev_support:
+
+Hardware Support
+~~~~~~~~~~~~~~~~
+
+The underlying storage device must support atomic write operations.
+Modern NVMe and SCSI devices often provide this capability.
+The Linux kernel exposes this information through sysfs:
+
+* ``/sys/block/<device>/queue/atomic_write_unit_min`` - Minimum atomic write size
+* ``/sys/block/<device>/queue/atomic_write_unit_max`` - Maximum atomic write size
+
+Nonzero values for these attributes indicate that the device supports
+atomic writes.
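+
+The check can also be done programmatically. The sketch below reads both
+attributes and treats nonzero values as support; the helper names
+``read_queue_attr()`` and ``bdev_supports_atomic_writes()`` are hypothetical
+and the disk name (for example ``nvme0n1``) is supplied by the caller.
+
+.. code-block:: c
+
+   #include <stdio.h>
+
+   /* Hypothetical helper: read one numeric sysfs attribute. */
+   int read_queue_attr(const char *path, unsigned long long *value)
+   {
+           FILE *f = fopen(path, "r");
+           int ok;
+
+           if (!f)
+                   return -1;
+           ok = (fscanf(f, "%llu", value) == 1);
+           fclose(f);
+           return ok ? 0 : -1;
+   }
+
+   /* Returns 1 if both limits are nonzero, 0 if not, -1 on read error. */
+   int bdev_supports_atomic_writes(const char *disk)
+   {
+           char path[256];
+           unsigned long long unit_min, unit_max;
+
+           snprintf(path, sizeof(path),
+                    "/sys/block/%s/queue/atomic_write_unit_min", disk);
+           if (read_queue_attr(path, &unit_min) != 0)
+                   return -1;
+           snprintf(path, sizeof(path),
+                    "/sys/block/%s/queue/atomic_write_unit_max", disk);
+           if (read_queue_attr(path, &unit_max) != 0)
+                   return -1;
+           return unit_min != 0 && unit_max != 0;
+   }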
+
+See Also
+~~~~~~~~
+
+* Support for atomic block writes in 6.13:
+ https://lwn.net/Articles/1009298/
diff --git a/Documentation/filesystems/ext4/special_inodes.rst b/Documentation/filesystems/ext4/special_inodes.rst
deleted file mode 100644
index fc0636901fa0e1..00000000000000
--- a/Documentation/filesystems/ext4/special_inodes.rst
+++ /dev/null
@@ -1,55 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Special inodes
---------------
-
-ext4 reserves some inode for special features, as follows:
-
-.. list-table::
- :widths: 6 70
- :header-rows: 1
-
- * - inode Number
- - Purpose
- * - 0
- - Doesn't exist; there is no inode 0.
- * - 1
- - List of defective blocks.
- * - 2
- - Root directory.
- * - 3
- - User quota.
- * - 4
- - Group quota.
- * - 5
- - Boot loader.
- * - 6
- - Undelete directory.
- * - 7
- - Reserved group descriptors inode. (“resize inode”)
- * - 8
- - Journal inode.
- * - 9
- - The “exclude” inode, for snapshots(?)
- * - 10
- - Replica inode, used for some non-upstream feature?
- * - 11
- - Traditional first non-reserved inode. Usually this is the lost+found directory. See s_first_ino in the superblock.
-
-Note that there are also some inodes allocated from non-reserved inode numbers
-for other filesystem features which are not referenced from standard directory
-hierarchy. These are generally reference from the superblock. They are:
-
-.. list-table::
- :widths: 20 50
- :header-rows: 1
-
- * - Superblock field
- - Description
-
- * - s_lpf_ino
- - Inode number of lost+found directory.
- * - s_prj_quota_inum
- - Inode number of quota file tracking project quotas
- * - s_orphan_file_inum
- - Inode number of file tracking orphan inodes.
diff --git a/Documentation/filesystems/ext4/verity.rst b/Documentation/filesystems/ext4/verity.rst
deleted file mode 100644
index e99ff3fd09f7e7..00000000000000
--- a/Documentation/filesystems/ext4/verity.rst
+++ /dev/null
@@ -1,44 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-Verity files
-------------
-
-ext4 supports fs-verity, which is a filesystem feature that provides
-Merkle tree based hashing for individual readonly files. Most of
-fs-verity is common to all filesystems that support it; see
-:ref:`Documentation/filesystems/fsverity.rst <fsverity>` for the
-fs-verity documentation. However, the on-disk layout of the verity
-metadata is filesystem-specific. On ext4, the verity metadata is
-stored after the end of the file data itself, in the following format:
-
-- Zero-padding to the next 65536-byte boundary. This padding need not
- actually be allocated on-disk, i.e. it may be a hole.
-
-- The Merkle tree, as documented in
- :ref:`Documentation/filesystems/fsverity.rst
- <fsverity_merkle_tree>`, with the tree levels stored in order from
- root to leaf, and the tree blocks within each level stored in their
- natural order.
-
-- Zero-padding to the next filesystem block boundary.
-
-- The verity descriptor, as documented in
- :ref:`Documentation/filesystems/fsverity.rst <fsverity_descriptor>`,
- with optionally appended signature blob.
-
-- Zero-padding to the next offset that is 4 bytes before a filesystem
- block boundary.
-
-- The size of the verity descriptor in bytes, as a 4-byte little
- endian integer.
-
-Verity inodes have EXT4_VERITY_FL set, and they must use extents, i.e.
-EXT4_EXTENTS_FL must be set and EXT4_INLINE_DATA_FL must be clear.
-They can have EXT4_ENCRYPT_FL set, in which case the verity metadata
-is encrypted as well as the data itself.
-
-Verity files cannot have blocks allocated past the end of the verity
-metadata.
-
-Verity and DAX are not compatible and attempts to set both of these flags
-on a file will fail.
--
An old man doll... just what I always wanted! - Clara