[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <153124308013.17949.7356126977659638635.stgit@magnolia>
Date: Tue, 10 Jul 2018 10:18:00 -0700
From: "Darrick J. Wong" <darrick.wong@...cle.com>
To: tytso@....edu, darrick.wong@...cle.com
Cc: linux-ext4@...r.kernel.org
Subject: [PATCH 12/13] ext4: import directory layout chapter from wiki page
From: Darrick J. Wong <darrick.wong@...cle.com>
Import the chapter about directory layout from the on-disk format wiki
page into the kernel documentation.
Signed-off-by: Darrick J. Wong <darrick.wong@...cle.com>
---
.../filesystems/ext4/ondisk/directory.rst | 426 ++++++++++++++++++++
Documentation/filesystems/ext4/ondisk/dynamic.rst | 1
2 files changed, 427 insertions(+)
create mode 100644 Documentation/filesystems/ext4/ondisk/directory.rst
diff --git a/Documentation/filesystems/ext4/ondisk/directory.rst b/Documentation/filesystems/ext4/ondisk/directory.rst
new file mode 100644
index 000000000000..8fcba68c2884
--- /dev/null
+++ b/Documentation/filesystems/ext4/ondisk/directory.rst
@@ -0,0 +1,426 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Directory Entries
+-----------------
+
+In an ext4 filesystem, a directory is more or less a flat file that maps
+an arbitrary byte string (usually ASCII) to an inode number on the
+filesystem. There can be many directory entries across the filesystem
+that reference the same inode number--these are known as hard links, and
+that is why hard links cannot reference files on other filesystems. As
+such, directory entries are found by reading the data block(s)
+associated with a directory file for the particular directory entry that
+is desired.
+
+Linear (Classic) Directories
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+By default, each directory lists its entries in an “almost-linear”
+array. I write “almost” because it's not a linear array in the memory
+sense because directory entries are not split across filesystem blocks.
+Therefore, it is more accurate to say that a directory is a series of
+data blocks and that each block contains a linear array of directory
+entries. The end of each per-block array is signified by reaching the
+end of the block; the last entry in the block has a record length that
+takes it all the way to the end of the block. The end of the entire
+directory is of course signified by reaching the end of the file. Unused
+directory entries are signified by inode = 0. By default the filesystem
+uses ``struct ext4_dir_entry_2`` for directory entries unless the
+“filetype” feature flag is not set, in which case it uses
+``struct ext4_dir_entry``.
+
+The original directory entry format is ``struct ext4_dir_entry``, which
+is at most 263 bytes long, though on disk you'll need to reference
+``dirent.rec_len`` to know for sure.
+
+.. list-table::
+ :widths: 1 1 1 77
+ :header-rows: 1
+
+ * - Offset
+ - Size
+ - Name
+ - Description
+ * - 0x0
+ - \_\_le32
+ - inode
+ - Number of the inode that this directory entry points to.
+ * - 0x4
+ - \_\_le16
+ - rec\_len
+ - Length of this directory entry. Must be a multiple of 4.
+ * - 0x6
+ - \_\_le16
+ - name\_len
+ - Length of the file name.
+ * - 0x8
+ - char
+ - name[EXT4\_NAME\_LEN]
+ - File name.
+
+Since file names cannot be longer than 255 bytes, the new directory
+entry format shortens the rec\_len field and uses the space for a file
+type flag, probably to avoid having to load every inode during directory
+tree traversal. This format is ``ext4_dir_entry_2``, which is at most
+263 bytes long, though on disk you'll need to reference
+``dirent.rec_len`` to know for sure.
+
+.. list-table::
+ :widths: 1 1 1 77
+ :header-rows: 1
+
+ * - Offset
+ - Size
+ - Name
+ - Description
+ * - 0x0
+ - \_\_le32
+ - inode
+ - Number of the inode that this directory entry points to.
+ * - 0x4
+ - \_\_le16
+ - rec\_len
+ - Length of this directory entry.
+ * - 0x6
+ - \_\_u8
+ - name\_len
+ - Length of the file name.
+ * - 0x7
+ - \_\_u8
+ - file\_type
+ - File type code, see ftype_ table below.
+ * - 0x8
+ - char
+ - name[EXT4\_NAME\_LEN]
+ - File name.
+
+.. _ftype:
+
+The directory file type is one of the following values:
+
+.. list-table::
+ :widths: 1 79
+ :header-rows: 1
+
+ * - Value
+ - Description
+ * - 0x0
+ - Unknown.
+ * - 0x1
+ - Regular file.
+ * - 0x2
+ - Directory.
+ * - 0x3
+ - Character device file.
+ * - 0x4
+ - Block device file.
+ * - 0x5
+ - FIFO.
+ * - 0x6
+ - Socket.
+ * - 0x7
+ - Symbolic link.
+
+In order to add checksums to these classic directory blocks, a phony
+``struct ext4_dir_entry`` is placed at the end of each leaf block to
+hold the checksum. The directory entry is 12 bytes long. The inode
+number and name\_len fields are set to zero to fool old software into
+ignoring an apparently empty directory entry, and the checksum is stored
+in the place where the name normally goes. The structure is
+``struct ext4_dir_entry_tail``:
+
+.. list-table::
+ :widths: 1 1 1 77
+ :header-rows: 1
+
+ * - Offset
+ - Size
+ - Name
+ - Description
+ * - 0x0
+ - \_\_le32
+ - det\_reserved\_zero1
+ - Inode number, which must be zero.
+ * - 0x4
+ - \_\_le16
+ - det\_rec\_len
+ - Length of this directory entry, which must be 12.
+ * - 0x6
+ - \_\_u8
+ - det\_reserved\_zero2
+ - Length of the file name, which must be zero.
+ * - 0x7
+ - \_\_u8
+ - det\_reserved\_ft
+ - File type, which must be 0xDE.
+ * - 0x8
+ - \_\_le32
+ - det\_checksum
+ - Directory leaf block checksum.
+
+The leaf directory block checksum is calculated against the FS UUID, the
+directory's inode number, the directory's inode generation number, and
+the entire directory entry block up to (but not including) the fake
+directory entry.
+
+Hash Tree Directories
+~~~~~~~~~~~~~~~~~~~~~
+
+A linear array of directory entries isn't great for performance, so a
+new feature was added to ext3 to provide a faster (but peculiar)
+balanced tree keyed off a hash of the directory entry name. If the
+EXT4\_INDEX\_FL (0x1000) flag is set in the inode, this directory uses a
+hashed btree (htree) to organize and find directory entries. For
+backwards read-only compatibility with ext2, this tree is actually
+hidden inside the directory file, masquerading as “empty” directory data
+blocks! It was stated previously that the end of the linear directory
+entry table was signified with an entry pointing to inode 0; this is
+(ab)used to fool the old linear-scan algorithm into thinking that the
+rest of the directory block is empty so that it moves on.
+
+The root of the tree always lives in the first data block of the
+directory. By ext2 custom, the '.' and '..' entries must appear at the
+beginning of this first block, so they are put here as two
+``struct ext4_dir_entry_2``\ s and not stored in the tree. The rest of
+the root node contains metadata about the tree and finally a hash->block
+map to find nodes that are lower in the htree. If
+``dx_root.info.indirect_levels`` is non-zero then the htree has two
+levels; the data block pointed to by the root node's map is an interior
+node, which is indexed by a minor hash. Interior nodes in this tree
+contains a zeroed out ``struct ext4_dir_entry_2`` followed by a
+minor\_hash->block map to find leafe nodes. Leaf nodes contain a linear
+array of all ``struct ext4_dir_entry_2``; all of these entries
+(presumably) hash to the same value. If there is an overflow, the
+entries simply overflow into the next leaf node, and the
+least-significant bit of the hash (in the interior node map) that gets
+us to this next leaf node is set.
+
+To traverse the directory as a htree, the code calculates the hash of
+the desired file name and uses it to find the corresponding block
+number. If the tree is flat, the block is a linear array of directory
+entries that can be searched; otherwise, the minor hash of the file name
+is computed and used against this second block to find the corresponding
+third block number. That third block number will be a linear array of
+directory entries.
+
+To traverse the directory as a linear array (such as the old code does),
+the code simply reads every data block in the directory. The blocks used
+for the htree will appear to have no entries (aside from '.' and '..')
+and so only the leaf nodes will appear to have any interesting content.
+
+The root of the htree is in ``struct dx_root``, which is the full length
+of a data block:
+
+.. list-table::
+ :widths: 1 1 1 77
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Description
+ * - 0x0
+ - \_\_le32
+ - dot.inode
+ - inode number of this directory.
+ * - 0x4
+ - \_\_le16
+ - dot.rec\_len
+ - Length of this record, 12.
+ * - 0x6
+ - u8
+ - dot.name\_len
+ - Length of the name, 1.
+ * - 0x7
+ - u8
+ - dot.file\_type
+ - File type of this entry, 0x2 (directory) (if the feature flag is set).
+ * - 0x8
+ - char
+ - dot.name[4]
+ - “.\\0\\0\\0”
+ * - 0xC
+ - \_\_le32
+ - dotdot.inode
+ - inode number of parent directory.
+ * - 0x10
+ - \_\_le16
+ - dotdot.rec\_len
+ - block\_size - 12. The record length is long enough to cover all htree
+ data.
+ * - 0x12
+ - u8
+ - dotdot.name\_len
+ - Length of the name, 2.
+ * - 0x13
+ - u8
+ - dotdot.file\_type
+ - File type of this entry, 0x2 (directory) (if the feature flag is set).
+ * - 0x14
+ - char
+ - dotdot\_name[4]
+ - “..\\0\\0”
+ * - 0x18
+ - \_\_le32
+ - struct dx\_root\_info.reserved\_zero
+ - Zero.
+ * - 0x1C
+ - u8
+ - struct dx\_root\_info.hash\_version
+ - Hash type, see dirhash_ table below.
+ * - 0x1D
+ - u8
+ - struct dx\_root\_info.info\_length
+ - Length of the tree information, 0x8.
+ * - 0x1E
+ - u8
+ - struct dx\_root\_info.indirect\_levels
+ - Depth of the htree. Cannot be larger than 3 if the INCOMPAT\_LARGEDIR
+ feature is set; cannot be larger than 2 otherwise.
+ * - 0x1F
+ - u8
+ - struct dx\_root\_info.unused\_flags
+ -
+ * - 0x20
+ - \_\_le16
+ - limit
+ - Maximum number of dx\_entries that can follow this header, plus 1 for
+ the header itself.
+ * - 0x22
+ - \_\_le16
+ - count
+ - Actual number of dx\_entries that follow this header, plus 1 for the
+ header itself.
+ * - 0x24
+ - \_\_le32
+ - block
+ - The block number (within the directory file) that goes with hash=0.
+ * - 0x28
+ - struct dx\_entry
+ - entries[0]
+ - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block.
+
+.. _dirhash:
+
+The directory hash is one of the following values:
+
+.. list-table::
+ :widths: 1 79
+ :header-rows: 1
+
+ * - Value
+ - Description
+ * - 0x0
+ - Legacy.
+ * - 0x1
+ - Half MD4.
+ * - 0x2
+ - Tea.
+ * - 0x3
+ - Legacy, unsigned.
+ * - 0x4
+ - Half MD4, unsigned.
+ * - 0x5
+ - Tea, unsigned.
+
+Interior nodes of an htree are recorded as ``struct dx_node``, which is
+also the full length of a data block:
+
+.. list-table::
+ :widths: 1 1 1 77
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Description
+ * - 0x0
+ - \_\_le32
+ - fake.inode
+ - Zero, to make it look like this entry is not in use.
+ * - 0x4
+ - \_\_le16
+ - fake.rec\_len
+ - The size of the block, in order to hide all of the dx\_node data.
+ * - 0x6
+ - u8
+ - name\_len
+ - Zero. There is no name for this “unused” directory entry.
+ * - 0x7
+ - u8
+ - file\_type
+ - Zero. There is no file type for this “unused” directory entry.
+ * - 0x8
+ - \_\_le16
+ - limit
+ - Maximum number of dx\_entries that can follow this header, plus 1 for
+ the header itself.
+ * - 0xA
+ - \_\_le16
+ - count
+ - Actual number of dx\_entries that follow this header, plus 1 for the
+ header itself.
+ * - 0xE
+ - \_\_le32
+ - block
+ - The block number (within the directory file) that goes with the lowest
+ hash value of this block. This value is stored in the parent block.
+ * - 0x12
+ - struct dx\_entry
+ - entries[0]
+ - As many 8-byte ``struct dx_entry`` as fits in the rest of the data block.
+
+The hash maps that exist in both ``struct dx_root`` and
+``struct dx_node`` are recorded as ``struct dx_entry``, which is 8 bytes
+long:
+
+.. list-table::
+ :widths: 1 1 1 77
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Description
+ * - 0x0
+ - \_\_le32
+ - hash
+ - Hash code.
+ * - 0x4
+ - \_\_le32
+ - block
+ - Block number (within the directory file, not filesystem blocks) of the
+ next node in the htree.
+
+(If you think this is all quite clever and peculiar, so does the
+author.)
+
+If metadata checksums are enabled, the last 8 bytes of the directory
+block (precisely the length of one dx\_entry) are used to store a
+``struct dx_tail``, which contains the checksum. The ``limit`` and
+``count`` entries in the dx\_root/dx\_node structures are adjusted as
+necessary to fit the dx\_tail into the block. If there is no space for
+the dx\_tail, the user is notified to run e2fsck -D to rebuild the
+directory index (which will ensure that there's space for the checksum.
+The dx\_tail structure is 8 bytes long and looks like this:
+
+.. list-table::
+ :widths: 1 1 1 77
+ :header-rows: 1
+
+ * - Offset
+ - Type
+ - Name
+ - Description
+ * - 0x0
+ - u32
+ - dt\_reserved
+ - Zero.
+ * - 0x4
+ - \_\_le32
+ - dt\_checksum
+ - Checksum of the htree directory block.
+
+The checksum is calculated against the FS UUID, the htree index header
+(dx\_root or dx\_node), all of the htree indices (dx\_entry) that are in
+use, and the tail block (dx\_tail).
diff --git a/Documentation/filesystems/ext4/ondisk/dynamic.rst b/Documentation/filesystems/ext4/ondisk/dynamic.rst
index f090de8dd1c1..f2f14822b0f5 100644
--- a/Documentation/filesystems/ext4/ondisk/dynamic.rst
+++ b/Documentation/filesystems/ext4/ondisk/dynamic.rst
@@ -8,3 +8,4 @@ allocated to files.
.. include:: inodes.rst
.. include:: ifork.rst
+.. include:: directory.rst
Powered by blists - more mailing lists