[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250924193712.GD19231@mit.edu>
Date: Wed, 24 Sep 2025 15:37:12 -0400
From: "Theodore Ts'o" <tytso@....edu>
To: Zeno Endemann <zeno.endemann@...lbox.org>
Cc: linux-ext4@...r.kernel.org
Subject: Re: ext4: Question about directory entry minor hash usage
(documentation error?)
On Wed, Sep 24, 2025 at 06:04:08PM +0200, Zeno Endemann wrote:
>
> The documentation of hash tree directories claims that interior nodes are
> "indexed by a minor hash".
Yes, that's incorrect.
> However from my current understanding of the code, it seems to me the node
> splitting works somewhat like regular B-trees, and there is no re-sorting
> with a minor hash going on. The minor hash doesn't influence at all the
> on-disk data structure, and is only used for sorting in a kernel internal
> rb-tree. Is this correct? If so, I could offer to write up a patch for the
> documentation.
The hash is used as the directory cookie for the puroses of telldir(2)
and seekdir(2) on 32-bit systems. On 64-bit systems, we use the hash
plus the minor_hash to form a 64-bit cookie, or "directory offset".
This is also used for nfsv2 (which suports a 32-bit directory cookie),
ad nfsv3 (which supports a 64-bit directory cookie).
> As a side question, I was wondering a bit why the kernel differentiates
> between htree-indexed dirs and others when simply iterating over it (as in
> e.g. ext4_readdir), and what the point of that rb-tree there is, i.e. why one
> would want to iterate over the entries in hash tree order.
The reason for this is POSIX requirements for how readdir(), telldir()
and seekdir works. When your use readdir() it must return each
directory entry once and only once. And if while you are calling
readdir() a directory entry gets deleted or added, the added or
removed directory entry must be returned zero or one times, and all
other directory entries exactly once. If you use telldir() to save a
position in the readdirt() stream, and then go back to it using
seekdir(), the rules of "exactly once unless entry is added/removed in
which case it is returned zero or one times" apply.
If you are using a linear directory structure, like the old V7 Unix,
this works fine. But if you are usig a b-tree, where an interior node
might get split, with half of the directory entries getting moved to a
newly allocated block, the POSIX readdir() semantics are extremely
difficult to implement *unless* you interate over the directory in
hash tree order.
(There are other solutions, but they are much more complicated.)
- Ted
Powered by blists - more mailing lists