[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <fb756477-b6b8-4b4a-be63-3a5f18241166@mailbox.org>
Date: Thu, 25 Sep 2025 17:58:55 +0200
From: Zeno Endemann <zeno.endemann@...lbox.org>
To: Theodore Ts'o <tytso@....edu>
Cc: linux-ext4@...r.kernel.org
Subject: Re: ext4: Question about directory entry minor hash usage
(documentation error?)
Theodore Ts'o wrote on 24.09.25 21:37:
> On Wed, Sep 24, 2025 at 06:04:08PM +0200, Zeno Endemann wrote:
>>
>> The documentation of hash tree directories claims that interior nodes are
>> "indexed by a minor hash".
>
> Yes, that's incorrect.
>
>> However from my current understanding of the code, it seems to me the node
>> splitting works somewhat like regular B-trees, and there is no re-sorting
>> with a minor hash going on. The minor hash doesn't influence at all the
>> on-disk data structure, and is only used for sorting in a kernel internal
>> rb-tree. Is this correct? If so, I could offer to write up a patch for the
>> documentation.
>
> The hash is used as the directory cookie for the puroses of telldir(2)
> and seekdir(2) on 32-bit systems. On 64-bit systems, we use the hash
> plus the minor_hash to form a 64-bit cookie, or "directory offset".
> This is also used for nfsv2 (which suports a 32-bit directory cookie),
> ad nfsv3 (which supports a 64-bit directory cookie).
>
>> As a side question, I was wondering a bit why the kernel differentiates
>> between htree-indexed dirs and others when simply iterating over it (as in
>> e.g. ext4_readdir), and what the point of that rb-tree there is, i.e. why one
>> would want to iterate over the entries in hash tree order.
>
> The reason for this is POSIX requirements for how readdir(), telldir()
> and seekdir works. When your use readdir() it must return each
> directory entry once and only once. And if while you are calling
> readdir() a directory entry gets deleted or added, the added or
> removed directory entry must be returned zero or one times, and all
> other directory entries exactly once. If you use telldir() to save a
> position in the readdirt() stream, and then go back to it using
> seekdir(), the rules of "exactly once unless entry is added/removed in
> which case it is returned zero or one times" apply.
>
> If you are using a linear directory structure, like the old V7 Unix,
> this works fine. But if you are usig a b-tree, where an interior node
> might get split, with half of the directory entries getting moved to a
> newly allocated block, the POSIX readdir() semantics are extremely
> difficult to implement *unless* you interate over the directory in
> hash tree order.
>
> (There are other solutions, but they are much more complicated.)
I see, that makes sense. Thanks for the explanation.
Powered by blists - more mailing lists