Date:   Sat, 18 Mar 2017 12:29:53 -0400
From:   Theodore Ts'o <tytso@....edu>
To:     Alexey Lyashkov <alexey.lyashkov@...il.com>
Cc:     Andreas Dilger <adilger@...ger.ca>,
        Artem Blagodarenko <artem.blagodarenko@...il.com>,
        linux-ext4 <linux-ext4@...r.kernel.org>,
        Yang Sheng <yang.sheng@...el.com>,
        Zhen Liang <liang.zhen@...el.com>,
        Artem Blagodarenko <artem.blagodarenko@...gate.com>
Subject: Re: [PATCH] Add largedir feature

On Sat, Mar 18, 2017 at 11:16:26AM +0300, Alexey Lyashkov wrote:
> Andreas,
> 
> It's not about a feature flag, it's about the situation as a whole.
> Yes, we can increase the directory size, but it opens up a number of
> large problems.

> 1) readdir. It tries to read all of the entries into memory before
> sending them to the user.  Currently that can eat 20*10^6 * 256
> bytes, i.e. several GiB, so increasing the directory size may cause
> problems for the system.

That's not true.  We normally only read in one block at a time.  If
there is a hash collision, then we may need to insert the dentries
from a subsequent block into the rbtree to make sure we have all of
the directory entries corresponding to a particular hash value.  I
think you misunderstood the code.
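
To make the memory behavior concrete, here is a rough userspace
sketch (mine, not the kernel code path): even a directory with tens
of millions of entries can be walked with one small, fixed-size
buffer that is reused on every getdents64() call, so memory use does
not scale with the directory size.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Layout of the records returned by getdents64(2). */
struct linux_dirent64 {
	uint64_t	d_ino;
	int64_t		d_off;
	unsigned short	d_reclen;
	unsigned char	d_type;
	char		d_name[];
};

int main(int argc, char **argv)
{
	int fd = open(argc > 1 ? argv[1] : ".", O_RDONLY | O_DIRECTORY);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	_Alignas(8) char buf[32768];	/* fixed buffer, reused every call */
	unsigned long count = 0;
	for (;;) {
		long n = syscall(SYS_getdents64, fd, buf, sizeof(buf));
		if (n <= 0)		/* 0 = end of directory, <0 = error */
			break;
		for (long pos = 0; pos < n; ) {
			struct linux_dirent64 *d = (void *)(buf + pos);
			count++;
			pos += d->d_reclen;
		}
	}
	printf("%lu entries\n", count);
	close(fd);
	return 0;
}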

> 2) inode allocation. The current code tries to allocate an inode as
> near as possible to the directory inode, but one GD (group
> descriptor) can hold only 32k entries, so increasing the directory
> size will use up 1k GDs to start with, and more after that.  That
> increases the seek time for file allocation.  That is what I meant
> when I said it would "dramatically decrease the file creation rate".

But there are also only 32k blocks in a block group, and we try to
keep a file's blocks allocated close to its inode.  So if you are
using a huge directory, and you are using a storage device with a
significant seek penalty, then yes, as the directory grows, the time
to iterate over all of the files grows no matter what.  But there is
more to life than microbenchmarks which create huge numbers of
zero-length files!  If we assume that the files are going to need to
contain _data_, and the data blocks should be close to the inodes,
then there are going to be some performance impacts no matter what.
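
Some rough numbers (my own back-of-the-envelope, assuming 4KiB blocks
and the usual 32,768 inodes per block group):

    20,000,000 inodes / 32,768 inodes per group  ~= 611 block groups
    611 block groups  * 128 MiB per group        ~= 76 GiB of on-disk span

so the inodes and data blocks for a 20-million-entry directory are
necessarily spread across hundreds of block groups, and on spinning
media that cost shows up regardless of how the directory itself is
indexed.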

> 3) The current limit of 4G inodes - right now 32-128 directories
> could eat up the entire inode number space.  From that perspective,
> large directories don't need to be used.

I can imagine a new feature flag which defines the use of 64-bit
inode numbers, but that's more for people who are creating a file
system that takes advantage of 64-bit block numbers, and they are
intending on using all of that space to store small (< 4k or < 8k)
files.
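
(The arithmetic behind that 32-128 figure checks out: a 32-bit inode
space is 2^32 ~= 4.3 * 10^9 inodes, so at ~33 million entries per
directory it takes 128 such directories to exhaust it, and at ~134
million entries only 32.  That's exactly the sort of deployment a
64-bit inode number feature would be aimed at.)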

And it's also true that there are huge advantages to using a
multi-level directory hierarchy --- e.g.:

.git/objects/03/08e42105258d4e53ffeb81ffb2a4b2480bb8b8

or even

.git/objects/03/08/e42105258d4e53ffeb81ffb2a4b2480bb8b8

instead of:

.git/objects/0308e42105258d4e53ffeb81ffb2a4b2480bb8b8

but that's an application-level question.  If for some reason some
silly application programmer wants to have a single gargantuan
directory, and the patches to support it are fairly simple, then even
if someone is eventually going to give us patches to do something
more general, burning an extra feature flag doesn't seem like the
most terrible thing in the world.
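
To make the fanout idea concrete, here is a toy sketch (mine, not
git's actual code) of how an application spreads a flat namespace of
hashed names across a two-level directory tree:

#include <stdio.h>

/* Build ".git/objects/<first 2 hex chars>/<remaining chars>" so that
 * no single directory has to hold every object. */
static void fanout_path(const char *hex, char *out, size_t outlen)
{
	snprintf(out, outlen, ".git/objects/%.2s/%s", hex, hex + 2);
}

int main(void)
{
	char path[256];

	fanout_path("0308e42105258d4e53ffeb81ffb2a4b2480bb8b8",
		    path, sizeof(path));
	printf("%s\n", path);	/* .git/objects/03/08e421... */
	return 0;
}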

As for the other optimizations --- things like allowing parallel
directory modifications, or being able to shrink empty directory
blocks or shorten the tree are all improvements we can make without
impacting the on-disk format.  So they aren't an argument for halting
the submission of the new on-disk format, no?

    	       	      	  	  	  - Ted
