[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <EA1BCCB1-AEC7-4193-823B-6230B6860C0C@gmail.com>
Date: Sat, 18 Mar 2017 20:17:55 +0300
From: Alexey Lyashkov <alexey.lyashkov@...il.com>
To: Theodore Ts'o <tytso@....edu>
Cc: Andreas Dilger <adilger@...ger.ca>,
Artem Blagodarenko <artem.blagodarenko@...il.com>,
linux-ext4 <linux-ext4@...r.kernel.org>,
Yang Sheng <yang.sheng@...el.com>,
Zhen Liang <liang.zhen@...el.com>,
Artem Blagodarenko <artem.blagodarenko@...gate.com>
Subject: Re: [PATCH] Add largedir feature
> 18 марта 2017 г., в 19:29, Theodore Ts'o <tytso@....edu> написал(а):
>
> On Sat, Mar 18, 2017 at 11:16:26AM +0300, Alexey Lyashkov wrote:
>> Andreas,
>>
>> it not about a feature flag. It’s about a situation in whole.
>> Yes, we may increase a directory size, but it open a number a large problems.
>
>> 1) readdir. It tries to read all entries in memory before send to
>> the user. currently it may eats 20*10^6 * 256 so several gigs, so
>> increasing it size may produce a problems for a system.
>
> That's not true. We normally only read in one block a time. If there
> is a hash collision, then we may need to insert into the rbtree in a
> subsequent block's worth of dentries to make sure we have all of the
> directory entries corresponding to a particular hash value. I think
> you misunderstood the code.
As i see it not about hash collisions, but about merging a several blocks into same hash range on up level hash entry.
so if we have a large hash range originally assigned to the single block, all that range will read at memory at single step.
With «aged» directory when hash blocks used already - it’s easy to hit.
>
>> 2) inode allocation. Current code tries to allocate an inode as near as possible to the directory inode, but one GD may hold 32k entries only, so increase a directory size will use up 1k GD for now and more than it, after it. It increase a seek time with file allocation. It was i mean when say - «dramatically decrease a file creation rate».
>
> But there are also only 32k blocks in a group descriptor, and we try
> to keep the blocks allocated close to the inode.
with bigalloc feature it’s not a 32k blocks. but 32k clusters with 1M cluster size(as example), it very large space.
> So if you are using
> a huge directory, and you are using a storage device with a
> significant seek penalty, then yes, no matter what as the directory
> grows, the time to iterate over all of the files does grow. But there
> is more to life than microbenchmarks which creates huge numbers of zero
> length files! If we assume that the files are going to need to
> contain _data_, and the data blocks should be close to the inodes,
> then there are going to be some performance impacts no matter what.
>
Yes, i expect to have some seek penalty. But may testing say it too huge now.
directory creation rate started with 80k create/s have dropped to the 20k-30k create/s with hash tree extend to the level 3.
Same testing with hard links same create rate dropped slightly.
>> 3) current limit with 4G inodes - currently 32-128 directories may eats a full inode number space. From it perspective large dir don’t need to be used.
>
> I can imagine a new feature flag which defines the use a 64-bit inode
> number, but that's more for people who are creating a file system that
> takes advantage of 64-bit block numbers, and they are intending on
> using all of that space to store small (< 4k or < 8k) files.
>
> And it's also true that there are huge advantges to using a
> multi-level directory hierarchy --- e.g.:
>
> .git/objects/03/08e42105258d4e53ffeb81ffb2a4b2480bb8b8
>
> or even
>
> .git/objects/03/08/e42105258d4e53ffeb81ffb2a4b2480bb8b8
>
> instead of:
>
> .git/objects/0308e42105258d4e53ffeb81ffb2a4b2480bb8b8
>
> but that's an application level question. If for some reason some
> silly application programmer wants to have a single gargantuan
> directory, if the patches to support it are fairly simple, even if
> someone is going to give us patches to do something more general,
> burning an extra feature flag doesn't seem like the most terrible
> thing in the world.
From other side - application don’t expect to have very slow directory and have access with some constant or near or it speed.
>
> As for the other optimizations --- things like allowing parallel
> directory modifications, or being able to shrink empty directory
> blocks or shorten the tree are all improvements we can make without
> impacting the on-disk format. So they aren't an argument for halting
> the submission of the new on-disk format, no?
>
It’s argument about using this feature. Yes, we can land it, but it decrease an expected speed in some cases.
Alexey
Powered by blists - more mailing lists