[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <18FAA61C-DE3C-42C7-A8A4-BB2CDD3C5D24@gmail.com>
Date: Sat, 18 Mar 2017 11:16:26 +0300
From: Alexey Lyashkov <alexey.lyashkov@...il.com>
To: Andreas Dilger <adilger@...ger.ca>
Cc: Artem Blagodarenko <artem.blagodarenko@...il.com>,
linux-ext4 <linux-ext4@...r.kernel.org>,
Yang Sheng <yang.sheng@...el.com>,
Zhen Liang <liang.zhen@...el.com>,
Artem Blagodarenko <artem.blagodarenko@...gate.com>
Subject: Re: [PATCH] Add largedir feature
Andreas,
it not about a feature flag. It’s about a situation in whole.
Yes, we may increase a directory size, but it open a number a large problems.
1) readdir. It tries to read all entries in memory before send to the user. currently it may eats 20*10^6 * 256 so several gigs, so increasing it size may produce a problems for a system.
2) inode allocation. Current code tries to allocate an inode as near as possible to the directory inode, but one GD may hold 32k entries only, so increase a directory size will use up 1k GD for now and more than it, after it. It increase a seek time with file allocation. It was i mean when say - «dramatically decrease a file creation rate».
3) current limit with 4G inodes - currently 32-128 directories may eats a full inode number space. From it perspective large dir don’t need to be used.
Other improvements you point already.
> 17 марта 2017 г., в 23:51, Andreas Dilger <adilger@...ger.ca> написал(а):
>
> On Mar 17, 2017, at 12:15 AM, Alexey Lyashkov <alexey.lyashkov@...il.com> wrote:
>>
>> Andreas,
>>
>> I not strongly against it patch, but my testing say - 3 level h-tree need a more than 20mil files in directory, and creation rate was dropped dramatically,
>> a specially without parallel dir operations, Did we need do some other research changes with landing it patch as disk format will changed anyway by incompatible way ?
>
> Hi Alexey,
> I'm not against further improvements to large directory handling (e.g.
> directory shrinking, implement pdirops through VFS, etc.) for ext4.
> That said, anything that changes the disk format should use a different
> feature flag than EXT4_FEATURE_INCOMPAT_LARGEDIR, since we have been
> using this flag in the Lustre code for several years already.
>
> I think for very large directories that directory shrinking can be quite
> useful, and could actually be implemented without a format change since
> htree always masks off the top byte ("fullness ratio") of the logical block
> number for this reason, and removing leaf blocks does not change the other
> parts of the htree on-disk format. This could still be done (in a slightly
> less efficient manner) even without "fullness ratio" (i.e. if fullness == 0,
> unset from older versions of ext4/e2fsck). This would also help performance
> noticeably if a directory has many files deleted, as the htree index will
> shrink, and operations like readdir() will not have to skip empty blocks.
>
> IMHO, I think the 3-level htree maximum limits are as far as we need to
> go for ext4. With 4KB blocks this gives us a maximum directory size
> of about 5B entries, which is more files than can fit in the filesystem,
> Even for 1KB blocks the maximum directory size is about 8M entries, which
> is fine since 1KB blocksize is mostly only used for small filesystems, yet
> is still as good as what a 4KB filesystem can do today.
>
> Cheers, Andreas
>
>>> 17 марта 2017 г., в 0:44, Andreas Dilger <adilger@...ger.ca> написал(а):
>>>
>>> On Mar 16, 2017, at 3:51 AM, Artem Blagodarenko <artem.blagodarenko@...il.com> wrote:
>>>>
>>>> From: Artem Blagodarenko <artem.blagodarenko@...il.com>
>>>>
>>>> This INCOMPAT_LARGEDIR feature allows larger directories to be created
>>>> in ldiskfs, both with directory sizes over 2GB and and a maximum htree
>>>> depth of 3 instead of the current limit of 2. These features are needed
>>>> in order to exceed the current limit of approximately 10M entries in a
>>>> single directory.
>>>
>>> Artem,
>>> many thanks for updating and submitting this, and creating the e2fsck patches.
>>>
>>> Ted,
>>> can you please add the original author for the largedir ext4 patch when
>>> landing this patch: Liang Zhen <liang.zhen@...el.com>
>>>
>>> and these tags which contain more background information on this feature:
>>>
>>> Lustre-change: https://review.hpdd.intel.com/375
>>> Intel-bug-id: https://jira.hpdd.intel.com/browse/LU-50
>>>
>>>> Signed-off-by: Yang Sheng <yang.sheng@...el.com>
>>>> Signed-off-by: Artem Blagodarenko <artem.blagodarenko@...gate.com>
>>>
>>> Reviewed-by: Andreas Dilger <andreas.dilger@...el.com>
>>>
>>>> ---
>>>> fs/ext4/ext4.h | 23 ++++++++---
>>>> fs/ext4/inode.c | 4 +-
>>>> fs/ext4/namei.c | 118 +++++++++++++++++++++++++++++++++++++++-----------------
>>>> 3 files changed, 102 insertions(+), 43 deletions(-)
>>>>
>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>> index 01d52b9..0bbbd9b 100644
>>>> --- a/fs/ext4/ext4.h
>>>> +++ b/fs/ext4/ext4.h
>>>> @@ -1799,7 +1799,8 @@ static inline void ext4_clear_state_flags(struct ext4_inode_info *ei)
>>>> EXT4_FEATURE_INCOMPAT_MMP | \
>>>> EXT4_FEATURE_INCOMPAT_INLINE_DATA | \
>>>> EXT4_FEATURE_INCOMPAT_ENCRYPT | \
>>>> - EXT4_FEATURE_INCOMPAT_CSUM_SEED)
>>>> + EXT4_FEATURE_INCOMPAT_CSUM_SEED | \
>>>> + EXT4_FEATURE_INCOMPAT_LARGEDIR)
>>>> #define EXT4_FEATURE_RO_COMPAT_SUPP (EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER| \
>>>> EXT4_FEATURE_RO_COMPAT_LARGE_FILE| \
>>>> EXT4_FEATURE_RO_COMPAT_GDT_CSUM| \
>>>> @@ -2125,6 +2126,16 @@ struct dir_private_info {
>>>> */
>>>> #define ERR_BAD_DX_DIR (-(MAX_ERRNO - 1))
>>>>
>>>> +/* htree levels for ext4 */
>>>> +#define EXT4_HTREE_LEVEL_COMPAT 2
>>>> +#define EXT4_HTREE_LEVEL 3
>>>> +
>>>> +static inline int ext4_dir_htree_level(struct super_block *sb)
>>>> +{
>>>> + return ext4_has_feature_largedir(sb) ?
>>>> + EXT4_HTREE_LEVEL : EXT4_HTREE_LEVEL_COMPAT;
>>>> +}
>>>> +
>>>> /*
>>>> * Timeout and state flag for lazy initialization inode thread.
>>>> */
>>>> @@ -2758,13 +2769,15 @@ static inline void ext4_r_blocks_count_set(struct ext4_super_block *es,
>>>> es->s_r_blocks_count_hi = cpu_to_le32(blk >> 32);
>>>> }
>>>>
>>>> -static inline loff_t ext4_isize(struct ext4_inode *raw_inode)
>>>> +static inline loff_t ext4_isize(struct super_block *sb,
>>>> + struct ext4_inode *raw_inode)
>>>> {
>>>> - if (S_ISREG(le16_to_cpu(raw_inode->i_mode)))
>>>> + if (ext4_has_feature_largedir(sb) ||
>>>> + S_ISREG(le16_to_cpu(raw_inode->i_mode)))
>>>> return ((loff_t)le32_to_cpu(raw_inode->i_size_high) << 32) |
>>>> le32_to_cpu(raw_inode->i_size_lo);
>>>> - else
>>>> - return (loff_t) le32_to_cpu(raw_inode->i_size_lo);
>>>> +
>>>> + return (loff_t) le32_to_cpu(raw_inode->i_size_lo);
>>>> }
>>>>
>>>> static inline void ext4_isize_set(struct ext4_inode *raw_inode, loff_t i_size)
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index f622d4a..5787f3d 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -4682,7 +4682,7 @@ struct inode *ext4_iget(struct super_block *sb, unsigned long ino)
>>>> if (ext4_has_feature_64bit(sb))
>>>> ei->i_file_acl |=
>>>> ((__u64)le16_to_cpu(raw_inode->i_file_acl_high)) << 32;
>>>> - inode->i_size = ext4_isize(raw_inode);
>>>> + inode->i_size = ext4_isize(sb, raw_inode);
>>>> if ((size = i_size_read(inode)) < 0) {
>>>> EXT4_ERROR_INODE(inode, "bad i_size value: %lld", size);
>>>> ret = -EFSCORRUPTED;
>>>> @@ -5008,7 +5008,7 @@ static int ext4_do_update_inode(handle_t *handle,
>>>> raw_inode->i_file_acl_high =
>>>> cpu_to_le16(ei->i_file_acl >> 32);
>>>> raw_inode->i_file_acl_lo = cpu_to_le32(ei->i_file_acl);
>>>> - if (ei->i_disksize != ext4_isize(raw_inode)) {
>>>> + if (ei->i_disksize != ext4_isize(inode->i_sb, raw_inode)) {
>>>> ext4_isize_set(raw_inode, ei->i_disksize);
>>>> need_datasync = 1;
>>>> }
>>>> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
>>>> index 6ad612c..3298fe3 100644
>>>> --- a/fs/ext4/namei.c
>>>> +++ b/fs/ext4/namei.c
>>>> @@ -513,7 +513,7 @@ static inline int ext4_handle_dirty_dx_node(handle_t *handle,
>>>>
>>>> static inline ext4_lblk_t dx_get_block(struct dx_entry *entry)
>>>> {
>>>> - return le32_to_cpu(entry->block) & 0x00ffffff;
>>>> + return le32_to_cpu(entry->block) & 0x0fffffff;
>>>> }
>>>
>>> It wouldn't be terrible to add a comment to this function explaining why
>>> the block numbers are masked off, which is to allow the high bits of the
>>> logical block number to hold a "fullness" fraction so that the kernel
>>> could start shrinking directories if many files are deleted. That said,
>>> this isn't a shortcoming of this patch, but a suggestion if the patch is
>>> being resubmitted for some other reason.
>>>
>>>> static inline void dx_set_block(struct dx_entry *entry, ext4_lblk_t value)
>>>> @@ -739,6 +739,7 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
>>>> struct dx_frame *ret_err = ERR_PTR(ERR_BAD_DX_DIR);
>>>> u32 hash;
>>>>
>>>> + memset(frame_in, 0, EXT4_HTREE_LEVEL * sizeof(frame_in[0]));
>>>> frame->bh = ext4_read_dirblock(dir, 0, INDEX);
>>>> if (IS_ERR(frame->bh))
>>>> return (struct dx_frame *) frame->bh;
>>>> @@ -768,9 +769,15 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
>>>> }
>>>>
>>>> indirect = root->info.indirect_levels;
>>>> - if (indirect > 1) {
>>>> - ext4_warning_inode(dir, "Unimplemented hash depth: %#06x",
>>>> - root->info.indirect_levels);
>>>> + if (indirect >= ext4_dir_htree_level(dir->i_sb)) {
>>>> + ext4_warning(dir->i_sb,
>>>> + "Directory (ino: %lu) htree depth %#06x exceed"
>>>> + "supported value", dir->i_ino,
>>>> + ext4_dir_htree_level(dir->i_sb));
>>>> + if (ext4_dir_htree_level(dir->i_sb) < EXT4_HTREE_LEVEL) {
>>>> + ext4_warning(dir->i_sb, "Enable large directory "
>>>> + "feature to access it");
>>>> + }
>>>> goto fail;
>>>> }
>>>>
>>>> @@ -859,12 +866,19 @@ struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
>>>>
>>>> static void dx_release(struct dx_frame *frames)
>>>> {
>>>> + struct dx_root_info *info;
>>>> + int i;
>>>> +
>>>> if (frames[0].bh == NULL)
>>>> return;
>>>>
>>>> - if (((struct dx_root *)frames[0].bh->b_data)->info.indirect_levels)
>>>> - brelse(frames[1].bh);
>>>> - brelse(frames[0].bh);
>>>> + info = &((struct dx_root *)frames[0].bh->b_data)->info;
>>>> + for (i = 0; i <= info->indirect_levels; i++) {
>>>> + if (frames[i].bh == NULL)
>>>> + break;
>>>> + brelse(frames[i].bh);
>>>> + frames[i].bh = NULL;
>>>> + }
>>>> }
>>>>
>>>> /*
>>>> @@ -1050,7 +1064,7 @@ int ext4_htree_fill_tree(struct file *dir_file, __u32 start_hash,
>>>> {
>>>> struct dx_hash_info hinfo;
>>>> struct ext4_dir_entry_2 *de;
>>>> - struct dx_frame frames[2], *frame;
>>>> + struct dx_frame frames[EXT4_HTREE_LEVEL], *frame;
>>>> struct inode *dir;
>>>> ext4_lblk_t block;
>>>> int count = 0;
>>>> @@ -1517,7 +1531,7 @@ static struct buffer_head * ext4_dx_find_entry(struct inode *dir,
>>>> struct ext4_dir_entry_2 **res_dir)
>>>> {
>>>> struct super_block * sb = dir->i_sb;
>>>> - struct dx_frame frames[2], *frame;
>>>> + struct dx_frame frames[EXT4_HTREE_LEVEL], *frame;
>>>> const struct qstr *d_name = fname->usr_fname;
>>>> struct buffer_head *bh;
>>>> ext4_lblk_t block;
>>>> @@ -1947,7 +1961,7 @@ static int add_dirent_to_buf(handle_t *handle, struct ext4_filename *fname,
>>>> */
>>>> dir->i_mtime = dir->i_ctime = current_time(dir);
>>>> ext4_update_dx_flag(dir);
>>>> - dir->i_version++;
>>>> + inode_inc_iversion(dir);
>>>> ext4_mark_inode_dirty(handle, dir);
>>>> BUFFER_TRACE(bh, "call ext4_handle_dirty_metadata");
>>>> err = ext4_handle_dirty_dirent_node(handle, dir, bh);
>>>> @@ -1966,7 +1980,7 @@ static int make_indexed_dir(handle_t *handle, struct ext4_filename *fname,
>>>> {
>>>> struct buffer_head *bh2;
>>>> struct dx_root *root;
>>>> - struct dx_frame frames[2], *frame;
>>>> + struct dx_frame frames[EXT4_HTREE_LEVEL], *frame;
>>>> struct dx_entry *entries;
>>>> struct ext4_dir_entry_2 *de, *de2;
>>>> struct ext4_dir_entry_tail *t;
>>>> @@ -2185,13 +2199,16 @@ static int ext4_add_entry(handle_t *handle, struct dentry *dentry,
>>>> static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
>>>> struct inode *dir, struct inode *inode)
>>>> {
>>>> - struct dx_frame frames[2], *frame;
>>>> + struct dx_frame frames[EXT4_HTREE_LEVEL], *frame;
>>>> struct dx_entry *entries, *at;
>>>> struct buffer_head *bh;
>>>> struct super_block *sb = dir->i_sb;
>>>> struct ext4_dir_entry_2 *de;
>>>> + int restart;
>>>> int err;
>>>>
>>>> +again:
>>>> + restart = 0;
>>>> frame = dx_probe(fname, dir, NULL, frames);
>>>> if (IS_ERR(frame))
>>>> return PTR_ERR(frame);
>>>> @@ -2213,24 +2230,44 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
>>>> if (err != -ENOSPC)
>>>> goto cleanup;
>>>>
>>>> + err = 0;
>>>> /* Block full, should compress but for now just split */
>>>> dxtrace(printk(KERN_DEBUG "using %u of %u node entries\n",
>>>> dx_get_count(entries), dx_get_limit(entries)));
>>>> /* Need to split index? */
>>>> if (dx_get_count(entries) == dx_get_limit(entries)) {
>>>> ext4_lblk_t newblock;
>>>> - unsigned icount = dx_get_count(entries);
>>>> - int levels = frame - frames;
>>>> + int levels = frame - frames + 1;
>>>> + unsigned int icount;
>>>> + int add_level = 1;
>>>> struct dx_entry *entries2;
>>>> struct dx_node *node2;
>>>> struct buffer_head *bh2;
>>>>
>>>> - if (levels && (dx_get_count(frames->entries) ==
>>>> - dx_get_limit(frames->entries))) {
>>>> - ext4_warning_inode(dir, "Directory index full!");
>>>> + while (frame > frames) {
>>>> + if (dx_get_count((frame - 1)->entries) <
>>>> + dx_get_limit((frame - 1)->entries)) {
>>>> + add_level = 0;
>>>> + break;
>>>> + }
>>>> + frame--; /* split higher index block */
>>>> + at = frame->at;
>>>> + entries = frame->entries;
>>>> + restart = 1;
>>>> + }
>>>> + if (add_level && levels == ext4_dir_htree_level(sb)) {
>>>> + ext4_warning(sb, "Directory (ino: %lu) index full, "
>>>> + "reach max htree level :%d",
>>>> + dir->i_ino, levels);
>>>> + if (ext4_dir_htree_level(sb) < EXT4_HTREE_LEVEL) {
>>>> + ext4_warning(sb, "Large directory feature is "
>>>> + "not enabled on this "
>>>> + "filesystem");
>>>> + }
>>>> err = -ENOSPC;
>>>> goto cleanup;
>>>> }
>>>> + icount = dx_get_count(entries);
>>>> bh2 = ext4_append(handle, dir, &newblock);
>>>> if (IS_ERR(bh2)) {
>>>> err = PTR_ERR(bh2);
>>>> @@ -2245,7 +2282,7 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
>>>> err = ext4_journal_get_write_access(handle, frame->bh);
>>>> if (err)
>>>> goto journal_error;
>>>> - if (levels) {
>>>> + if (!add_level) {
>>>> unsigned icount1 = icount/2, icount2 = icount - icount1;
>>>> unsigned hash2 = dx_get_hash(entries + icount1);
>>>> dxtrace(printk(KERN_DEBUG "Split index %i/%i\n",
>>>> @@ -2253,7 +2290,7 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
>>>>
>>>> BUFFER_TRACE(frame->bh, "get_write_access"); /* index root */
>>>> err = ext4_journal_get_write_access(handle,
>>>> - frames[0].bh);
>>>> + (frame - 1)->bh);
>>>> if (err)
>>>> goto journal_error;
>>>>
>>>> @@ -2269,17 +2306,23 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
>>>> frame->entries = entries = entries2;
>>>> swap(frame->bh, bh2);
>>>> }
>>>> - dx_insert_block(frames + 0, hash2, newblock);
>>>> - dxtrace(dx_show_index("node", frames[1].entries));
>>>> + dx_insert_block((frame - 1), hash2, newblock);
>>>> + dxtrace(dx_show_index("node", frame->entries));
>>>> dxtrace(dx_show_index("node",
>>>> ((struct dx_node *) bh2->b_data)->entries));
>>>> err = ext4_handle_dirty_dx_node(handle, dir, bh2);
>>>> if (err)
>>>> goto journal_error;
>>>> brelse (bh2);
>>>> + ext4_handle_dirty_metadata(handle, dir,
>>>> + (frame - 1)->bh);
>>>> + if (restart) {
>>>> + ext4_handle_dirty_metadata(handle, dir,
>>>> + frame->bh);
>>>> + goto cleanup;
>>>> + }
>>>> } else {
>>>> - dxtrace(printk(KERN_DEBUG
>>>> - "Creating second level index...\n"));
>>>> + struct dx_root *dxroot;
>>>> memcpy((char *) entries2, (char *) entries,
>>>> icount * sizeof(struct dx_entry));
>>>> dx_set_limit(entries2, dx_node_limit(dir));
>>>> @@ -2287,19 +2330,17 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
>>>> /* Set up root */
>>>> dx_set_count(entries, 1);
>>>> dx_set_block(entries + 0, newblock);
>>>> - ((struct dx_root *) frames[0].bh->b_data)->info.indirect_levels = 1;
>>>> -
>>>> - /* Add new access path frame */
>>>> - frame = frames + 1;
>>>> - frame->at = at = at - entries + entries2;
>>>> - frame->entries = entries = entries2;
>>>> - frame->bh = bh2;
>>>> - err = ext4_journal_get_write_access(handle,
>>>> - frame->bh);
>>>> - if (err)
>>>> - goto journal_error;
>>>> + dxroot = (struct dx_root *)frames[0].bh->b_data;
>>>> + dxroot->info.indirect_levels += 1;
>>>> + dxtrace(printk(KERN_DEBUG
>>>> + "Creating %d level index...\n",
>>>> + info->indirect_levels));
>>>> + ext4_handle_dirty_metadata(handle, dir, frame->bh);
>>>> + ext4_handle_dirty_metadata(handle, dir, bh2);
>>>> + brelse(bh2);
>>>> + restart = 1;
>>>> + goto cleanup;
>>>> }
>>>> - err = ext4_handle_dirty_dx_node(handle, dir, frames[0].bh);
>>>> if (err) {
>>>> ext4_std_error(inode->i_sb, err);
>>>> goto cleanup;
>>>> @@ -2318,6 +2359,11 @@ static int ext4_dx_add_entry(handle_t *handle, struct ext4_filename *fname,
>>>> cleanup:
>>>> brelse(bh);
>>>> dx_release(frames);
>>>> + /* @restart is true means htree-path has been changed, we need to
>>>> + * repeat dx_probe() to find out valid htree-path
>>>> + */
>>>> + if (restart && err == 0)
>>>> + goto again;
>>>> return err;
>>>> }
>>>>
>>>> @@ -2354,7 +2400,7 @@ int ext4_generic_delete_entry(handle_t *handle,
>>>> blocksize);
>>>> else
>>>> de->inode = 0;
>>>> - dir->i_version++;
>>>> + inode_inc_iversion(dir);
>>>> return 0;
>>>> }
>>>> i += ext4_rec_len_from_disk(de->rec_len, blocksize);
>>>> --
>>>> 1.8.3.1
>>>>
>>>
>>>
>>> Cheers, Andreas
>>>
>>>
>>>
>>>
>>>
>>
>
>
> Cheers, Andreas
>
>
>
>
>
Powered by blists - more mailing lists