Message-Id: <9260CD06-FEF8-4312-84CE-DA5053979297@dilger.ca>
Date: Tue, 15 Dec 2015 14:32:52 -0700
From: Andreas Dilger <adilger@...ger.ca>
To: lokesh jaliminche <lokesh.jaliminche@...il.com>
Cc: linux-ext4@...r.kernel.org
Subject: Re: Regarding random group search start for allocation of inode.
On Dec 15, 2015, at 3:33 AM, lokesh jaliminche <lokesh.jaliminche@...il.com> wrote:
>
> "No, this isn't correct. This loop is looking for the *best* group it
> can find, and your "break" would have it exit the loop as soon as the
> *first* group that matches the conditions is found."
>
> But since we check all the groups against the same conditions, how
> does that guarantee a better group selection? As per my understanding
> we are just wasting time in the loop.
The important part of the loop is ensuring that the selected group is always
improving over the previous one:
	if (le16_to_cpu(desc->bg_used_dirs_count) >= best_ndir)
		continue;
The "best_ndir" value tracks for the best group found so far the number of
used directories in the group, and if the new directory has fewer directories
than the previous "best" directory and still meets all the other criteria
(fewer than average inodes allocated, etc) then the new group will be chosen.
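To make that concrete, here is a simplified sketch of the scan (based on the
snippet quoted further down in this thread; the initialization and the
surrounding function are omitted, so treat it as illustrative rather than the
exact kernel code):

	best_ndir = inodes_per_group;		/* worse than any real group */
	ret = -1;
	for (i = 0; i < ngroups; i++) {
		g = (parent_group + i) % ngroups;
		get_orlov_stats(sb, g, flex_size, &stats);
		if (!stats.free_inodes)
			continue;		/* group has no free inodes */
		if (stats.used_dirs >= best_ndir)
			continue;		/* not better than the best so far */
		if (stats.free_inodes < avefreei)
			continue;		/* below-average free inodes */
		if (stats.free_blocks < avefreeb)
			continue;		/* below-average free blocks */
		grp = g;			/* remember the new best group */
		ret = 0;
		best_ndir = stats.used_dirs;	/* raise the bar for later groups */
	}
	if (ret)
		goto fallback;			/* nothing acceptable was found */

Because "best_ndir" only ever decreases, a later group can replace "grp" only
if it is strictly better than every group accepted before it.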
That said, you are correct that the loop can spend a lot of time searching
needlessly. It would be trivial to add a check at the start of the loop:
	/* the group can't get any better than empty */
	if (le16_to_cpu(desc->bg_free_inodes_count) == inodes_per_group &&
	    le16_to_cpu(desc->bg_free_blocks_count) == EXT4_BLOCKS_PER_GROUP(sb))
		goto found;
This jumps directly to "found" with "desc" and "group" already set, so
there is no need to go through the extra steps of setting "best_desc" and
"best_group" and then breaking out of the loop just to set "desc" and "group"
again.
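As a rough, untested sketch of how that early exit could look (this assumes a
"found" label after the loop as described above; the exact field names and
surrounding bookkeeping differ between kernel versions, so adapt as needed):

	for (i = 0; i < ngroups; i++) {
		group = (parent_group + i) % ngroups;
		desc = ext4_get_group_desc(sb, group, NULL);
		if (!desc || !le16_to_cpu(desc->bg_free_inodes_count))
			continue;

		/* the group can't get any better than empty: stop the search */
		if (le16_to_cpu(desc->bg_free_inodes_count) == inodes_per_group &&
		    le16_to_cpu(desc->bg_free_blocks_count) ==
						EXT4_BLOCKS_PER_GROUP(sb))
			goto found;

		/* ... the existing "best group so far" checks and bookkeeping ... */
	}
	...
found:
	/* "group" and "desc" already identify the chosen group here */

Note that in the flex_bg case the free-inode and free-block counts are
aggregated over the whole flex group, so the "completely empty" thresholds
would need to be scaled accordingly in the version you patch.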
Since you are the one who found this optimization, could you please submit a
patch to this effect so you get the credit?
Cheers, Andreas
> On Sat, Dec 5, 2015 at 3:24 AM, Andreas Dilger <adilger@...ger.ca> wrote:
>>
>>> On Dec 3, 2015, at 12:13 PM, lokesh jaliminche <lokesh.jaliminche@...il.com> wrote:
>>>
>>> Oh, thanks for the clarification. There is one more thing I would like
>>> to point out here.
>>> In the code there is a loop that scans the groups for inode
>>> allocation (inside the find_group_orlov function).
>>> There are some policies for group selection; while scanning the
>>> groups, it checks that these policies are satisfied.
>>> If a particular group satisfies these policies it should get selected
>>> for inode allocation, but instead it keeps looking at the next groups.
>>> I think there is a missing break condition. I have added a break there,
>>> and here is the patch for that. Is there any reason for not having a
>>> break condition here?
>>>
>>> diff -Nur linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c
>>> linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c
>>> --- linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c 2014-04-12
>>> 01:20:31.000000000 +0530
>>> +++ linux-2.6.32-431.17.1.el6.x86_64/fs/ext4/ialloc.c 2015-11-29
>>> 21:36:51.805542209 +0530
>>> @@ -529,6 +529,7 @@
>>>  			grp = g;
>>>  			ret = 0;
>>>  			best_ndir = stats.used_dirs;
>>> +			break;
>>>  		}
>>>  		if (ret)
>>>  			goto fallback;
>>
>> No, this isn't correct. This loop is looking for the *best* group it can find,
>> and your "break" would have it exit the loop as soon as the *first* group
>> that matches the conditions is found. Since those conditions are fairly weak,
>> for example that the group actually has free inodes, and it has better than
>> average free inodes and blocks, it makes sense to search beyond just the first
>> matching group.
>>
>> That said, it also doesn't make sense to search beyond a "perfect" group that
>> has no allocated inodes and no allocated blocks, so a break condition could be
>> added to this loop and make it more efficient, especially for very large
>> filesystems that have 128k+ groups.
>>
>> It should be noted that this part of the algorithm only applies to "top level"
>> directories (those below the root inode, or with the EXT4_INODE_TOPDIR flag
>> set), so searching a bit longer for a good group is not a bad idea in this case.
>>
>> Cheers, Andreas.
>>
>>> Thanks & Regards,
>>> Lokesh
>>>
>>>
>>>
>>> On Thu, Dec 3, 2015 at 11:28 PM, Andreas Dilger <adilger@...ger.ca> wrote:
>>>> On Dec 3, 2015, at 01:07, lokesh jaliminche <lokesh.jaliminche@...il.com> wrote:
>>>>>
>>>>> I thought of giving more clarification on my question:
>>>>> why is the group search start random? We could also start the search
>>>>> for valid groups for inode allocation from the beginning. As this group
>>>>> search is random, the inode selection might end up at the end of the
>>>>> groups, which might affect IO performance.
>>>>
>>>> Starting the inode search at the beginning of the disk each time
>>>> means that inode allocation will be inefficient because it will search
>>>> over groups that are mostly or entirely full already.
>>>>
>>>> Allocating the new directory in a semi-random group, one that is
>>>> relatively unused, ensures that new
>>>> inode and block allocations are relatively efficient afterward.
>>>>
>>>> Cheers, Andreas
>>>>
>>>>> On Thu, Dec 3, 2015 at 1:14 PM, lokesh jaliminche
>>>>> <lokesh.jaliminche@...il.com> wrote:
>>>>>> Hello folks,
>>>>>> I am new to the ext4 code. I was going through the ext4 source
>>>>>> for inode allocation.
>>>>>> There is one thing that I did not understand about the selection of
>>>>>> groups for inode allocation. I came across this code snippet, which
>>>>>> is part of the find_group_orlov function. The question is: why is
>>>>>> the group search start random?
>>>>>>
>>>>>> Code snippet:
>>>>>> ==========
>>>>>> 		if (qstr) {
>>>>>> 			hinfo.hash_version = LDISKFS_DX_HASH_HALF_MD4;
>>>>>> 			hinfo.seed = sbi->s_hash_seed;
>>>>>> 			ldiskfsfs_dirhash(qstr->name, qstr->len, &hinfo);
>>>>>> 			grp = hinfo.hash;
>>>>>> 		} else
>>>>>> 			get_random_bytes(&grp, sizeof(grp));
>>>>>> 		parent_group = (unsigned)grp % ngroups;
>>>>>> 		for (i = 0; i < ngroups; i++) {
>>>>>> 			g = (parent_group + i) % ngroups;
>>>>>> 			get_orlov_stats(sb, g, flex_size, &stats);
>>>>>> 			if (!stats.free_inodes)
>>>>>> 				continue;
>>>>>> 			if (stats.used_dirs >= best_ndir)
>>>>>> 				continue;
>>>>>> 			if (stats.free_inodes < avefreei)
>>>>>> 				continue;
>>>>>> 			if (stats.free_blocks < avefreeb)
>>>>>> 				continue;
>>>>>> 			grp = g;
>>>>>> 			ret = 0;
>>>>>> 			best_ndir = stats.used_dirs;
>>>>>> 		}
>>>>>>
>>>>>> Thanks & Regards,
>>>>>> Lokesh
>>
>>
>> Cheers, Andreas
Cheers, Andreas