Date:   Tue, 16 Feb 2021 15:36:23 -0700
From:   Andreas Dilger <>
To:     Благодаренко Артём 
Cc:     Harshad Shirwadkar <>, "Theodore Y. Ts'o" <>,
        Alex Zhuravlev <>,
        Shuichi Ihara <>
Subject: Re: [PATCH v2 4/5] ext4: improve cr 0 / cr 1 group scanning

On Feb 16, 2021, at 12:39 PM, Благодаренко Артём <> wrote:
> Thanks for this useful optimisation.
> Some comments below.
>> On 9 Feb 2021, at 23:28, Harshad Shirwadkar <> wrote:
>> Instead of traversing through groups linearly, scan groups in specific
>> orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
>> largest free order >= the order of the request. So, with this patch,
>> we maintain lists for each possible order and insert each group into a
>> list based on the largest free order in its buddy bitmap. During cr 0
>> allocation, we traverse these lists in the increasing order of largest
>> free orders. This allows us to find a group with the best available cr
>> 0 match in constant time. If nothing can be found, we fall back to cr 1
>> immediately.
>> At cr 1, the story is slightly different. We want to traverse in the
>> order of increasing average fragment size. For cr 1, we maintain an rb
>> tree of groupinfos which is sorted by average fragment size. Instead
>> of traversing linearly, at cr 1, we traverse in the order of increasing
>> average fragment size, starting at the most optimal group. This brings
>> down cr 1 search complexity to log(num groups).
>> For cr >= 2, we just perform the linear search as before. Also, in
>> case of lock contention, we intermittently fall back to linear search
>> even in the cr 0 and cr 1 cases. This allows us to make progress on
>> the allocation path even in case of high contention.
>> There is an opportunity to do optimization at CR2 too. That's because
>> at CR2 we only consider groups where bb_free counter (number of free
>> blocks) is greater than the request extent size. That's left as future
>> work.
>> +static int
>> +ext4_mb_avg_fragment_size_cmp(struct rb_node *rb1, struct rb_node *rb2)
>> +{
>> +	struct ext4_group_info *grp1 = rb_entry(rb1,
>> +						struct ext4_group_info,
>> +						bb_avg_fragment_size_rb);
>> +	struct ext4_group_info *grp2 = rb_entry(rb2,
>> +						struct ext4_group_info,
>> +						bb_avg_fragment_size_rb);
>> +	int num_frags_1, num_frags_2;
>> +
>> +	num_frags_1 = grp1->bb_fragments ?
>> +		grp1->bb_free / grp1->bb_fragments : 0;
>> +	num_frags_2 = grp2->bb_fragments ?
>> +		grp2->bb_free / grp2->bb_fragments : 0;
>> +
>> +	return (num_frags_1 < num_frags_2);
>> +}
>> +
>> +/*
>> + * Reinsert grpinfo into the avg_fragment_size tree with new average
>> + * fragment size.
>> + */
> In the worst case this walks all ngroups list elements, and it runs for
> every mb_free_blocks() and mb_mark_used(), which are quite frequently
> executed. If a doubly-linked list were used for avg_fragments, this
> function could apply the change without iterating through the list:
> 1. Compare with the previous element; if smaller, swap.
> 2. Compare with the next element; if greater, swap.
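For illustration, the two neighbor checks could look something like this on a plain doubly-linked list (a minimal sketch with invented names, not the actual ext4 structures; note it only restores ordering when an update moves a group by a single position, so a larger jump would need the checks repeated):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for a group sorted by average fragment size. */
struct grp {
	struct grp *prev, *next;
	int avg_frag;		/* e.g. bb_free / bb_fragments */
};

/* Swap adjacent nodes a and b, where a->next == b. */
static void swap_adjacent(struct grp *a, struct grp *b)
{
	struct grp *p = a->prev, *n = b->next;

	if (p)
		p->next = b;
	b->prev = p;
	b->next = a;
	a->prev = b;
	a->next = n;
	if (n)
		n->prev = a;
}

/* After g->avg_frag changes, restore local ordering with at most one
 * neighbor swap in each direction, instead of walking the whole list. */
static void reposition(struct grp *g)
{
	if (g->prev && g->avg_frag < g->prev->avg_frag)
		swap_adjacent(g->prev, g);
	else if (g->next && g->avg_frag > g->next->avg_frag)
		swap_adjacent(g, g->next);
}
```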

I was wondering about the cost of the list/tree maintenance as well,
especially since there was a post from "kernel test robot" that this
patch introduced a performance regression.

I think Artem's proposal above would improve the tree insertion/removal
overhead, since it may be that a group does not move far in the tree
on most updates?

It would also make sense for totally full groups to be kept out of the
rb tree entirely, since they do not provide any value in that case (the
full groups will never be selected for allocations), and they just add
to the tree depth and potentially cause an imbalance if there are many
of them.  That also has the benefit of the rbtree efficiency *improving*
as the filesystem gets more full, which is right when it is most needed.

It might also make sense to keep totally empty groups out of the rbtree,
since they should always be found in cr0 already if the allocation is
large enough to fill the whole group?  Having a smaller rbtree makes
every insertion/removal that much more efficient.

Those groups will naturally be re-added into the rbtree when they have
blocks freed or allocated, so not much added complexity.
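As a rough illustration of the exclusion rule (invented names and fields, not the actual ext4 code), the membership test could be a simple guard applied wherever a group is (re)inserted into the tree:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: decide whether a group belongs in the
 * avg_fragment_size rbtree at all.  Totally full groups can never
 * satisfy an allocation, and totally empty groups are already found
 * at cr 0, so both can be left out to keep the tree small. */
struct group_stats {
	unsigned int bb_free;		 /* free clusters in the group */
	unsigned int clusters_per_group; /* total clusters in the group */
};

static bool belongs_in_frag_tree(const struct group_stats *g)
{
	if (g->bb_free == 0)			 /* totally full */
		return false;
	if (g->bb_free == g->clusters_per_group) /* totally empty */
		return false;
	return true;
}
```

A group skipped here would be re-evaluated, and re-added if needed, the next time blocks are allocated or freed in it.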

Does it make sense to disable "mb_optimize_scan" if filesystems are
smaller than a certain threshold?  Clearly, if there are only 1-2
groups, maintaining a list and rbtree has no real value, and with
only a handful of groups (< 16?) linear searching is probably as fast
or faster than maintaining the two data structures.  That is similar
to e.g. bubble sort vs. quicksort, where it is more efficient to sort
a list of ~5-8 entries with a dumb/fast algorithm instead of a complex
algorithm that is more efficient at larger scales.  That would also
(likely) quiet the kernel test robot, if we think that its testing is
not representative of real-world usage.
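The threshold idea could be as simple as the sketch below (the name and the value 16 are guesses from this discussion, not a measured or upstream value):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical sketch: only use the optimized-scan data structures
 * when the filesystem has enough block groups for them to pay off;
 * otherwise fall back to the plain linear scan. */
#define MB_OPTIMIZE_SCAN_MIN_GROUPS 16	/* illustrative threshold */

static bool use_optimized_scan(unsigned int ngroups, bool opt_mounted)
{
	return opt_mounted && ngroups >= MB_OPTIMIZE_SCAN_MIN_GROUPS;
}
```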

> On Feb 11, 2021, at 3:30 AM, Andreas Dilger <> wrote:

>> This function would be more efficient to do the list move under a single
>> write lock if the order doesn't change.  The order loop would just
>> save the largest free order, then grab the write lock, do the list_del(),
>> set bb_largest_free_order, and list_add_tail():
>> mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
>> {
>> 	struct ext4_sb_info *sbi = EXT4_SB(sb);
>> 	int i, new_order = -1;
>> 	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
>> 		if (grp->bb_counters[i] > 0) {
>> 			new_order = i;
>> 			break;
>> 		}
>> 	}
>> 	if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
>> 		write_lock(&sbi->s_mb_largest_free_orders_locks[
>> 					      grp->bb_largest_free_order]);
>> 		list_del_init(&grp->bb_largest_free_order_node);
>> 		if (new_order != grp->bb_largest_free_order) {
>> 			write_unlock(&sbi->s_mb_largest_free_orders_locks[
>> 					      grp->bb_largest_free_order]);
>> 			grp->bb_largest_free_order = new_order;
>> 			write_lock(&sbi->s_mb_largest_free_orders_locks[
>> 					      grp->bb_largest_free_order]);
>> 		}
>> 		list_add_tail(&grp->bb_largest_free_order_node,
>> 		      &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
>> 		write_unlock(&sbi->s_mb_largest_free_orders_locks[
>> 					      grp->bb_largest_free_order]);
>> 	}
>> }

In looking at my previous comment, I wonder if we could further reduce
the list locking here by not moving an entry to the end of the *same*
list if it is not currently at the head?  Since it was (presumably)
just moved to the end of the list by a recent allocation, it is very
likely that some other group will be chosen from the list head, so
moving within the list to maintain strict LRU is probably just extra
locking overhead that can be avoided...
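The decision could be sketched as a cheap, lockless-looking predicate evaluated before taking the write lock (a minimal stand-in for the kernel's struct list_head; names are illustrative, not the actual patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal doubly-linked list node, standing in for struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Only take the write lock and re-queue when it is worth it: either
 * the group's largest free order changed (it must move to another
 * list), or it sits at the head of its current list, where it would
 * otherwise be re-picked first.  Mid-list entries stay put, giving
 * up strict LRU to avoid the locking overhead. */
static bool should_requeue(struct list_head *node, struct list_head *head,
			   int old_order, int new_order)
{
	if (new_order != old_order)
		return true;		/* moves to the list for new_order */
	return head->next == node;	/* head entry: move to the tail */
}
```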

Also, it isn't clear if *freeing* blocks from a group should move it
to the end of the same list, or just leave it as-is?  If there are
more frees from the group, it is likely to be added to a new list soon,
and if there are no more frees, then it could stay in the same order.

Cheers, Andreas

