Message-Id: <F90B563C-B06A-49F3-B276-A7A28D9E6332@dilger.ca>
Date: Thu, 11 Feb 2021 03:30:31 -0700
From: Andreas Dilger <adilger@...ger.ca>
To: Harshad Shirwadkar <harshadshirwadkar@...il.com>
Cc: linux-ext4@...r.kernel.org, tytso@....edu,
Alex Zhuravlev <bzzz@...mcloud.com>,
artem.blagodarenko@...il.com, Shuichi Ihara <sihara@....com>
Subject: Re: [PATCH v2 4/5] ext4: improve cr 0 / cr 1 group scanning
On Feb 9, 2021, at 1:28 PM, Harshad Shirwadkar <harshadshirwadkar@...il.com> wrote:
>
> Instead of traversing through groups linearly, scan groups in specific
> orders at cr 0 and cr 1. At cr 0, we want to find groups that have the
> largest free order >= the order of the request. So, with this patch,
> we maintain lists for each possible order and insert each group into a
> list based on the largest free order in its buddy bitmap. During cr 0
> allocation, we traverse these lists in the increasing order of largest
> free orders. This allows us to find a group with the best available cr
> 0 match in constant time. If nothing can be found, we fall back to cr 1
> immediately.
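(To restate the cr 0 idea in code -- illustrative only; MB_NUM_ORDERS_MAX
and pick_group_cr0() are made-up stand-ins for the patch's per-order
lists, not the patch code itself:)

	/* One list per buddy order; each group sits on the list
	 * matching the order of its largest free extent. */
	struct list_head largest_free_orders[MB_NUM_ORDERS_MAX];

	static struct ext4_group_info *pick_group_cr0(int order, int num_orders)
	{
		int i;

		/* Jump to the first non-empty list at or above the
		 * requested order; each probe is O(1). */
		for (i = order; i < num_orders; i++) {
			if (!list_empty(&largest_free_orders[i]))
				return list_first_entry(&largest_free_orders[i],
						struct ext4_group_info,
						bb_largest_free_order_node);
		}
		return NULL;	/* nothing suitable: fall back to cr 1 */
	}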
>
> At cr 1, the story is slightly different. We want to traverse groups
> in order of increasing average fragment size. For that, we maintain
> an rb tree of groupinfos, sorted by average fragment size. Instead of
> traversing linearly, at cr 1 we walk this tree in order of increasing
> average fragment size, starting at the most optimal group. This brings
> cr 1 search complexity down to log(num groups).
>
> For cr >= 2, we just perform the linear search as before. Also, in
> case of lock contention, we intermittently fall back to linear search
> even in the cr 0 and cr 1 cases. This allows the allocation path to
> make progress even under high contention.
>
> There is an opportunity to do optimization at CR2 too. That's because
> at CR2 we only consider groups where bb_free counter (number of free
> blocks) is greater than the request extent size. That's left as future
> work.
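(For context, the cr 2 check in ext4_mb_good_group() is essentially just
a comparison of the group's free block count against the request length,
roughly:)

	case 2:
		/* "free" is the group's bb_free count;
		 * fe_len is the requested extent length */
		if (free >= ac->ac_g_ex.fe_len)
			return true;
		break;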
>
> All the changes introduced in this patch are protected under a new
> mount option "mb_optimize_scan".
>
> Signed-off-by: Harshad Shirwadkar <harshadshirwadkar@...il.com>
> ---
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index b7f25120547d..63562f5f42f1 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
>
> +/*
> + * Choose next group by traversing largest_free_order lists. Return 0 if next
> + * group was selected optimally. Return 1 if next group was not selected
> + * optimally. Updates *new_cr if cr level needs an update.
> + */
> +static int ext4_mb_choose_next_group_cr0(struct ext4_allocation_context *ac,
> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> +{
> +	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
> +	struct ext4_group_info *iter, *grp;
> +	int i;
> +
> + for (i = ac->ac_2order; i < MB_NUM_ORDERS(ac->ac_sb); i++) {
> + if (list_empty(&sbi->s_mb_largest_free_orders[i]))
> + continue;
> + read_lock(&sbi->s_mb_largest_free_orders_locks[i]);
> + if (list_empty(&sbi->s_mb_largest_free_orders[i])) {
> + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> + continue;
> + }
> + grp = NULL;
> + list_for_each_entry(iter, &sbi->s_mb_largest_free_orders[i],
> + bb_largest_free_order_node) {
> + /*
> + * Perform this check without a lock, once we lock
> + * the group, we'll perform this check again.
> + */
This comment is no longer correct, since the check is now done while
holding the list's read lock (even though the group itself is still
unlocked).
> + if (likely(ext4_mb_good_group(ac, iter->bb_group, 0))) {
> + grp = iter;
> + break;
> + }
> + }
> + read_unlock(&sbi->s_mb_largest_free_orders_locks[i]);
> + if (grp)
> + break;
> + }
> +	if (!grp) {
> +		/* Nothing suitable at cr 0: bump the cr level */
> +		*new_cr = 1;
> +		return 1;
> +	}
> +	*group = grp->bb_group;
> +	return 0;
> +}
> +
> +/*
> + * ext4_mb_choose_next_group: choose next group for allocation.
> + *
> + * @ac Allocation Context
> + * @new_cr This is an output parameter. If there is no good group available
> + * at current CR level, this field is updated to indicate the new cr
> + * level that should be used.
> + * @group This is an input / output parameter. As an input it indicates the last
> + * group used for allocation. As output, this field indicates the
> + * next group that should be used.
> + * @ngroups Total number of groups
> + */
> +static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> + int *new_cr, ext4_group_t *group, ext4_group_t ngroups)
> +{
> + int ret;
> +
> + *new_cr = ac->ac_criteria;
> +
> + if (!test_opt2(ac->ac_sb, MB_OPTIMIZE_SCAN) ||
> + *new_cr >= 2 ||
> + !ext4_test_inode_flag(ac->ac_inode, EXT4_INODE_EXTENTS))
> + goto inc_and_return;
I still think it would be beneficial to check if the next group is good
before going to the list/tree. That will reduce lock contention, and
will also avoid needless seeking between groups if possible.
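Something along these lines, perhaps (untested sketch; the placement and
details are just to illustrate the idea):

	ext4_group_t next = *group + 1 < ngroups ? *group + 1 : 0;

	/* If the linear-next group is already good enough, take it
	 * and avoid touching the shared list/tree altogether. */
	if (ext4_mb_good_group(ac, next, *new_cr)) {
		*group = next;
		return;
	}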
> + if (*new_cr == 0) {
> + ret = ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
> + if (ret)
> + goto inc_and_return;
> + }
> + if (*new_cr == 1) {
> + ret = ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
> + if (ret)
> + goto inc_and_return;
> + }
> + return;
> +
> +inc_and_return:
> + /*
> + * Artificially restricted ngroups for non-extent
> + * files makes group > ngroups possible on first loop.
> + */
> + *group = *group + 1;
> + if (*group >= ngroups)
> + *group = 0;
> +}
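(For context, the intended call site is the group scan loop in
ext4_mb_regular_allocator(); simplified, the shape is roughly:)

	for (i = 0, new_cr = cr; i < ngroups; i++,
	     ext4_mb_choose_next_group(ac, &new_cr, &group, ngroups)) {
		if (new_cr != cr) {
			/* The selector bumped the cr level; restart
			 * the scan at the new level. */
			cr = new_cr;
			goto repeat;
		}
		/* ... load the buddy and try to allocate from "group" ... */
	}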
> +
> /*
> * Cache the order of the largest free extent we have available in this block
> * group.
> @@ -751,18 +1001,32 @@ static void ext4_mb_mark_free_simple(struct super_block *sb,
> static void
> mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
> {
> + struct ext4_sb_info *sbi = EXT4_SB(sb);
> int i;
> - int bits;
>
> + if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
> + write_lock(&sbi->s_mb_largest_free_orders_locks[
> + grp->bb_largest_free_order]);
> + list_del_init(&grp->bb_largest_free_order_node);
> + write_unlock(&sbi->s_mb_largest_free_orders_locks[
> + grp->bb_largest_free_order]);
> + }
> grp->bb_largest_free_order = -1; /* uninit */
>
> - bits = MB_NUM_ORDERS(sb) - 1;
> - for (i = bits; i >= 0; i--) {
> + for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
> if (grp->bb_counters[i] > 0) {
> grp->bb_largest_free_order = i;
> break;
> }
> }
> + if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
> + write_lock(&sbi->s_mb_largest_free_orders_locks[
> + grp->bb_largest_free_order]);
> + list_add_tail(&grp->bb_largest_free_order_node,
> + &sbi->s_mb_largest_free_orders[grp->bb_largest_free_order]);
> + write_unlock(&sbi->s_mb_largest_free_orders_locks[
> + grp->bb_largest_free_order]);
> + }
> }
It would be more efficient for this function to do the list move under a
single write lock when the order doesn't change. The order loop would
just save the largest free order, then grab the write lock, do the
list_del(), set bb_largest_free_order, and list_add_tail():
static void
mb_set_largest_free_order(struct super_block *sb, struct ext4_group_info *grp)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);
	int i, new_order = -1;

	for (i = MB_NUM_ORDERS(sb) - 1; i >= 0; i--) {
		if (grp->bb_counters[i] > 0) {
			new_order = i;
			break;
		}
	}
	if (test_opt2(sb, MB_OPTIMIZE_SCAN) && grp->bb_largest_free_order >= 0) {
		write_lock(&sbi->s_mb_largest_free_orders_locks[
				grp->bb_largest_free_order]);
		list_del_init(&grp->bb_largest_free_order_node);
		/* Only drop and re-take the lock if the order changed */
		if (new_order != grp->bb_largest_free_order) {
			write_unlock(&sbi->s_mb_largest_free_orders_locks[
					grp->bb_largest_free_order]);
			grp->bb_largest_free_order = new_order;
			/* No free extents left, so stay off the lists */
			if (new_order < 0)
				return;
			write_lock(&sbi->s_mb_largest_free_orders_locks[
					grp->bb_largest_free_order]);
		}
		list_add_tail(&grp->bb_largest_free_order_node,
			      &sbi->s_mb_largest_free_orders[
					grp->bb_largest_free_order]);
		write_unlock(&sbi->s_mb_largest_free_orders_locks[
				grp->bb_largest_free_order]);
	} else {
		grp->bb_largest_free_order = new_order;
		/* Previously untracked group: it was not on any list yet */
		if (test_opt2(sb, MB_OPTIMIZE_SCAN) && new_order >= 0) {
			write_lock(&sbi->s_mb_largest_free_orders_locks[new_order]);
			list_add_tail(&grp->bb_largest_free_order_node,
				      &sbi->s_mb_largest_free_orders[new_order]);
			write_unlock(&sbi->s_mb_largest_free_orders_locks[new_order]);
		}
	}
}
Cheers, Andreas