[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <ZRWXyYhk+Im5cz4E@li-bb2b2a4c-3307-11b2-a85c-8fa5c3a69313.ibm.com>
Date: Thu, 28 Sep 2023 20:42:09 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Ritesh Harjani <ritesh.list@...il.com>
Cc: Andreas Dilger <adilger@...ger.ca>, Bobi Jam <bobijam@...mail.com>,
Ext4 Developers List <linux-ext4@...r.kernel.org>
Subject: Re: [PATCH v3] ext4: optimize metadata allocation for hybrid LUNs
On Wed, Sep 27, 2023 at 08:18:28AM +0530, Ritesh Harjani wrote:
> Andreas Dilger <adilger@...ger.ca> writes:
>
> > On Sep 25, 2023, at 9:35 PM, Ritesh Harjani (IBM) <ritesh.list@...il.com> wrote:
> >>
> >> Andreas Dilger <adilger@...ger.ca> writes:
> >>
> >>> On Sep 19, 2023, at 11:39 PM, Ritesh Harjani (IBM) <ritesh.list@...il.com> wrote:
> >>>>
> >>>> Bobi Jam <bobijam@...mail.com> writes:
> >>>>
> >>>>> Non-metadata block allocation does not allocate from the IOPS groups
> >>>>> if non-IOPS groups are not used up.
> >>>>
> >>>>> Add for mke2fs an option to mark which blocks are in the IOPS region
> >>>>> of storage at format time:
> >>>>>
> >>>>> -E iops=0-1024G,4096-8192G
> >>>>>
> >>>>> so the ext4 mballoc code can then use the EXT4_BG_IOPS flag in the
> >>>>> group descriptors to decide which groups to allocate dynamic
> >>>>> filesystem metadata.
> >>>>>
> >>>>> Signed-off-by: Bobi Jam <bobijam@...mail.com
> >>>>>
> >>>>> --
> >>>>> v2->v3: add sysfs mb_enable_iops_data to enable data block allocation
> >>>>> from IOPS groups.
> >>>>> v1->v2: for metadata block allocation, search in IOPS list then normal
> >>>>> list.
> >>>>> ---
> >>>>>
> >>
> >>>>> @@ -1009,11 +1108,37 @@ static void ext4_mb_choose_next_group(struct ext4_allocation_context *ac,
> >>>>> return;
> >>>>> }
> >>>>>
> >>>>> + if (alloc_metadata && sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS) {
> >>>>> + if (*new_cr == 0)
> >>>>> + ret = ext4_mb_choose_next_iops_group_cr0(ac, group);
> >>>>> + if (!ret && *new_cr < 2)
> >>>>> + ret = ext4_mb_choose_next_iops_group_cr1(ac, group);
> >>>
> >>> It looks like this patch is a bit stale since Ojaswin's renaming of the
> >>> cr0/cr1 phases to "p2_aligned" and "goal_fast" and "best_avail" names.
> >>>
> >>
> >> Yup. We should rebase our development effort to latest tree.
> >>
> >>>> This is a bit confusing here. Say if *new_cr = 0 fails, then we return
> >>>> ret = false and fallback to choosing xx_iops_group_cr1(). And say if we
> >>>> were able to find a group which satisfies this allocation request we
> >>>> return. But the caller never knows that we allocated using cr1 and not
> >>>> cr0. Because we never updated *new_cr inside xx_iops_group_crX()
> >>>
> >>> I guess it is a bit messy, since updating new_cr might interfere with the
> >>> use of new_cr in the fallthrough to the non-IOPS groups below? In the
> >>> "1% IOPS groups" case, doing this extra scan for metadata blocks should
> >>> be very fast, since the metadata allocations are almost always one block
> >>> (excluding large xattrs), so the only time this would fail is if no IOPS
> >>> blocks are free, in which case it is fast since the group lists are empty.
> >>>
> >>
> >> What I was suggesting was we never update *new_cr value when we were
> >> able to find a suitable group. That will confuse the two for loops we
> >> have in the caller. We might as well just update the *new_cr value once
> >> we have identified a suitable group in cr0 or cr1 before returning.
> >>
> >>> We _could_ have a separate (in effect) cr0_3 and cr0_6 phases before the
> >>> non-IOPS group allocation starts to be able to distinguish these cases
> >>> (i.e. skip IOPS group scans if they are full) and the fallthrough search
> >>> is also having trouble to find a single free block for the metadata, but
> >>> I think that is pretty unlikely.
> >
> > I'm not clear which option you prefer here?
> > - update *new_cr based on the scan in the IOPS groups (in which case the
> > fallback to non-IOPS groups would start at a higher CR than necessary)
>
> This obviously won't help, since it will even further slowdown the metdata
> allocations when the iops group becomes full.
>
>
> > - add new phases *before* CR_POWER2_ALIGNED, e.g. "CR_IOPS_POWER2_ALIGNED"
> > "CR_IOPS_CR_GOAL_LEN_FAST" and "CR_IOPS_CR_GOAL_LEN_SLOW" to do either
> > a fast scan or a slow scan on the IOPS groups, and then fall back to
> > non-IOPS groups? They would be skipped if no IOPS groups exist.
> >
> > The second option allows preserving the CR value across the loops, in case
> > the group returned is not suitable for some reason, without confusing whether
> > the lookup is being done for IOPS groups or not. Also, it makes sense to
> > have a "SLOW" pass on the IOPS groups, instead of just "FAST", to ensure
> > that all of the IOPS groups have been tried. This should be very rare
> > since most allocations (excluding xattrs) are only one block long.
>
> So how about we add a scan_state to allocation criteria for iops v/s
> non-iops scanning. This way we don't add any new crs, but mballoc now
> does a stateful scan rather than stateless scan being done today.
> (because we now have two different type of groups to select, iops v/s
> non-iops).
>
> We can add a function like....
> ac->scan_state = ext4_mb_get_scan_state_policy()
>
> ...before starting the scan.
>
> For metadata allocations it will return IOPS_SCAN and for data
> allocations it can return NON_IOPS_SCAN (more policy decisions can be
> added later). This way we don't have to add extra CRs to the enum
> criteria. The allocation still happens using the same existing
> allocation criteria. If say the allocation using IOPS_SCAN for metadata
> fails, we can switch to NON_IOPS_SCAN & restart the scan from cr0 again.
>
> We might also need to maintain some extra state or variable which tells us
> which previous scan_state failed to find a suitable group. But I
> will leave that as implementation details. It might be useful for data
> allocations where we might have a policy to start with either IOPS_SCAN
> or NON_IOPS_SCAN first.
>
> I believe this will be a simpler change rather than adding more crs.
> Problem in adding more crs at the beginning could be when we have to
> roll over from NON_IOPS criteria to IOPS allocation criteria.
>
> Thoughts?
>
Hi Ritesh, Andreas.
So I'm not sure if I'm convinced with the way we are handling CRs in
current implementation. This changes the behavior of how other
ext4_mb_choose_*_cr0/1() functions work that is by updating the cr if we fail
to find a suitable group. The new functions instead return true or false
without updating the CR which, as Ritesh pointed out, leaves some
unhandled edge cases that could cause silent behavior changes.
I like the original way of handling CRs where the logic to update
the CR is mostly in the ext4_mb_choose_*_cr0/1() functions and I'd rather not
move it in the ext4_mb_choose_next_group() like in this patch.
That being said, @Andreas, my thoughts on the 2 ways you've proposed:
> > - update *new_cr based on the scan in the IOPS groups (in which case the
> > fallback to non-IOPS groups would start at a higher CR than necessary)
>
1 So the way I see it, each CR represents the algorithm
to find a good group rather than the data structures we use for them.
Hence, I feel that since the algo of CR0/1 remains the same whether its
IOPS or not, we should not add a new CR for IOPS.
> > - update *new_cr based on the scan in the IOPS groups (in which case the
> > fallback to non-IOPS groups would start at a higher CR than necessary)
2. I would actually prefer this way of doing it. I think this is also
somewhat similar to how we were doing it in PATCH v1. We should keep
using the existing ext4_mb_choose_*_cr0/1() functions but update the CR
based on if its a metadata allocation or not.
I actually like Ritesh's proposal of using a scan state which will then
lead to our psuedo-code looking something like:
ext4_mb_choose_next_group_cr0()
new_cr = 1
ext4_mb_choose_next_group_cr1()
new_cr = 2
/* cr 2 search in outer loop */
if (not found in cr 2) {
if (ac->scan_state == IOPS_SCAN) {
ac->scan_state == NON_IOPS_SCAN;
new_cr = 0;
} else {
ac->scan_state == IOPS_SCAN;
new_cr = 0;
}
goto repeat;
}
With the patch I was working on to shift CR2 to order list [1] and then
have a ext4_mb_choose_next_group_cr2() functions instead of using the
linear loop, we can then maintain an IOPS lists like we do for cr0/1
here and further simplify the above psuedo code for CR2.
[1]
https://lore.kernel.org/linux-ext4/cover.1693911548.git.ojaswin@linux.ibm.com/
(In the above discussion i used the older cr0/1/2 notation for
simplicity)
Also, I didn't completely understand this particular statement:
> in which case the fallback to non-IOPS groups would start at a higher
> CR than necessary
I think we can always reset the cr to 0 when we reach the end in IOPS
allocation right like in the psuedo code above, or am I missing
something?
Regards,
ojaswin
> >
> > Ojaswin, do you have any input here? You've been doing somework on the
> > mballoc code recently, and it would be good to get this aligned with what
> > you are doing/planning.
> >
> >>>>> if (*new_cr == 0) {
> >>>>> ext4_mb_choose_next_group_cr0(ac, new_cr, group, ngroups);
> >>>>> } else if (*new_cr == 1) {
> >>>>> ext4_mb_choose_next_group_cr1(ac, new_cr, group, ngroups);
> >>>>> } else {
> >>>>> + /*
> >>>>> + * Cannot get data group from slow storage, try IOPS storage
> >>>>> + */
> >>>>> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
> >>>>> + !alloc_metadata && sbi->s_mb_enable_iops_data &&
> >>>>> + *new_cr == 3) {
> >>>>> + if (ac->ac_2order)
> >>>>> + ret = ext4_mb_choose_next_iops_group_cr0(ac,
> >>>>> + group);
> >>>>> + if (!ret)
> >>>>> + ext4_mb_choose_next_iops_group_cr1(ac, group);
> >>>>> + }
> >>>>
> >>>> We might never come here in this else case because
> >>>> should_optimize_scan() which we check in the beginning of this function
> >>>> will return 0 and we will chose a next linear group for CR >= 2.
> >>>
> >>> Hmm, you are right. Just off the bottom of this hunk is a "WARN_ON(1)"
> >>> that this code block should never be entered.
> >>
> >> right.
> >>
> >>>
> >>> Really, the fallback to IOPS groups for regular files should only happen
> >>> in case of if *new_cr >= CR_GOAL_ANY_FREE. We don't want "normal" block
> >>> allocation to fill up the IOPS groups just because the filesystem is
> >>> fragmented and low on space, but only if out of non-IOPS space.
> >>>
> >>
> >> Sure, I have added some comments later on this policy part...
> >>
> >>>>>
> >>>>> @@ -2498,6 +2629,10 @@ static int ext4_mb_good_group_nolock(struct ext4_allocation_context *ac,
> >>>>> goto out;
> >>>>> if (unlikely(EXT4_MB_GRP_BBITMAP_CORRUPT(grp)))
> >>>>> goto out;
> >>>>> + if (sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS &&
> >>>>> + (ac->ac_flags & EXT4_MB_HINT_DATA) && EXT4_MB_GRP_TEST_IOPS(grp) &&
> >>>>> + !sbi->s_mb_enable_iops_data)
> >>>>> + goto out;
> >>>>
> >>>> since we already have a grp information here. Then checking for s_flags
> >>>> and is redundant here right?
> >>>
> >>> This is intended to stop regular data allocations in IOPS groups that are
> >>> found by next_linear_group().
> >>
> >> What I meant is EXT4_MB_GRP_TEST_IOPS(grp), will only be true when we
> >> have sbi->s_es->s_flags & EXT2_FLAGS_HAS_IOPS as true right?
> >> So do we still need to keep both conditions here?
> >
> > Well, EXT2_FLAGS_HAS_IOPS determines whether this functionality is enabled
> > or not, while the GRP_TEST_IOPS check is for the individual group. So if
> > the feature is totally disabled (no EXT2_FLAGS_HAS_IOPS) then the per-group
> > bit should be ignored.
> >
>
> Ok, thanks for clarification.
>
> >>> With the change to allow regular data to be allocated in IOPS groups,
> >>> there might need to be an extra check added here to see what allocation
> >>> phase this is. Should we add *higher* CR_ phases above CR_ANY_FREE to
> >>> allow distinguishing between IOPS->regular fallback and regular->IOPS
> >>> fallback?
> >>>
> >>>
> >>> It seems like most of the complexity/issues here have crept in since the
> >>> addition of the fallback for regular data allocations in IOPS groups...
> >>> I'm not sure if we want to defer that functionality to a separate patch,
> >>> or if you have any suggestions on how to clarify this without adding a
> >>> lot of complexity?
> >>
> >> I agree that the separation is not clear. I think it would have been
> >> better if we would have split that functionality in 2 separate patches.
> >> The 1st patch just adds the functionality that you intended i.e.
> >>
> >> 1. metadata allocations should happen via IOPS group and only if there
> >> is no space left in IOPS group it will fallback to non-IOPS group.
> >> This 1st patch also have data allocations coming only from non-iops group.
> >>
> >> and the second patch can have details of...
> >> 2. adding a knob which can allow users to fallback data allocations to IOPS group too.
> >
> > Sure, I would be happy with that. My main goal is to reserve these flags
> > and get this feature working in a basic fashion, and then more elaborate
> > policy decisions can be added once there is a demand for it.
> >
>
> Sure make sense.
>
> >> If you think you would like to defer the second patch to later to avoid
> >> the complexity, I am fine with that too. The reason is we should still
> >> think upon what should be the fallback critera for that. Should we do it
> >> when we absolutely have no space in non-IOPS group (cr >= CR_ANY_FREE)
> >> or is it ok to fallback even earlier. I guess it will also depend upon
> >> the information of how many groups are IOPS v/s non-IOPS.
> >>
> >> I don't think we are keeping that information anywhere on disk right?
> >> (no. of IOPS v/s non-IOPS groups). That means we might have to do that
> >> at runtime. Once we have that information, the filesystem can better
> >> decide on when should the fallback happen.
> >
> > Mount already scans all of the groups at mount to set the IOPS flags in
> > the in-memory group_info, so the count of IOPS groups vs. non-IOPS could
> > be easily determined, if there is a use for this.
> >
>
> Yup.
>
>
> >> So I agree, we need to more discussion and think it through. I guess Ted
> >> was also suggesting the same on the call. Feel free to defer the
> >> fallback of data allocations to non-IOPS group for a later time (If
> >> we don't have a strong objection from others on this).
> >
> > Great, thanks for your review and feedback.
>
> Thanks for helping with the queries.
>
> -ritesh
Powered by blists - more mailing lists