linux-kernel - Re: [PATCH 0/5] *** Introduce new space allocation algorithm ***

Message-ID: <CANubcdXv8rmRGERFDQUELes3W2s_LdvfCSrOuWK8ge=cdEhFYA@mail.gmail.com>
Date: Sun, 17 Nov 2024 09:34:53 +0800
From: Stephen Zhang <starzhangzsd@...il.com>
To: Dave Chinner <david@...morbit.com>
Cc: djwong@...nel.org, dchinner@...hat.com, leo.lilong@...wei.com, 
	wozizhi@...wei.com, osandov@...com, xiang@...nel.org, 
	zhangjiachen.jaycee@...edance.com, linux-xfs@...r.kernel.org, 
	linux-kernel@...r.kernel.org, zhangshida@...inos.cn
Subject: Re: [PATCH 0/5] *** Introduce new space allocation algorithm ***

Dave Chinner <david@...morbit.com> wrote on Mon, Nov 11, 2024 at 10:04:
>
> On Fri, Nov 08, 2024 at 09:34:17AM +0800, Stephen Zhang wrote:
> > Dave Chinner <david@...morbit.com> wrote on Mon, Nov 4, 2024 at 20:15:
> > > On Mon, Nov 04, 2024 at 05:25:38PM +0800, Stephen Zhang wrote:
> > > > Dave Chinner <david@...morbit.com> wrote on Mon, Nov 4, 2024 at 11:32:
> > > > > On Mon, Nov 04, 2024 at 09:44:34AM +0800, zhangshida wrote:
>
> [snip unnecessary stereotyping, accusations and repeated information]
>
> > > AFAICT, this "reserve AG space for inodes" behaviour that you are
> > > trying to achieve is effectively what the inode32 allocator already
> > > implements. By forcing inode allocation into the AGs below 1TB and
> > > preventing data from being allocated in those AGs until allocation
> > > in all the AGs above start failing, it effectively provides the same
> > > functionality but without the constraints of a global first fit
> > > allocation policy.
> > >
> > > We can do this with any AG by setting it up to prefer metadata,
> > > but given we already have the inode32 allocator we can run some
> > > tests to see if setting the metadata-preferred flag makes the
> > > existing allocation policies do what is needed.
> > >
> > > That is, mkfs a new 2TB filesystem with the same 344AG geometry as
> > > above, mount it with -o inode32 and run the workload that fragments
> > > all the free space. What we should see is that AGs in the upper TB
> > > of the filesystem should fill almost to full before any significant
> > > amount of allocation occurs in the AGs in the first TB of space.
>
> Have you performed this experiment yet?
>
> I did not ask it idly, and I certainly did not ask it with the intent
> that we might implement inode32 with AFs. It is fundamentally
> impossible to implement inode32 with the proposed AF feature.
>
> The inode32 policy -requires- top down data fill so that AG 0 is the
> *last to fill* with user data. The AF first-fit proposal guarantees
> bottom up fill where AG 0 is the *first to fill* with user data.
>
> For example:
>
> > So for the inode32 algorithm:
> > 1. I need to specify a preferred ag, like ag 0:
> > +------+------+------+------+
> > | ag 0 | ag 1 | ag 2 | ag 3 |
> > +------+------+------+------+
> > 2. Someday space will be used up to 100%, then we have to growfs up to ag 7:
> > +------+------+------+------+------+------+------+------+
> > | full | full | full | full | ag 4 | ag 5 | ag 6 | ag 7 |
> > +------+------+------+------+------+------+------+------+
> > 3. specify another ag for inodes again.
> > 4. repeat 1-3.
>
> Let's assume that AGs are 512GB each and so AGs 0 and 1 fill the
> entire lower 1TB of the filesystem. Hence if we get to all AGs full
> the entire inode32 inode allocation space is full.
>
> Even if we grow the filesystem at this point, we still *cannot*
> allocate more inodes in the inode32 space. That space (AGs 0-1) is
> full even after the growfs.  Hence we will still give ENOSPC, and
> that is -correct behaviour- because the inode32 policy requires this
> behaviour.
>
> IOWs, growfs and changing the AF bounds cannot fix ENOSPC on inode32
> when the inode space is exhausted. Only physically moving data out
> of the lower AGs can fix that problem...
>
> > For the AF algorithm:
> >     mount -o af1=1 $dev $mnt
> > and we are done.
> > |<-----+ af 0 +----->|<af 1>|
> > +------+------+------+------+
> > | ag 0 | ag 1 | ag 2 | ag 3 |
> > +------+------+------+------+
> > Because the af boundary is relative to ag_count, after growfs it will
> > become:
> > |<-----+ af 0 +--------------------------------->|<af 1>|
> > +------+------+------+------+------+------+------+------+
> > | full | full | full | full | ag 4 | ag 5 | ag 6 | ag 7 |
> > +------+------+------+------+------+------+------+------+
> > So just set it once, and run forever.
>
> That is actually the general solution to the original problem being
> reported. I realised this about half way through reading your
> original proposal. This is why I pointed out inode32 and the
> preferred metadata mechanism in the AG allocator policies.
>
> That is, a general solution should only require the highest AG
> to be marked as metadata preferred. Then -all- data allocation will
> skip over the highest AG until there is no space left in any of
> the lower AGs. This behaviour will be enforced by the existing AG
> iteration allocation algorithms without any change being needed.
>
> Then when we grow the fs, we set the new highest AG to be metadata
> preferred, and that space will now be reserved for inodes until all
> other space is consumed.
>
> Do you now understand why I asked you to test whether the inode32
> mount option kept the data out of the lower AGs until the higher AGs
> were completely filled? It's because I wanted confirmation that the
> metadata preferred flag would do what we need to implement a
> general solution for the problematic workload.
>

Hi, I have tested the inode32 mount option. To my surprise, inode32
(or the metadata-preferred mechanism; I will refer to both as inode32 for
the rest of this reply) does not implement the desired behavior that the
AF rule [1] does:
        Lower AFs/AGs will do anything they can for allocation before going
to HIGHER/RESERVED AFs/AGs. [1]

Whereas inode32 behaves like this:
        Lower AFs/AGs will NOT do everything they can for allocation before
going to HIGHER/RESERVED AFs/AGs.

To illustrate that, imagine that now AG 2 is badly fragmented and AG 0/1 are
reserved for inode32:
       +------------------------+
       |                        |
ag 0   |                        |
       +------------------------+
       |                        |
ag 1   |                        |
       +------------------------+
       |  1     1     1         |
       |  1        1  1         |
       |  1  1 4       1        |
ag 2   |           1     1      |
       |      1     1           |
       |    4  8          1     |
       +------------------------+
We want to allocate an extent of length 4, but AG 2 is so fragmented that
it has no contiguous free space of that size, so we have only two choices:
1. Break the 4 down into several smaller pieces so the allocation succeeds
   in AG 2.
2. Go to the reserved AGs 0/1 for the allocation.

But unlike the AF, inode32 will choose option 2...

To understand the reason for that, we must understand the general allocation
phases:
1. Best length fit. Find the best length for the current allocation in
   two loops.
    1.1. First loop with *_TRYLOCK flags.
    1.2. Second loop without *_TRYLOCK flags.
2. Low space algorithm. Break the allocation into small pieces and fit them
   into the free space one by one.
So the AF will do anything it can before going to higher AFs. *Anything*
means the allocation must go through the whole 1.1->1.2->2 sequence in the
current AF before moving on to the next AF.
But inode32 will only go through 1.1 and then go to the
reserved AG.
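
To make the difference concrete, here is a toy model of the two orderings.
It is only a sketch and not kernel code: struct toy_ag, the phase enum and
both toy_alloc_*() functions are hypothetical names made up for this mail,
"fits" is reduced to comparing against the longest free extent (phases
1.1/1.2) or the total free space (phase 2), and AGs are simply scanned
from index 0, which is not how the real allocator picks its start AG.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical phase labels following the numbering used in the text. */
enum toy_phase {
	TOY_BEST_FIT_TRYLOCK = 0,	/* 1.1 */
	TOY_BEST_FIT_BLOCKING,		/* 1.2 */
	TOY_LOW_SPACE,			/* 2   */
	TOY_PHASE_COUNT
};

/* Hypothetical, heavily simplified view of one AG. */
struct toy_ag {
	bool	reserved;	/* metadata preferred / belongs to a higher AF */
	size_t	longest_free;	/* longest contiguous free extent */
	size_t	total_free;	/* sum of all free extents */
};

/* Phases 1.1/1.2 need one contiguous extent; phase 2 may split the request. */
static bool toy_fits(const struct toy_ag *ag, enum toy_phase phase, size_t len)
{
	if (phase == TOY_LOW_SPACE)
		return ag->total_free >= len;
	return ag->longest_free >= len;
}

/* AF rule [1]: run 1.1->1.2->2 on unreserved AGs, only then on reserved ones. */
int toy_alloc_af(const struct toy_ag *ags, size_t n, size_t len)
{
	assert(ags != NULL);
	for (int pass = 0; pass < 2; pass++) {
		bool want_reserved = (pass == 1);

		for (int phase = 0; phase < TOY_PHASE_COUNT; phase++) {
			for (size_t i = 0; i < n; i++) {
				if (ags[i].reserved != want_reserved)
					continue;
				if (toy_fits(&ags[i], (enum toy_phase)phase, len))
					return (int)i;
			}
		}
	}
	return -1;	/* no space anywhere */
}

/* inode32-like: reserved AGs are skipped only during the trylock loop. */
int toy_alloc_inode32(const struct toy_ag *ags, size_t n, size_t len)
{
	assert(ags != NULL);
	for (int phase = 0; phase < TOY_PHASE_COUNT; phase++) {
		for (size_t i = 0; i < n; i++) {
			if (ags[i].reserved && phase == TOY_BEST_FIT_TRYLOCK)
				continue;
			if (toy_fits(&ags[i], (enum toy_phase)phase, len))
				return (int)i;
		}
	}
	return -1;	/* no space anywhere */
}
```

With AGs 0 and 1 reserved and mostly empty, and AG 2 unreserved with a
longest free extent of 1 but 16 blocks free in total, a request of length 4
lands in AG 2 under toy_alloc_af() and in AG 0 under toy_alloc_inode32().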

Take a look at the core code snippet for inode32 in xfs_alloc_fix_freelist():


        /*
         * If this is a metadata preferred pag and we are user data then try
         * somewhere else if we are not being asked to try harder at this
         * point
         */
        if (xfs_perag_prefers_metadata(pag) &&
            (args->datatype & XFS_ALLOC_USERDATA) &&
            (alloc_flags & XFS_ALLOC_FLAG_TRYLOCK)) {
                ASSERT(!(alloc_flags & XFS_ALLOC_FLAG_FREEING));
                goto out_agbp_relse;
        }

That is exactly how inode32 decides whether or not it may go to the RESERVED
AG for allocation.

inode32 checks whether the current allocation is in *_TRYLOCK mode;
if it is not, the allocation is allowed into the RESERVED AG.
But at that moment, the allocation in the unreserved AGs has only gone
through 1.1->...
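
The decision above can be reduced to a standalone predicate. This is just a
sketch for discussion: the TOY_* flag values and toy_skip_reserved_ag() are
stand-ins I made up so it compiles outside the kernel; only the shape of
the condition is taken from the snippet quoted above.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in flag values, NOT the real kernel definitions. */
#define TOY_ALLOC_USERDATA	(1u << 0)	/* models XFS_ALLOC_USERDATA */
#define TOY_FLAG_TRYLOCK	(1u << 0)	/* models XFS_ALLOC_FLAG_TRYLOCK */
#define TOY_FLAG_FREEING	(1u << 1)	/* models XFS_ALLOC_FLAG_FREEING */

/*
 * Returns true when a user data allocation should skip this AG and try
 * somewhere else. The only "have we tried hard enough?" input is the
 * trylock flag; nothing records whether phases 1.2 and 2 have already
 * been run on the unreserved AGs.
 */
bool toy_skip_reserved_ag(bool prefers_metadata, unsigned int datatype,
			  unsigned int alloc_flags)
{
	if (prefers_metadata &&
	    (datatype & TOY_ALLOC_USERDATA) &&
	    (alloc_flags & TOY_FLAG_TRYLOCK)) {
		/* Mirrors the ASSERT in the quoted snippet. */
		assert(!(alloc_flags & TOY_FLAG_FREEING));
		return true;
	}
	return false;
}
```

So as soon as the trylock loop (1.1) is over, the predicate returns false
for user data and the reserved AG becomes a normal allocation target.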

Judging from the code of the metadata preference algorithm, using the
preference info to comply with rule [1] (that is, indicating that the
unreserved AGs have already gone through 1.1->1.2->2) would greatly increase
the system complexity compared with the AF algorithm, or may be basically
impossible...

To sum it up:
1. inode32/metadata-preference doesn't comply with rule [1], so it cannot
   solve the reported problem.
2. Since inode32 somewhat conflicts with the AF, maybe the AF should be
   disabled when inode32 is enabled...


> Remember: free space fragmentation can happen for many reasons - this
> mysql thing is just the latest one discovered.  The best solution is
> having general mechanisms in the filesystem that automatically
> mitigate the effects of free space fragmentation on inode
> allocation. The worst solution is requiring users to tweak knobs...
>
> -Dave.
> --
> Dave Chinner
> david@...morbit.com