Message-ID: <CANubcdVbimowVMdoH+Tzk6AZuU7miwf4PrvTv2Dh0R+eSuJ1CQ@mail.gmail.com>
Date: Mon, 4 Nov 2024 17:25:38 +0800
From: Stephen Zhang <starzhangzsd@...il.com>
To: Dave Chinner <david@...morbit.com>
Cc: djwong@...nel.org, dchinner@...hat.com, leo.lilong@...wei.com, 
	wozizhi@...wei.com, osandov@...com, xiang@...nel.org, 
	zhangjiachen.jaycee@...edance.com, linux-xfs@...r.kernel.org, 
	linux-kernel@...r.kernel.org, zhangshida@...inos.cn
Subject: Re: [PATCH 0/5] *** Introduce new space allocation algorithm ***

Dave Chinner <david@...morbit.com> wrote on Mon, Nov 4, 2024 at 11:32:
>
> On Mon, Nov 04, 2024 at 09:44:34AM +0800, zhangshida wrote:
> > From: Shida Zhang <zhangshida@...inos.cn>
> >
> > Hi all,
> >
> > Recently, we've been encountering xfs problems from our two
> > major users continuously.
> > They all manifest as the same phenomenon: an xfs filesystem
> > can't create a new file even though nearly half of the available
> > space remains, even with sparse inodes enabled.
>
> What application is causing this, and does using extent size hints
> make the problem go away?
>

Both are database-like applications, similar to mysql. Their source code
isn't available to us, and I doubt they have the ability to modify the
database source code...

> Also, xfs_info and xfs_spaceman free space histograms would be
> useful information.
>

There are two such cases.
In one case:
$ xfs_info disk.img
meta-data=disk.img               isize=512    agcount=344, agsize=1638400 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=563085312, imaxpct=25
         =                       sunit=64     swidth=64 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=12800, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

$ xfs_db -c freesp disk.img
   from      to extents  blocks    pct
      1       1 43375262 43375262  22.32
      2       3 64068226 150899026  77.66
      4       7       1       5   0.00
     32      63       3     133   0.00
    256     511       1     315   0.00
   1024    2047       1    1917   0.00
   8192   16383       2   20477   0.01


Another was mentioned already by one of my teammates. See:
https://lore.kernel.org/linux-xfs/173053338963.1934091.14116776076321174850.b4-ty@kernel.org/T/#t

[root@...alhost ~]# xfs_db -c freesp /dev/vdb
   from      to extents  blocks    pct
      1       1     215     215   0.01
      2       3  994476 1988952  99.99


> > It turns out that the filesystem is too fragmented to have
> > enough continuous free space to create a new file.
>
> > Life still has to go on.
> > But from our users' perspective, worse than xfs being hard to
> > use is xfs being unusable, since even a single file can't be
> > created now.
> >
> > So we try to introduce a new space allocation algorithm to
> > solve this.
> >
> > To achieve that, we try to propose a new concept:
> >    Allocation Fields, whose name is borrowed from the
> > mathematical concepts (Groups, Rings, Fields), will be
>
> I have no idea what this means. We don't have rings or fields,
> and an allocation group is simply a linear address space range.
> Please explain this concept (pointers to definitions and algorithms
> appreciated!)
>
>
> > abbreviated as AF in the rest of the article.
> >
> > What is an AF?
> > A one-picture-says-it-all version of the explanation:
> >
> > |<--------+ af 0 +-------->|<--+ af 1 +-->| af 2|
> > |------------------------------------------------+
> > | ag 0 | ag 1 | ag 2 | ag 3| ag 4 | ag 5 | ag 6 |
> > +------------------------------------------------+
> >
> > A text-based definition of AF:
> > 1. An AF is an in-core-only concept, in contrast to the on-disk
> >    AG concept.
> > 2. An AF consists of a contiguous series of AGs.
> > 3. Lower AFs will NEVER go to higher AFs for an allocation if
> >    it can be completed in the current AF.
> >
> > Rule 3 can serve as a barrier between the AFs to slow down
> > the runaway spread of fragmented pieces.
>
> To a point, yes. But it's not really a reliable solution, because
> directories are rotored across all AGs. Hence if the workload is
> running across multiple AGs, then all of the AFs can end up being
> fragmented at the same time.
>

You mean the directory inodes are expected to be distributed
evenly over the entire filesystem, and the file extents under those
directories will be distributed in the same way?

The ideal AF layout to construct is one that confines the higher AF
to a small part of the entire [0, agcount) range. Like:

|<-----+ af 0 +----->|<af 1>|
|----------------------------
| ag 0 | ag 1 | ag 2 | ag 3 |
+----------------------------

So for most of the AGs (0, 1, 2), which sit in af 0, that will not be
a problem.
As for an AG in the small part, like ag 3:
if there is an inode in ag 3 and a space allocation comes in for that
inode, the allocator will not look for space in ag 3 first. It will still
search from af 0 to af 1, which is the logic reflected in the patch:

[PATCH 4/5] xfs: add infrastructure to support AF allocation algorithm

it says:

+ /* if start_agno is not in current AF range, make it be. */
+ if ((start_agno < start_af) || (start_agno > end_af))
+       start_agno = start_af;

which means start_agno will no longer be honored, so the locality
principle is traded for the AF boundary.

In general, the evenly distributed layout is only slightly broken, and only
for the last small AGs, if you choose the AF layout properly.
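
To make sure I'm reading the intent of the patch correctly, here is a toy
sketch of how I understand the rule-3 walk. This is only an illustration,
not the patch code; the af_range struct, the ag_max_free model and the
af_alloc() helper are all made up for this mail:

#include <stddef.h>

typedef unsigned int agnumber_t;
typedef unsigned int extlen_t;

/* An AF is just a contiguous, in-core range of AG numbers. */
struct af_range {
        agnumber_t      start_af;       /* first AG in this AF */
        agnumber_t      end_af;         /* last AG in this AF (inclusive) */
};

/* Crude stand-in for per-AG free space: largest free extent in each AG. */
static extlen_t ag_max_free[4] = { 3, 3, 3, 100 };

/*
 * Rule 3: walk the AFs from lowest to highest and only fall through to
 * the next AF when no AG in the current AF can satisfy the request.
 * Returns the AG the extent is placed in, or -1 on failure.
 */
static int
af_alloc(const struct af_range *afs, size_t naf,
         agnumber_t start_agno, extlen_t len)
{
        for (size_t i = 0; i < naf; i++) {
                agnumber_t agno = start_agno;

                /* if start_agno is not in the current AF range, clamp it */
                if (agno < afs[i].start_af || agno > afs[i].end_af)
                        agno = afs[i].start_af;

                for (agnumber_t ag = agno; ag <= afs[i].end_af; ag++) {
                        if (ag_max_free[ag] >= len) {
                                ag_max_free[ag] -= len; /* consume space */
                                return (int)ag;
                        }
                }
        }
        return -1;
}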

> Given that I don't know how an application controls what AF its
> files are located in, I can't really say much more than that.
>
> > With these patches applied, the code logic will be exactly
> > the same as the original code logic, unless you run with the
> > extra mount option. For example:
> >    mount -o af1=1 $dev $mnt
> >
> > That will change the default AF layout:
> >
> > |<--------+ af 0 +--------->|
> > |----------------------------
> > | ag 0 | ag 1 | ag 2 | ag 3 |
> > +----------------------------
> >
> > to :
> >
> > |<-----+ af 0 +----->|<af 1>|
> > |----------------------------
> > | ag 0 | ag 1 | ag 2 | ag 3 |
> > +----------------------------
> >
> > So 'af1=1' here means that af 1 starts one AG back from
> > m_sb.agcount (i.e. af 1 covers only the last AG).
>
> Yup, so kinda what we did back in 2006 in a proprietary SGI NAS
> product with "concat groups" to create aggregations of allocation
> groups that all sat on the same physical RAID5 luns in a linear
> concat volume. They were fixed size, because the (dozens of) luns
> were all the same size. This construct was heavily tailored to
> maximising the performance provided by the underlying storage
> hardware architecture, so wasn't really a general policy solution.
>
> To make it work, we also had to change how various other allocation
> distribution algorithms worked (e.g. directory rotoring) so that
> the load was distributed more evenly across the physical hardware
> backing the filesystem address space.
>
> I don't see anything like that in this patch set - there's no actual
> control mechanism to select what AF an inode lands in. How does an
> application or user actually use this reliably to prevent all the
> AFs being fragmented by the workload that is running?
>

> > 3. Lower AFs will NEVER go to higher AFs for an allocation if
> >    it can be completed in the current AF.

From that rule, we can infer that,
     for any specific af, if len1 > len2, then
     P(len1) <= P(len2),

where P(len) represents the probability of successfully allocating an
extent of exactly *len* blocks in that af.

To prove that, imagine we have to allocate two extents of len 1 and
len 4 in af 0:
if we can allocate len 4 in af 0, then we can certainly allocate len 1
in af 0, but
if we can allocate len 1 in af 0, we may not be able to allocate len 4
in af 0.

So, P(len1) <= P(len2).
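
For a concrete example: if af 0's largest remaining free extent is 3
blocks, then within af 0 we have P(1) = P(2) = P(3) = 1 but P(4) = 0, so
4-block allocations spill over to af 1 while 1-block allocations keep
landing in af 0.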

That means the extents will naturally form layers of different lengths,
like:

       +------------------------+
       |            8           |
af 2   |    1   8     8  1      |
       |       1   1            |
       +------------------------+
       |                        |
       |    4                   |
       |          4             |
af 1   |        4     1         |
       |    1       4           |
       |                  4     |
       +------------------------+
       |                        |
       |  1     1     1         |
       |                        |
       |           1            |
       |  1  1 4       1        |
af 0   |           1            |
       |      1                 |
       |                  1     |
       |          1             |
       |                        |
       +------------------------+

So there is no need to provide extra preference-control info for
an allocation. It will naturally find where it should go, without
the extra need of changing the application source code.
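
Driving the toy af_alloc() sketch from earlier in this mail shows that
layering directly (purely illustrative numbers; it assumes the sketch's
definitions are in the same file):

#include <stdio.h>

int main(void)
{
        /* the af1=1 layout from above: af 0 = ag 0-2, af 1 = ag 3 */
        struct af_range afs[] = { { 0, 2 }, { 3, 3 } };

        /* small allocations keep landing in the fragmented af 0 ... */
        printf("len 1 -> ag %d\n", af_alloc(afs, 2, 0, 1));    /* ag 0 */
        printf("len 2 -> ag %d\n", af_alloc(afs, 2, 1, 2));    /* ag 1 */
        /* ... while a large allocation falls through to af 1 (ag 3) */
        printf("len 8 -> ag %d\n", af_alloc(afs, 2, 0, 8));    /* ag 3 */
        return 0;
}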




> > We did some tests to verify it. You can verify it yourself
> > by running the following commands:
> >
> > 1. Create a 1G-sized img file and format it as xfs:
> >   dd if=/dev/zero of=test.img bs=1M count=1024
> >   mkfs.xfs -f test.img
> >   sync
> > 2. Make a mount directory:
> >   mkdir mnt
> > 3. Run the auto_frag.sh script, which will call another script,
> >   frag.sh. These scripts are attached to the mail.
> >   To enable the AF, run:
> >     ./auto_frag.sh 1
> >   To disable the AF, run:
> >     ./auto_frag.sh 0
> >
> > Please feel free to communicate with us if you have any thoughts
> > about these problems.
>
> We already have inode/metadata preferred allocation groups that
> are avoided for data allocation if at all possible. This is how we
> keep space free below 1TB for inodes when the inode32 allocator has
> been selected. See xfs_perag_prefers_metadata().
>
> Perhaps being able to control this preference from userspace (e.g.
> via xfs_spaceman commands through ioctls and/or sysfs knobs) would
> achieve the same results with a minimum of code and/or policy
> changes. i.e. if AG0 is preferred for metadata rather than data,
> we won't allocate data in it until all higher AGs are largely full.
>
> -Dave.
> --
> Dave Chinner
> david@...morbit.com
