linux-ext4 - Re: fragmentation optimization

Message-ID: <20170925115715.2wen25de35iv5hse@rh_laptop>
Date:   Mon, 25 Sep 2017 13:57:15 +0200
From:   Lukas Czerner <lczerner@...hat.com>
To:     Jaco Kroon <jaco@....co.za>
Cc:     linux-ext4@...r.kernel.org, Theodore Ts'o <tytso@....edu>
Subject: Re: fragmentation optimization

On Sat, Sep 23, 2017 at 09:49:25AM +0200, Jaco Kroon wrote:
> Hi Ted, Everyone,
> 
> During our last discussions you mentioned the following (2017/08/16 5:06
> SAST/GMT+2):
> 
> "One other thought.  There is an ext4 block allocator optimization
> "feature" which is biting us here.  At the moment we have an
> optimization where if there is small "hole" in the logical block
> number space, we leave a "hole" in the physical blocks allocated to
> the file."
> 
> You proceeded to provide the example regarding writing of object files as
> per binutils (ld specifically).
> 
> As per the data I provided you previously rsync (with --sparse) is
> generating a lot of "holes" for us due to this.  As a result I end up with a
> rather insane amount of fragmentation:
> 
> Blocksize: 4096 bytes
> Total blocks: 13153337344
> Free blocks: 1272662587 (9.7%)
> 
> Min. free extent: 4 KB
> Max. free extent: 17304 KB
> Avg. free extent: 44 KB
> Num. free extent: 68868260
> 
> HISTOGRAM OF FREE EXTENT SIZES:
> Extent Size Range :  Free extents   Free Blocks  Percent
>     4K...    8K-  :      28472490      28472490    2.24%
>     8K...   16K-  :      27005860      55030426    4.32%
>    16K...   32K-  :       2595993      14333888    1.13%
>    32K...   64K-  :       2888720      32441623    2.55%
>    64K...  128K-  :       2745121      62071861    4.88%
>   128K...  256K-  :       2303439     103166554    8.11%
>   256K...  512K-  :       1518463     134776388   10.59%
>   512K... 1024K-  :        902691     163108612   12.82%
>     1M...    2M-  :        314858     105445496    8.29%
>     2M...    4M-  :         97174      64620009    5.08%
>     4M...    8M-  :         22501      28760501    2.26%
>     8M...   16M-  :           945       2069807    0.16%
>    16M...   32M-  :             5         21155    0.00%

Hi,

looking at the data like this does not really give me much enlightenment
on what's going on. You have less than 10% of free space left, and that
alone might play some role in your fragmentation. Filefrag output might
give us a better picture.
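
As a rough sketch of how per-file numbers could be collected: the script
below runs filefrag (from e2fsprogs, assumed to be installed) on the
given files and sorts them by extent count. The helper names are made up
for illustration, and the parser only assumes the usual
"<path>: N extents found" summary line.

```python
import re
import subprocess
import sys

# Matches the summary line filefrag prints per file, e.g.
# "/data/big.img: 1234 extents found" or "/data/small: 1 extent found".
_EXTENTS_RE = re.compile(r"^(?P<path>.*): (?P<count>\d+) extents? found$")


def parse_filefrag_line(line):
    """Return (path, extent_count) for a summary line, else None."""
    match = _EXTENTS_RE.match(line.strip())
    if match is None:
        return None
    return match.group("path"), int(match.group("count"))


def extent_counts(paths):
    """Run filefrag on the given paths and return [(path, extents), ...]."""
    result = subprocess.run(
        ["filefrag", "--", *paths],
        check=True,
        capture_output=True,
        text=True,
    )
    parsed = (parse_filefrag_line(line) for line in result.stdout.splitlines())
    return [item for item in parsed if item is not None]


if __name__ == "__main__":
    # Usage: python3 extents.py FILE [FILE ...]
    if len(sys.argv) > 1:
        counts = extent_counts(sys.argv[1:])
        # Most fragmented files first.
        for path, count in sorted(counts, key=lambda pc: pc[1], reverse=True):
            print(f"{count:8d}  {path}")
```

For the few very large files, `filefrag -v` on them directly would also
show where the individual extents are.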

Also, I do not see any mention of how exactly this hurts you. There is
going to be some cost associated with processing a bigger extent tree,
or reading a fragmented file from disk. However, do you have any data
backing this up?

One other thing you could try is to use --preallocate for rsync. This
should preallocate the entire file size before writing into it, which
should help with fragmentation. It also has the side effect of ext4
using another optimization: instead of splitting the extent when
leaving a hole in the file, it will write zeroes to fill the gap. The
maximum size of the hole we're going to zero out can be configured via
/sys/fs/ext4/<device>/extent_max_zeroout_kb. By default this is 32kB.
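
To check what the tunable is currently set to, something along these
lines should do. It is only a sketch: the function names are
hypothetical, and the sysfs_root parameter exists just so the code can
be pointed at a test directory instead of the real /sys/fs/ext4.

```python
from pathlib import Path

SYSFS_EXT4 = Path("/sys/fs/ext4")


def read_max_zeroout_kb(device, sysfs_root=SYSFS_EXT4):
    """Return extent_max_zeroout_kb (in kB) for one device, e.g. 'sda1'."""
    tunable = Path(sysfs_root) / device / "extent_max_zeroout_kb"
    return int(tunable.read_text().strip())


def all_max_zeroout_kb(sysfs_root=SYSFS_EXT4):
    """Return {device: kB} for every entry that exposes the tunable."""
    root = Path(sysfs_root)
    values = {}
    if not root.is_dir():
        return values
    for entry in sorted(root.iterdir()):
        tunable = entry / "extent_max_zeroout_kb"
        # Entries such as 'features' do not carry this file; skip them.
        if tunable.is_file():
            values[entry.name] = int(tunable.read_text().strip())
    return values


if __name__ == "__main__":
    for device, kb in all_max_zeroout_kb().items():
        print(f"{device}: {kb} kB")
```

Changing the value is a matter of writing a new number into the same
file as root.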


-Lukas

> 
> Based on the behavior I notice by watching how rsync works[1] I greatly
> suspect that writes are sequential from start of file to end of file.
> Regarding the above "feature" you further proceeded to mention:
> 
> "However, it obviously doesn't do the right thing for rsync --sparse,
> and these days, thanks to delayed allocation, so long as binutils can
> finish writing the blocks within 30 seconds, it doesn't matter if GNU
> ld writes the blocks in a completely random order, since we will only
> attempt to do the writeback to the disk after all of the holes in the
> .o file have been filled in.  So perhaps we should turn off this ext4
> block allocator optimization if delayed allocation is enabled (which
> is the default these days)."
> 
> You mentioned a few pros and cons of this approach as well, and also
> mentioned that it won't help my existing filesystem, however, I suspect it
> might in combination with an e4defrag sweep (which if it takes a few weeks in
> the background that's fine by me).  Also, I suspect disabling this might
> help avoid future holes, and since persistence of files varies (from a week
> to a year) I suspect it may help to over time slowly improve performance.
> 
> I'm also relatively comfortable to make the 30s write limit even longer (as
> you pointed out the files causing the problems are typically 300GB+ even
> though on average my files are very small), provided that I won't
> introduce additional file-system corruption risk.  Also keeping in mind that
> I run anything from 10 to 20 concurrent rsync instances at any point in
> time.
> 
> I would like to attempt such a patch, so if you (or someone else) could
> possibly point me in an appropriate direction of where to start work on this
> I would really appreciate the help.
> 
> Another approach for me may be to simply switch off --sparse since
> especially now I'm unsure of its benefit.  I'm guessing that I could do a
> sweep of all inodes to determine how much space is really being saved by
> this.
> 
> Kind Regards,
> Jaco
> 
> [1] My observed behaviour when syncing a file (without --inplace which is in
> my opinion a bad idea in general unless you're severely space constrained,
> and then I honestly don't know how this situation would be affected) is that
> rsync will create a new file, and then the file size of this file will grow
> slowly (note, not disk usage, but size as reported by ls) until it reaches
> the file size of the new file, and at this point rsync will use rename(2) to
> replace the old file with the new one (which is the right approach).
> 
>