Date:   Thu, 8 Sep 2022 13:47:56 +0530
From:   Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To:     Jan Kara <jack@...e.cz>
Cc:     Ted Tso <tytso@....edu>, linux-ext4@...r.kernel.org,
        Thorsten Leemhuis <regressions@...mhuis.info>,
        Stefan Wahren <stefan.wahren@...e.com>,
        Andreas Dilger <adilger.kernel@...ger.ca>
Subject: Re: [PATCH 0/5 v2] ext4: Fix performance regression with mballoc

On Tue, Sep 06, 2022 at 05:29:06PM +0200, Jan Kara wrote:
> Hello,
> 
> Here is a second version of my mballoc improvements to avoid spreading
> allocations with mb_optimize_scan=1. The patches fix the performance
> regression I was able to reproduce with reaim on my test machine:
> 
>                      mb_optimize_scan=0     mb_optimize_scan=1     patched
> Hmean     disk-1       2076.12 (   0.00%)     2099.37 (   1.12%)     2032.52 (  -2.10%)
> Hmean     disk-41     92481.20 (   0.00%)    83787.47 *  -9.40%*    90308.37 (  -2.35%)
> Hmean     disk-81    155073.39 (   0.00%)   135527.05 * -12.60%*   154285.71 (  -0.51%)
> Hmean     disk-121   185109.64 (   0.00%)   166284.93 * -10.17%*   185298.62 (   0.10%)
> Hmean     disk-161   229890.53 (   0.00%)   207563.39 *  -9.71%*   232883.32 *   1.30%*
> Hmean     disk-201   223333.33 (   0.00%)   203235.59 *  -9.00%*   221446.93 (  -0.84%)
> Hmean     disk-241   235735.25 (   0.00%)   217705.51 *  -7.65%*   239483.27 *   1.59%*
> Hmean     disk-281   266772.15 (   0.00%)   241132.72 *  -9.61%*   263108.62 (  -1.37%)
> Hmean     disk-321   265435.50 (   0.00%)   245412.84 *  -7.54%*   267277.27 (   0.69%)
> 
> The changes also significantly reduce spreading of allocations for small /
> moderately sized files. I'm not able to measure a performance difference
> resulting from this but on eMMC storage this seems to be the main culprit
> of reduced performance. Untarring of the raspberry-pi archive touches the
> following numbers of groups:
> 
> 	mb_optimize_scan=0	mb_optimize_scan=1	patched
> groups	4			22			7
> 
> To achieve this I have added two more changes on top of v1 - patches 4 and 5.
> Patch 4 makes sure we use locality group preallocation even for files that are
> not likely to grow anymore (previously we disabled all preallocations for
> such files; however, locality group preallocation still makes a lot of sense
> for such files). This patch reduced the spread of small file allocations, but
> larger file allocations were still spread significantly because they avoid
> locality group preallocation and, as they are not power-of-two in size, they
> also immediately start with the cr=1 scan. To address that I've changed the data
> structure for looking up the best block group to allocate from (see patch 5
> for details).
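
(Restating the patch 4 idea in compilable form, for anyone skimming the
archive: a minimal userspace sketch, not the actual mballoc code. The
threshold constant, the enum, and both policy functions below are invented
for illustration.)

#include <stdbool.h>
#include <stdio.h>

/* Assumed analogue of the stream request threshold (64KiB in 4KiB blocks);
 * the real tunable lives in the ext4 superblock info. */
#define STREAM_REQ_BLOCKS 16

enum prealloc_choice {
	PA_NONE,		/* no preallocation, allocate directly */
	PA_LOCALITY_GROUP,	/* shared per-CPU locality group PA */
	PA_INODE,		/* private per-inode PA */
};

/* Pre-patch behaviour: a file unlikely to grow gets no preallocation at
 * all, so small closed files scatter across block groups. */
static enum prealloc_choice policy_before(unsigned int blocks, bool may_grow)
{
	if (!may_grow)
		return PA_NONE;
	return blocks < STREAM_REQ_BLOCKS ? PA_LOCALITY_GROUP : PA_INODE;
}

/* Patched behaviour (patch 4): small allocations use the locality group
 * PA even for files that will not grow, so neighbouring small files end
 * up sharing a block group. */
static enum prealloc_choice policy_after(unsigned int blocks, bool may_grow)
{
	if (blocks < STREAM_REQ_BLOCKS)
		return PA_LOCALITY_GROUP;
	return may_grow ? PA_INODE : PA_NONE;
}

int main(void)
{
	/* A 20KiB (5-block) file that was written once and closed: */
	printf("before: %d, after: %d\n",
	       policy_before(5, false), policy_after(5, false));
	return 0;
}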
> 
> Stefan, can you please test whether these patches fix the problem for you as
> well? Comments & review welcome.
> 
> 								Honza
> Previous versions:
> Link: http://lore.kernel.org/r/20220823134508.27854-1-jack@suse.cz # v1

Hi Jan,

Thanks for the patch. I tested this series on my Raspberry Pi and I can
confirm that the regression is no longer present, with both
mb_optimize_scan=0 and =1 taking a similar amount of time to untar. The
allocation spread I'm seeing is as follows:
mb_optimize_scan=0: 10
mb_optimize_scan=1: 17 (see the graphs linked below for more details)

For mb_optimize_scan=1, I also compared the spread of locality group PA
eligible files (<64KB) and inode PA files. The results can be found
here:

mb_optimize_scan=0:
https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/patchv2-mbopt0.png
mb_optimize_scan=1:
https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/patchv2.png
mb_optimize_scan=1 (lg pa only):
https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/patchv2-lgs.png
mb_optimize_scan=1 (inode pa only):
https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/patchv2-i.png

The smaller files are now closer together due to the changes to the
locality group PA logic. Most of the spread is now coming from mid-sized
to large files.

To test this further, I created a tar of 2000 100KB files to see if
there is any performance drop due to the higher spread of these files and
noticed that, although the spread is slightly higher (5 BGs vs 9), we don't
see a performance drop while untarring with mb_optimize_scan=1.
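
In case it's useful to others, the per-file spread can be approximated
from userspace with the FIEMAP ioctl. A rough sketch, assuming the
default 4KiB block size (so 128MiB block groups) and at most 512 extents
per file; this is illustrative, not the exact tooling behind the numbers
above:

#include <fcntl.h>
#include <linux/fiemap.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define MAX_EXTENTS 512
#define GROUP_BYTES (128ULL << 20)	/* 32768 blocks * 4KiB per group */

int main(int argc, char **argv)
{
	unsigned long long seen[MAX_EXTENTS];
	unsigned int i, j, ngroups = 0;
	struct fiemap *fm;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror("open");
		return 1;
	}
	fm = calloc(1, sizeof(*fm) + MAX_EXTENTS * sizeof(struct fiemap_extent));
	if (!fm)
		return 1;
	fm->fm_length = ~0ULL;			/* map the whole file */
	fm->fm_flags = FIEMAP_FLAG_SYNC;	/* flush delalloc first */
	fm->fm_extent_count = MAX_EXTENTS;

	if (ioctl(fd, FS_IOC_FIEMAP, fm) < 0) {
		perror("FS_IOC_FIEMAP");
		return 1;
	}
	for (i = 0; i < fm->fm_mapped_extents; i++) {
		/* an extent can cross a group boundary; count every group */
		unsigned long long g = fm->fm_extents[i].fe_physical / GROUP_BYTES;
		unsigned long long end = (fm->fm_extents[i].fe_physical +
					  fm->fm_extents[i].fe_length - 1) / GROUP_BYTES;

		for (; g <= end && ngroups < MAX_EXTENTS; g++) {
			for (j = 0; j < ngroups && seen[j] != g; j++)
				;
			if (j == ngroups)
				seen[ngroups++] = g;
		}
	}
	printf("%s: %u extent(s) in %u block group(s)\n",
	       argv[1], fm->fm_mapped_extents, ngroups);
	free(fm);
	close(fd);
	return 0;
}

Running it over every file in the untarred tree and merging the group
sets gives the overall spread.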

Although we still have some spread, I think we have brought it down to a
much more manageable level, and that, combined with the improvements to the
cr=1 allocation scan, has given us a good performance improvement.

Feel free to add:
Tested-by: Ojaswin Mujoo <ojaswin@...ux.ibm.com>

Regards,
Ojaswin
