linux-ext4 - Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

Message-ID: <20220905101504.vnx377x7eao42izi@quack3>
Date:   Mon, 5 Sep 2022 12:15:04 +0200
From:   Jan Kara <jack@...e.cz>
To:     Andreas Dilger <adilger@...ger.ca>
Cc:     Stefan Wahren <stefan.wahren@...e.com>,
        Ojaswin Mujoo <ojaswin@...ux.ibm.com>, Jan Kara <jack@...e.cz>,
        Ted Tso <tytso@....edu>, linux-ext4@...r.kernel.org,
        Thorsten Leemhuis <regressions@...mhuis.info>,
        Harshad Shirwadkar <harshadshirwadkar@...il.com>
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

On Sun 04-09-22 12:32:59, Andreas Dilger wrote:
> On Sep 4, 2022, at 00:01, Stefan Wahren <stefan.wahren@...e.com> wrote:
> > 
> >> On 27.08.22 at 16:36, Ojaswin Mujoo wrote:
> >>> On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote:
> >>> Hi Stefan,
> >>> 
> >>> On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
> >>>>> Perhaps if you just download the archive manually, call sync(1), and measure
> >>>>> how long it takes to (untar the archive + sync) in mb_optimize_scan=0/1 we
> >>>>> can see whether plain untar is indeed making the difference or there's
> >>>>> something else influencing the result as well (I have checked and
> >>>>> rpi-update does a lot of other deleting & copying as part of the
> >>>>> update)? Thanks.
> >>>> mb_optimize_scan=0 -> almost 5 minutes
> >>>> 
> >>>> mb_optimize_scan=1 -> almost 18 minutes
> >>>> 
> >>>> https://github.com/lategoodbye/mb_optimize_scan_regress/commit/3f3fe8f87881687bb654051942923a6b78f16dec
> >>> Thanks! So now the iostat data indeed looks substantially different.
> >>> 
> >>>                           nooptimize   optimize
> >>> Total written             183.6 MB     190.5 MB
> >>> Time (recorded)           283 s        1040 s
> >>> Avg write request size    79 KB        41 KB
> >>> 
> >>> So indeed with mb_optimize_scan=1 we do submit substantially smaller
> >>> requests on average. So far I'm not sure why that is. Since Ojaswin can
> >>> reproduce as well, let's see what he can see from block location info.
> >>> Thanks again for help with debugging this and enjoy your vacation!
> >>> 
> >> Hi Jan and Stefan,
> >> 
> >> Apologies for the delay, I was on leave yesterday and couldn't find time to get to this.
> >> 
> >> So I was able to collect the block numbers using the method you suggested. I converted the
> >> block numbers to BG numbers and plotted that data to visualize the allocation spread. You can
> >> find them here:
> >> 
> >> mb-opt=0, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-0-patched.png
> >> mb-opt=1, patched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-patched.png
> >> mb-opt=1, unpatched kernel: https://github.com/OjaswinM/mbopt-bug/blob/master/grpahs/mbopt-1-unpatched.png
> >> 
> >> Observations:
> >> * Before the patch, mb_optimize_scan=1 allocations were far more spread
> >>   out, across 40 different BGs.
> >> * With the patch, we still allocate in 36 different BGs but the majority
> >>   happen in just 1 or 2 BGs.
> >> * With mb_optimize_scan=0, we allocate in just 7 unique BGs, which could
> >>   explain why this is faster.
> > 
> > Thanks, this is very helpful for my understanding. So it seems to me that with mb_optimize_scan disabled we have more sequential write behavior, and with mb_optimize_scan enabled more random write behavior.
> > 
> > From my understanding, writing small blocks at random addresses of the SD card flash causes a lot of overhead, because the SD card controller needs to erase huge blocks (up to 1 MB) before it is able to program the flash pages. This would explain why this series doesn't fix the performance issue: the total number of BGs is still much higher.
> > 
> > Is this new block allocation pattern a side effect of the optimization or desired?
> 
> The goal of mb_optimize_scan is to avoid a large amount of linear
> scanning for very large filesystems when there are many block groups that
> are full or fragmented.
> 
> It seems for empty filesystems the new list management is not ideal.

The filesystems here are actually about half full and not too fragmented.

> It probably makes sense to have a hybrid, with some small amount of
> linear scanning (eg. a meta block group worth), and then use the new list
> to find a new group if that doesn't find any group with free space. 

The current mballoc code does have a heuristic that scans a few block
groups linearly before using the data structures to pick the next block
group, but it is used only for rotational devices. I don't know of an easy
way to detect other types of storage, like eMMC cards, that also benefit
from better data allocation locality.
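
To illustrate the hybrid idea discussed above (a short linear scan first,
then the data structures), here is a minimal Python sketch. It is not the
mballoc code: pick_group, largest_free_group, linear_limit and the per-group
free-block list are hypothetical names, and the fallback is only a stand-in
for the real structured lookup.

```python
from typing import Callable, Optional, Sequence


def pick_group(goal: int,
               free_blocks: Sequence[int],
               needed: int,
               linear_limit: int,
               fallback: Callable[[Sequence[int], int], Optional[int]]
               ) -> Optional[int]:
    """Scan up to linear_limit groups starting at goal, then fall back.

    free_blocks[g] is the number of free blocks in group g (hypothetical
    model; real allocators track much more than a single counter).
    """
    ngroups = len(free_blocks)
    for i in range(min(linear_limit, ngroups)):
        group = (goal + i) % ngroups  # wrap around at the last group
        if free_blocks[group] >= needed:
            return group  # stays close to the goal: better locality
    # Nothing suitable nearby: let the structured lookup choose.
    return fallback(free_blocks, needed)


def largest_free_group(free_blocks: Sequence[int],
                       needed: int) -> Optional[int]:
    """Stand-in for the structured lookup: group with the most free blocks."""
    best = max(range(len(free_blocks)),
               key=lambda g: free_blocks[g], default=None)
    if best is None or free_blocks[best] < needed:
        return None
    return best
```

With a larger linear_limit the choice stays nearer the goal group at the
cost of more scanning, which is the trade-off under discussion here.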

I have come up with two more patches on top of my current attempt which
improve allocation locality, and at least for the untar case causing issues
on eMMC they get close to the mb_optimize_scan=0 locality. I want to
check that the higher locality does not hurt performance for highly
parallel workloads though. Then I'll post them for review and discussion.
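
For reference, the locality comparison used in this thread (how many
distinct block groups the allocations land in) can be reproduced with a
few lines of Python. This is only a sketch: the constants assume a 4 KiB
block size with the default 32768 blocks per group and a first data block
of 0, and the function names are made up for illustration. The real values
should be read from the filesystem's superblock.

```python
from collections import Counter
from typing import Iterable

# Assumptions: ext4 defaults for a 4 KiB block size. Check the superblock
# of the filesystem under test before relying on these values.
BLOCKS_PER_GROUP = 32768
FIRST_DATA_BLOCK = 0


def block_to_group(block: int,
                   blocks_per_group: int = BLOCKS_PER_GROUP,
                   first_data_block: int = FIRST_DATA_BLOCK) -> int:
    """Map a filesystem block number to its block group number."""
    return (block - first_data_block) // blocks_per_group


def group_histogram(blocks: Iterable[int]) -> Counter:
    """Count how many of the given blocks fall into each block group."""
    return Counter(block_to_group(b) for b in blocks)
```

len(group_histogram(blocks)) then gives the number of unique block groups
touched, and the counts show whether most allocations concentrate in a few
groups.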

								Honza

-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR