Date:   Mon, 29 Aug 2022 11:04:34 +0200
From:   Jan Kara <>
To:     Ojaswin Mujoo <>
Cc:     Jan Kara <>, Stefan Wahren <>,
        Ted Tso <>,,
        Thorsten Leemhuis <>,
        Harshad Shirwadkar <>
Subject: Re: [PATCH 0/2] ext4: Fix performance regression with mballoc

On Sat 27-08-22 20:06:00, Ojaswin Mujoo wrote:
> On Fri, Aug 26, 2022 at 12:15:22PM +0200, Jan Kara wrote:
> > Hi Stefan,
> > 
> > On Thu 25-08-22 18:57:08, Stefan Wahren wrote:
> > > > Perhaps if you just download the archive manually, call sync(1), and
> > > > measure how long it takes to (untar the archive + sync) with
> > > > mb_optimize_scan=0/1, we can see whether plain untar is indeed making
> > > > the difference or whether something else is influencing the result as
> > > > well (I have checked, and rpi-update does a lot of other deleting &
> > > > copying as part of the update)? Thanks.
> > > 
> > > mb_optimize_scan=0 -> almost 5 minutes
> > > 
> > > mb_optimize_scan=1 -> almost 18 minutes
> > > 
> > >
> > 
> > Thanks! So now the iostat data indeed looks substantially different.
> > 
> > 			nooptimize	optimize
> > Total written		183.6 MB	190.5 MB
> > Time (recorded)		283 s		1040 s
> > Avg write request size	79 KB		41 KB
> > 
> > So indeed with mb_optimize_scan=1 we do submit substantially smaller
> > requests on average. So far I'm not sure why that is. Since Ojaswin can
> > reproduce this as well, let's see what he can find from the block
> > location info. Thanks again for the help with debugging this, and enjoy
> > your vacation!
> > 
> Hi Jan and Stefan,
> Apologies for the delay, I was on leave yesterday and couldn't find time to
> get to this. I was able to collect the block numbers using the method you
> suggested. I converted the block numbers to BG numbers and plotted that data
> to visualize the allocation spread. You can find the plots here:
> mb-opt=0, patched kernel:
> mb-opt=1, patched kernel:
> mb-opt=1, unpatched kernel:
> Observations:
> * Before the patch, mb_optimize_scan=1 allocations were much more spread
>   out, across 40 different BGs.
> * With the patch, we still allocate in 36 different BGs, but the majority
>   of allocations happen in just 1 or 2 BGs.
> * With mb_optimize_scan=0, we allocate in just 7 unique BGs, which could
>   explain why this is faster.

Thanks for testing, Ojaswin! Based on the iostats from Stefan, I'm relatively
confident the spread of allocations across block groups is responsible for
the performance regression. The iostats show pretty clearly that the write
throughput is determined by the average write request size, which is in turn
directly related to the number of block groups we allocate from.
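
To put rough numbers on it, here is a back-of-the-envelope sketch derived
from the iostat table above (the request counts are computed from the
totals, not measured directly):

# Same amount of data split into smaller write requests means more
# requests, and on eMMC each request seems to carry a noticeable
# fixed cost. Figures are from Stefan's iostat summary above.
def requests(total_mb, avg_kb):
    return total_mb * 1024 / avg_kb

nooptimize = requests(183.6, 79)   # ~2380 write requests
optimize   = requests(190.5, 41)   # ~4758 write requests
print(f"~{optimize / nooptimize:.1f}x more write requests "
      f"with mb_optimize_scan=1")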

Your stats for the patched kernel show that there are two block groups which
get a big part of the allocations (these are likely the target block groups),
but the remaining ~1/3 is spread a lot. I'm not yet sure why that is... I
guess I will fiddle some more with my test VM and try to reproduce these
allocation differences (on my test server the allocation pattern of the
patched kernel is very similar with mb_optimize_scan=0 and 1).
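
For reference, the per-BG counting Ojaswin did is easy to redo elsewhere. A
minimal sketch, assuming a plain file of allocated physical block numbers
(one per line, a hypothetical "allocated_blocks.txt") and the geometry
values reported by dumpe2fs -h (the values below are the usual ones for a
4k-block filesystem):

from collections import Counter

BLOCKS_PER_GROUP = 32768   # from dumpe2fs -h; one bitmap block covers
                           # block_size * 8 blocks, so 32768 for 4k blocks
FIRST_DATA_BLOCK = 0       # 1 on 1k-block filesystems

def block_to_bg(block):
    # Physical block number -> block group number.
    return (block - FIRST_DATA_BLOCK) // BLOCKS_PER_GROUP

with open("allocated_blocks.txt") as f:
    per_bg = Counter(block_to_bg(int(line)) for line in f if line.strip())

for bg, count in per_bg.most_common():
    print(f"BG {bg}: {count} allocations")
print(f"{len(per_bg)} unique BGs touched")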

> Also, one strange thing I'm seeing is that the perf profiles don't really
> show any particular function causing the regression, which is surprising
> considering mb_optimize_scan=1 takes almost 10 times longer.

Well, the time is not spent on the CPU. We spend more time waiting for IO,
which is not visible in ordinary perf profiles. You could plot something
like off-CPU flame graphs; there the difference would be visible, but I
don't expect you'd see anything more than that we spend more time in
functions waiting for writeback to complete.
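
If setting up proper off-CPU profiling on the Pi is inconvenient, even a
crude sampler illustrates the point. A rough sketch (assuming root and
/proc/<pid>/stack being available, i.e. CONFIG_STACKTRACE); it is no
substitute for real off-CPU flame graphs:

import collections, sys, time

# While the target task sits in uninterruptible sleep ('D'), record
# the innermost frame of its kernel stack; interrupt to print totals.
pid = sys.argv[1]
hits = collections.Counter()
try:
    while True:
        with open(f"/proc/{pid}/stat") as f:
            state = f.read().rsplit(')', 1)[1].split()[0]
        if state == 'D':
            with open(f"/proc/{pid}/stack") as f:
                hits[f.readline().strip()] += 1
        time.sleep(0.001)
except (FileNotFoundError, KeyboardInterrupt):
    for frame, n in hits.most_common(10):
        print(n, frame)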

> Lastly, FWIW I'm not able to replicate the regression when using loop
> devices: mb_optimize_scan=1 performs similarly to mb_optimize_scan=0
> (without the patches as well). Not sure if this is related to the issue or
> just some side effect of using loop devices.

This is because eMMC devices seem to be very sensitive to the IO pattern
(and in particular to the write request size). For loop devices, request
size does not matter much, so mb_optimize_scan makes no big difference
there. But can you still see the difference in the allocation pattern with
the loop device?
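
In case it helps with rerunning that comparison, a minimal sketch of how the
loop-device setup could be scripted (requires root; the image and mount
paths are just examples):

import subprocess

def run(*cmd):
    return subprocess.run(cmd, check=True, capture_output=True,
                          text=True).stdout.strip()

def mount_test_fs(optimize, img="/tmp/ext4-test.img",
                  mnt="/mnt/ext4-test", size_mb=2048):
    # Fresh loop-backed ext4 fs mounted with the requested
    # mb_optimize_scan setting, so both modes see identical images.
    run("truncate", "-s", f"{size_mb}M", img)
    loopdev = run("losetup", "--find", "--show", img)   # e.g. /dev/loop0
    run("mkfs.ext4", "-q", loopdev)
    run("mount", "-o", f"mb_optimize_scan={optimize}", loopdev, mnt)
    return loopdev

# Untar into mnt, sync, time it; then umount and `losetup -d` to clean up.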

Jan Kara <>
