Message-ID: <aEk8xcy9Zykrx3m3@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
Date: Wed, 11 Jun 2025 13:52:29 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Baokun Li <libaokun1@...wei.com>
Cc: linux-ext4@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
jack@...e.cz, linux-kernel@...r.kernel.org, yi.zhang@...wei.com,
yangerkun@...wei.com
Subject: Re: [PATCH 0/4] ext4: better scalability for ext4 block allocation
On Tue, Jun 10, 2025 at 09:48:30PM +0800, Baokun Li wrote:
> On 2025/6/10 20:06, Ojaswin Mujoo wrote:
> > On Thu, May 29, 2025 at 08:24:14PM +0800, Baokun Li wrote:
> > > On 2025/5/28 22:53, Ojaswin Mujoo wrote:
> > > > On Fri, May 23, 2025 at 04:58:17PM +0800, libaokun@...weicloud.com wrote:
> > >
> > > ----------------------------------------------------------
> > > | base | patched |
> > > ---------------------|--------|--------|--------|--------|
> > > mb_optimize_scan | linear |opt_scan| linear |opt_scan|
> > > ---------------------|--------|--------|--------|--------|
> > > bw(MiB/s) | 217 | 219 | 5685 | 5670 |
> > > Avg. free extent size| 1943732| 1943728| 1439608| 1368328|
> > > Avg. extents per file| 261879 | 262039 | 744 | 2084 |
> > > Avg. size per extent | 4 | 4 | 1408 | 503 |
> > > Fragmentation score | 100 | 100 | 2 | 6 |
> > > ----------------------------------------------------------
> > Hi Baokun,
> >
> > Thanks for the info and data, and apologies for the late reply; I caught
> > a viral infection last week :/
> Hi Ojaswin,
>
> Being sick really takes a toll.
> Please get some good rest and take care of yourself.
Thanks Baokun!
<snip>
> >
> > > Another reason is that opt_scan tends to allocate from groups that have
> > > just received freed blocks, causing it to constantly "jump around"
> > > between certain groups.
> > >
> > > This leads to intense contention between allocation and release, and even
> > > between release events. In contrast, linear traversal always moves forward
> > > without revisiting groups, resulting in less contention between allocation
> > > and release.
> > By "just received freed blocks", you mean blocks that were recently freed
> > in that group, right?
> Yes.
> > I was under the impression that when we free blocks and the group's
> > largest order / avg fragment size changes, the group is added to the tail
> > of the avg fragment list / largest order list, so it should be the last
> > to be picked. Is that not the case?
> Yes, we are indeed adding the group to the list tail. However, because
> we traverse all ordered lists from low to high, a group might end up
> "bouncing" repeatedly between order_0 and order_1 here.
>
> For instance, if order_1 only contains group 1, linear traversal would
> rarely revisit it after the initial pass. However, after a non-linear
> allocation, this group is moved from the order_1 list to the order_0 list.
> When blocks are freed, it's moved back to the order_1 list, and then
> non-linear traversal prioritizes allocation in this same group again...
Right, I get what you mean now. Thanks.
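Just to check my understanding, here is a toy model of that ping-pong
(illustrative Python only, nothing like the real mballoc data structures or
locking; names are mine):

```python
from collections import deque

# Toy model of the per-order group lists: a group sits on the list
# matching its largest free-extent order. G1 is the only order-1 group.
order_lists = {0: deque(), 1: deque(["G1"])}

def alloc_order1():
    """Non-linear scan wanting an order-1 extent: take the first group
    on the order_1 list."""
    if order_lists[1]:
        g = order_lists[1].popleft()
        # the allocation splits G1's big extent, demoting it to order_0
        order_lists[0].append(g)
        return g
    return None

def free_blocks(g):
    """Freeing the blocks restores the order-1 extent, so the group is
    moved back to the tail of the order_1 list."""
    order_lists[0].remove(g)
    order_lists[1].append(g)

picks = []
for _ in range(3):      # alloc/free ping-pong
    picks.append(alloc_order1())
    free_blocks("G1")
print(picks)  # the same group wins every round
```

So even though the group goes to the tail on free, it is the only resident of
order_1, and every retry lands on it again, which matches your description.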
> >
> > > However, because linear traversal involves more groups in allocation, the
> > > journal becomes a bottleneck. If opt_scan first attempts to traverse block groups
> > > to the right from the target group in all lists, and then from index 0 to
> > > the left in all lists, competition between block groups would be
> > > significantly reduced.
> > >
> > > To enable ordered traversal, we attempted to convert list_head to an
> > > ordered xarray. This ordering prevents "bouncing" during retries.
> > > Additionally, traversing all right-side groups before left-side groups
> > > significantly reduced contention. Performance improved from 10430 to 17730.
> > Do you maybe have some code you can share of this?
> Yes, V2 will include those changes.
Okay sure
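In the meantime, just to confirm I follow the proposed ordering, something
like this (toy sketch with bare group indices; the function and its shape are
my guess, not the patch's code):

```python
def scan_order(ngroups, target):
    """Hypothetical traversal order for the proposed scheme: visit the
    groups to the right of the target first, then wrap around to index 0
    and cover the left side."""
    return list(range(target, ngroups)) + list(range(0, target))

print(scan_order(8, 5))  # [5, 6, 7, 0, 1, 2, 3, 4]
```

i.e. within one pass every allocating thread sweeps forward and never
revisits a group, which is what should prevent the bouncing.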
<snip>
> > > In a single container, create a file, then repeatedly append 8KB using
> > > fallocate until the file reaches 1MB. After that, truncate the file to
> > > 0 and continue appending 8KB with fallocate. The 64 containers will
> > > occupy a maximum of 64MB of disk space in total, so they won't fill the
> > > entire file system.
> > Also, as per your theory, if we do similar experiments in a fragmented
> > FS we should see opt_scan perform better right? I'd like to test this as
> > well. I'll try to play with the scripts you have shared.
> >
> Yes, mb_optimize_scan performs well when free space fragmentation is
> severe. We have a 600TB disk array where the write bandwidth is
> approximately 5 GB/s when empty. When space utilization reaches 95%,
> linear traversal drops bandwidth to 1 GB/s, whereas enabling
> mb_optimize_scan restores it to 4.2 GB/s.
Got it, thanks for confirming. It seems that on a mostly empty FS, linear
traversal does better. That makes me wonder if we should use some
heuristic to switch between linear and opt_scan based allocation, for
example using opt_scan once the FS is 60% full or has a fragmentation
score of X (or something like that..).
But I'm also curious about your improved optimize-scan implementation
with ordered traversal, so let's see how that goes first.
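For the record, I was imagining something along these lines, purely as a
strawman (the 60% is just my guess above, and frag_thresh is an arbitrary
placeholder for that "score of X"):

```python
def choose_policy(used_pct, frag_score, full_thresh=60, frag_thresh=30):
    """Strawman heuristic: stay with linear traversal while the FS is
    mostly empty and unfragmented, switch to opt_scan otherwise.
    frag_thresh=30 is an arbitrary placeholder, not a tuned value."""
    if used_pct >= full_thresh or frag_score >= frag_thresh:
        return "opt_scan"
    return "linear"

print(choose_policy(used_pct=10, frag_score=2))   # linear
print(choose_policy(used_pct=95, frag_score=80))  # opt_scan
```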
>
>
> Cheers,
> Baokun
>