Message-ID: <aEgfwKvcJzt9gkGq@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>
Date: Tue, 10 Jun 2025 17:36:24 +0530
From: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
To: Baokun Li <libaokun1@...wei.com>
Cc: linux-ext4@...r.kernel.org, tytso@....edu, adilger.kernel@...ger.ca,
jack@...e.cz, linux-kernel@...r.kernel.org, yi.zhang@...wei.com,
yangerkun@...wei.com, libaokun@...weicloud.com
Subject: Re: [PATCH 0/4] ext4: better scalability for ext4 block allocation
On Thu, May 29, 2025 at 08:24:14PM +0800, Baokun Li wrote:
> On 2025/5/28 22:53, Ojaswin Mujoo wrote:
> > On Fri, May 23, 2025 at 04:58:17PM +0800, libaokun@...weicloud.com wrote:
> > > From: Baokun Li <libaokun1@...wei.com>
<...>
> > > |--------|--------|--------|--------|--------|--------|--------|--------|
> > > | - | 1 | 2 | 4 | 8 | 16 | 32 | 64 |
> > > |--------|--------|--------|--------|--------|--------|--------|--------|
> > > | base | 295287 | 70665 | 33865 | 19387 | 10104 | 5588 | 3588 |
> > > |--------|--------|--------|--------|--------|--------|--------|--------|
> > > | linear | 286328 | 123102 | 119542 | 90653 | 60344 | 35302 | 23280 |
> > > | | -3.0% | 74.20% | 252.9% | 367.5% | 497.2% | 531.6% | 548.7% |
> > > |--------|--------|--------|--------|--------|--------|--------|--------|
> > > |mb_optim| 292498 | 133305 | 103069 | 61727 | 29702 | 16845 | 10430 |
> > > |ize_scan| -0.9% | 88.64% | 204.3% | 218.3% | 193.9% | 201.4% | 190.6% |
> > > |--------|--------|--------|--------|--------|--------|--------|--------|
> > Hey Baokun, nice improvements! The proposed changes make sense to me,
> > however I suspect the performance improvements may come at a cost of
> > slight increase in fragmentation, which might affect rotational disks
> > especially. Maybe comparing e2freefrag numbers with and without the
> > patches might give a better insight into this.
> While this approach might slightly increase free space fragmentation on
> the disk, it significantly reduces file fragmentation, leading to faster
> read speeds on rotational disks.
>
> When multiple processes contend for free blocks within the same block
> group, the probability of blocks allocated by the same process being
> merged on consecutive allocations is low. This is because other processes
> may preempt the free blocks immediately following the current process's
> last allocated region.
>
> Normally, we rely on preallocation to avoid files becoming overly
> fragmented (even though preallocation itself can cause fragmentation in
> free disk space). But since fallocate doesn't support preallocation, the
> fragmentation issue is more pronounced. Counterintuitively, skipping busy
> groups actually boosts opportunities for file extent merging, which in turn
> reduces overall file fragmentation.
>
> Referencing will-it-scale/fallocate2, I tested 64 processes each appending
> 4KB via fallocate to 64 separate files until they reached 1GB. This test
> specifically examines contention in block allocation; unlike fallocate2,
> it omits the contention between fallocate and truncate. Preliminary results
> are provided below; detailed scripts and full test outcomes are attached in
> the email footer.
>
> ----------------------------------------------------------
> | base | patched |
> ---------------------|--------|--------|--------|--------|
> mb_optimize_scan | linear |opt_scan| linear |opt_scan|
> ---------------------|--------|--------|--------|--------|
> bw(MiB/s) | 217 | 219 | 5685 | 5670 |
> Avg. free extent size| 1943732| 1943728| 1439608| 1368328|
> Avg. extents per file| 261879 | 262039 | 744 | 2084 |
> Avg. size per extent | 4 | 4 | 1408 | 503 |
> Fragmentation score | 100 | 100 | 2 | 6 |
> ----------------------------------------------------------
Hi Baokun,
Thanks for the info and data, and apologies for being late; I caught a
viral infection last week :/
These numbers look pretty interesting and your explanation of why the
fragmentation is better makes sense. It is definitely a win-win then for
performance and fragmentation!
> > Regardless the performance benefits are significant and I feel it is
> > good to have these patches.
> >
> > I'll give my reviews individually as I'm still going through patch 2
> > However, I wanted to check on a couple things:
> Okay, thank you for your feedback.
> >
> > 1. I believe you ran these in docker. Would you have any script etc open
> > sourced that I can use to run some benchmarks on my end (and also
> > understand your test setup).
> Yes, these two patches primarily mitigate contention between block
> allocations and between block allocation and release. The testing script
> can be referenced from the fio script mentioned earlier in the email
> footer. You can also add more truncate calls based on it.
Thanks for the scripts.
> > 2. I notice we are getting way less throughput in mb_optimize_scan? I
> > wonder why that is the case. Do you have some data on that? Are your
> > tests starting on an empty FS, maybe in that case linear scan works a
> > bit better since almost all groups are empty. If so, what are the
> > numbers like when we start with a fragmented FS?
> The throughput of mb_optimize_scan is indeed much lower, and we continue
> to optimize it, as mb_optimize_scan is the default mount option and
> performs better in scenarios with large volume disks and high space usage.
>
> Disk space used is about 7%; mb_optimize_scan should perform better with
> less free space. However, this isn't the critical factor. The poor
> throughput here is due to the following reasons.
>
> One reason is that mb_optimize_scan's list traversal is unordered and
> always selects the first group.
>
> While traversing the list, we hold a spin_lock, which prevents calling
> load_buddy, so ext4_lock_group() cannot be used directly. This can lead to
> a "bouncing" scenario where the spin_is_locked(grp_A) check passes but
> ext4_try_lock_group() then fails, forcing the list traversal to repeatedly
> restart from grp_A.
>
> In contrast, linear traversal directly uses ext4_try_lock_group(),
> avoiding this bouncing. Therefore, we need a lockless, ordered traversal
> to achieve linear-like efficiency.
Hmm, right, the unordered traversal has led to other issues as well in
the past.
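Just to make sure I'm following the bouncing, below is roughly how I picture
it in pseudo-C (simplified on my end, not the actual mballoc code: the real
lists and locks are per order, the function name is made up, and I'm assuming
ext4_try_lock_group() takes (sb, group) like ext4_lock_group() does):

static struct ext4_group_info *scan_list(struct super_block *sb,
                                         struct list_head *order_list,
                                         rwlock_t *list_lock)
{
        struct ext4_group_info *grp;

retry:
        read_lock(list_lock);
        list_for_each_entry(grp, order_list, bb_largest_free_order_node) {
                /* list lock held, so we can only peek at the group lock */
                if (spin_is_locked(ext4_group_lock_ptr(sb, grp->bb_group)))
                        continue;
                read_unlock(list_lock);
                /* the group lock may be taken between the peek and here */
                if (ext4_try_lock_group(sb, grp->bb_group))
                        return grp;
                goto retry;     /* restart from the head, i.e. grp_A again */
        }
        read_unlock(list_lock);
        return NULL;
}

whereas the linear scan just try-locks group i and moves on to i+1 when that
fails, so it never re-walks the same prefix of groups.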
>
> Another reason is that opt_scan tends to allocate from groups that have
> just received freed blocks, causing it to constantly "jump around"
> between certain groups.
>
> This leads to intense contention between allocation and release, and even
> between release events. In contrast, linear traversal always moves forward
> without revisiting groups, resulting in less contention between allocation
> and release.
By "just received freed blocks", you mean the blocks got freed in the group,
right? I was under the impression that when we free blocks and the group's
largest order / avg fragment size changes, the group is added to the end of
the free fragment list / order list, so it should be the last to be picked.
Is that not the case?
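For reference, the hand-wavy picture I had in mind on the free side is the
one below (a generic sketch, not the actual code; I'm thinking of the
tail-insert in mb_update_avg_fragment_size()/mb_set_largest_free_order(),
going from memory, and order_locks/order_lists just stand in for the
per-order rwlocks and lists in the sbi):

static void move_group_to_new_order(struct ext4_group_info *grp,
                                    int old_order, int new_order)
{
        /* drop the group from the list for its old order ... */
        write_lock(&order_locks[old_order]);
        list_del(&grp->node);                   /* 'node' is a stand-in */
        write_unlock(&order_locks[old_order]);

        /* ... and re-add it at the *tail* of the list for the new order,
         * which is why I'd expect it to be the last one picked there */
        write_lock(&order_locks[new_order]);
        list_add_tail(&grp->node, &order_lists[new_order]);
        write_unlock(&order_locks[new_order]);
}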
>
> However, because linear traversal involves more groups in allocation, the
> journal becomes a bottleneck. If opt_scan first traversed the block groups
> to the right of the target group in all lists, and then those from index 0
> up to the target in all lists, contention between block groups would be
> significantly reduced.
>
> To enable ordered traversal, we attempted to convert list_head to an
> ordered xarray. This ordering prevents "bouncing" during retries.
> Additionally, traversing all right-side groups before left-side groups
> significantly reduced contention. Performance improved from 10430 to 17730.
Do you maybe have some code for this that you can share?
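From your description, I'm imagining something roughly like the below on the
allocation side (purely a hypothetical sketch on my end: the function name,
the xarray layout and the ext4_try_lock_group() signature are assumptions,
not taken from your patches):

static long pick_group(struct super_block *sb, struct xarray *groups,
                       unsigned long goal, unsigned long ngroups)
{
        struct ext4_group_info *grp;
        unsigned long idx;

        /* first the groups to the right of (and including) the goal ... */
        xa_for_each_range(groups, idx, grp, goal, ngroups - 1)
                if (ext4_try_lock_group(sb, idx))
                        return idx;

        /* ... then wrap around and walk group 0 up to the goal */
        if (goal) {
                xa_for_each_range(groups, idx, grp, 0, goal - 1)
                        if (ext4_try_lock_group(sb, idx))
                                return idx;
        }

        return -1;      /* every candidate group was busy */
}

Is that roughly the direction?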
>
> However, xarray traversal introduces overhead; list_head group selection
> was O(1), while xarray becomes O(n log n). This results in a ~10%
> performance drop in single-process scenarios, and I'm not entirely sure if
> this trade-off is worthwhile. 🤔
>
> Additionally, by attempting to merge before inserting in
> ext4_mb_free_metadata(), we can eliminate contention on sbi->s_md_lock
> during merges, resulting in roughly a 5% performance gain.
> >
> > - Or maybe it is that the lazyinit thread has not yet initialized all
> > the buddies yet which means we have lesser BGs in the freefrag list
> > or the order list used by faster CRs. Hence, if they are locked we
> > are falling more to CR_GOAL_LEN_SLOW. To check if this is the case,
> > one hack is to cat /proc/fs/ext4/<disk>/mb_groups (or something along
> > the lines) before the benchmark, which forces init of all the group
> > buddies thus populating all the lists used by mb_opt_scan. Maybe we
> > can check if this gives better results.
> All groups are already initialized at the time of testing, and that's not
> where the problem lies.
> >
> > 3. Also, how much IO are we doing here, are we filling the whole FS?
> >
> In a single container, create a file, then repeatedly append 8KB using
> fallocate until the file reaches 1MB. After that, truncate the file to
> 0 and continue appending 8KB with fallocate. The 64 containers will
> occupy a maximum of 64MB of disk space in total, so they won't fill the
> entire file system.
Also, as per your theory, if we do similar experiments in a fragmented
FS, we should see opt_scan perform better, right? I'd like to test this as
well. I'll try to play with the scripts you have shared.
Thanks again for the detailed response and scripts!
Regards,
ojaswin
>
> Cheers,
> Baokun
>
>
> ======================== test script ========================
>
> #!/bin/bash
>
> dir="/tmp/test"
> disk="/dev/sda"
> numjobs=64
> iodepth=128
>
> mkdir -p $dir
>
> for scan in 0 1 ; do
>     mkfs.ext4 -F -E lazy_itable_init=0,lazy_journal_init=0 -O orphan_file $disk
>     mount -o mb_optimize_scan=$scan $disk $dir
>
>     fio -directory=$dir -direct=1 -iodepth ${iodepth} -thread -rw=write \
>         -ioengine=falloc -bs=4k -fallocate=none -size=1G -numjobs=${numjobs} \
>         -group_reporting -name=job1 -cpus_allowed_policy=split -file_append=1
>
>     e2freefrag $disk
>     e4defrag -c $dir  # ** NOTE ** Without the patch, this could take 5-6 hours.
>     filefrag ${dir}/job* | awk '{print $2}' | awk '{sum+=$1} END {print sum/NR}'
>     umount $dir
> done
>
<snip>