Message-ID: <bff134da-e056-4eaf-b5ac-cace99208e40@huawei.com>
Date: Tue, 10 Jun 2025 21:48:30 +0800
From: Baokun Li <libaokun1@...wei.com>
To: Ojaswin Mujoo <ojaswin@...ux.ibm.com>
CC: <linux-ext4@...r.kernel.org>, <tytso@....edu>, <adilger.kernel@...ger.ca>,
	<jack@...e.cz>, <linux-kernel@...r.kernel.org>, <yi.zhang@...wei.com>,
	<yangerkun@...wei.com>, Baokun Li <libaokun1@...wei.com>
Subject: Re: [PATCH 0/4] ext4: better scalability for ext4 block allocation

On 2025/6/10 20:06, Ojaswin Mujoo wrote:
> On Thu, May 29, 2025 at 08:24:14PM +0800, Baokun Li wrote:
>> On 2025/5/28 22:53, Ojaswin Mujoo wrote:
>>> On Fri, May 23, 2025 at 04:58:17PM +0800, libaokun@...weicloud.com wrote:
>>>> From: Baokun Li <libaokun1@...wei.com>
> <...>
>
>>>> |--------|--------|--------|--------|--------|--------|--------|--------|
>>>> |   -    |   1    |   2    |   4    |   8    |   16   |   32   |   64   |
>>>> |--------|--------|--------|--------|--------|--------|--------|--------|
>>>> | base   | 295287 |  70665 |  33865 |  19387 |  10104 |   5588 |   3588 |
>>>> |--------|--------|--------|--------|--------|--------|--------|--------|
>>>> | linear | 286328 | 123102 | 119542 |  90653 |  60344 |  35302 |  23280 |
>>>> |        |  -3.0% | 74.20% | 252.9% | 367.5% | 497.2% | 531.6% | 548.7% |
>>>> |--------|--------|--------|--------|--------|--------|--------|--------|
>>>> |mb_optim| 292498 | 133305 | 103069 |  61727 |  29702 |  16845 |  10430 |
>>>> |ize_scan|  -0.9% | 88.64% | 204.3% | 218.3% | 193.9% | 201.4% | 190.6% |
>>>> |--------|--------|--------|--------|--------|--------|--------|--------|
>>>
>>> Hey Baokun, nice improvements! The proposed changes make sense to me;
>>> however, I suspect the performance improvements may come at the cost
>>> of a slight increase in fragmentation, which might affect rotational
>>> disks especially. Maybe comparing e2freefrag numbers with and without
>>> the patches might give a better insight into this.
>>
>> While this approach might slightly increase free space fragmentation on
>> the disk, it significantly reduces file fragmentation, leading to
>> faster read speeds on rotational disks.
>>
>> When multiple processes contend for free blocks within the same block
>> group, the probability of blocks allocated by the same process being
>> merged on consecutive allocations is low. This is because other
>> processes may preempt the free blocks immediately following the current
>> process's last allocated region.
>>
>> Normally, we rely on preallocation to avoid files becoming overly
>> fragmented (even though preallocation itself can cause fragmentation in
>> free disk space). But since fallocate doesn't support preallocation,
>> the fragmentation issue is more pronounced. Counterintuitively,
>> skipping busy groups actually boosts opportunities for file extent
>> merging, which in turn reduces overall file fragmentation.
>>
>> Referencing will-it-scale/fallocate2, I tested 64 processes each
>> appending 4KB via fallocate to 64 separate files until they reached
>> 1GB. This test specifically examines contention in block allocation;
>> unlike fallocate2, it omits the contention between fallocate and
>> truncate. Preliminary results are provided below; detailed scripts and
>> full test outcomes are attached in the email footer.
>>
>> ---------------------------------------------------------------
>>                           |      base       |     patched     |
>> --------------------------|--------|--------|--------|--------|
>> mb_optimize_scan          | linear |opt_scan| linear |opt_scan|
>> --------------------------|--------|--------|--------|--------|
>> bw (MiB/s)                |    217 |    219 |   5685 |   5670 |
>> Avg. free extent size (KB)| 1943732| 1943728| 1439608| 1368328|
>> Avg. extents per file     |  261879|  262039|     744|    2084|
>> Avg. size per extent (KB) |       4|       4|    1408|     503|
>> Fragmentation score       |     100|     100|       2|       6|
>> ---------------------------------------------------------------
>
> Hi Baokun,
>
> Thanks for the info and data, and apologies for being late, I caught a
> viral infection last week :/

Hi Ojaswin,

Being sick really takes a toll. Please get some good rest and take care
of yourself.

> These numbers look pretty interesting and your explanation of why the
> fragmentation is better makes sense. It is definitely a win-win then
> for performance and fragmentation!
>
>>> Regardless, the performance benefits are significant and I feel it is
>>> good to have these patches.
>>>
>>> I'll give my reviews individually as I'm still going through patch 2.
>>> However, I wanted to check on a couple of things:
>>
>> Okay, thank you for your feedback.
>>
>>> 1. I believe you ran these in docker. Would you have any scripts etc.
>>>    open sourced that I can use to run some benchmarks on my end (and
>>>    also understand your test setup)?
>>
>> Yes, these two patches primarily mitigate contention between block
>> allocations and between block allocation and release. For the testing
>> scripts, refer to the fio script mentioned earlier in the email footer.
>> You can also add more truncate calls based on it.
>
> Thanks for the scripts.
>
>>> 2. I notice we are getting way less throughput in mb_optimize_scan? I
>>>    wonder why that is the case. Do you have some data on that? Are
>>>    your tests starting on an empty FS? Maybe in that case linear scan
>>>    works a bit better, since almost all groups are empty. If so, what
>>>    are the numbers like when we start with a fragmented FS?
>>
>> The throughput of mb_optimize_scan is indeed much lower, and we
>> continue to optimize it, as mb_optimize_scan is the default mount
>> option and performs better in scenarios with large disks and high
>> space usage.
>>
>> Disk space used is about 7%; mb_optimize_scan should perform better
>> with less free space. However, this isn't the critical factor. The
>> poor throughput here is due to the following reasons.
>>
>> One reason is that mb_optimize_scan's list traversal is unordered and
>> always selects the first group.
>>
>> While traversing the list, the spin_lock we hold prevents loading the
>> buddy, making direct use of ext4_lock_group impossible. This can lead
>> to a "bouncing" scenario where the spin_is_locked(grp_A) check passes,
>> but ext4_try_lock_group() then fails, forcing the list traversal to
>> repeatedly restart from grp_A.
>>
>> In contrast, linear traversal directly uses ext4_try_lock_group(),
>> avoiding this bouncing. Therefore, we need a lockless, ordered
>> traversal to achieve linear-like efficiency.
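A minimal userspace sketch of the try-and-skip pattern just described,
with pthread spinlocks standing in for ext4's per-group locks
(illustrative only, not the actual patch):

/*
 * Illustrative only: a userspace analogue of skipping busy block groups.
 * Each "group" gets its own spinlock; an allocating thread try-locks and
 * moves on instead of spinning when another thread owns the group.
 */
#include <pthread.h>

#define NGROUPS 128

static pthread_spinlock_t group_lock[NGROUPS];

/* Call once before spawning allocator threads. */
static void groups_init(void)
{
	for (int g = 0; g < NGROUPS; g++)
		pthread_spin_init(&group_lock[g], PTHREAD_PROCESS_PRIVATE);
}

/* Linear scan from the goal group; busy groups are skipped, not waited on. */
static int alloc_from_first_idle_group(int goal)
{
	for (int i = 0; i < NGROUPS; i++) {
		int g = (goal + i) % NGROUPS;

		/* Non-zero (EBUSY) means another thread is allocating here. */
		if (pthread_spin_trylock(&group_lock[g]) != 0)
			continue;

		/* ... scan group g's buddy bitmap and try to allocate ... */

		pthread_spin_unlock(&group_lock[g]);
		return g;	/* sketch: assume the allocation succeeded */
	}
	return -1;	/* every group was busy; a real caller would retry */
}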
> Hmm, right, the non-ordered traversal has led to other issues as well
> in the past.
>
>> Another reason is that opt_scan tends to allocate from groups that
>> have just received freed blocks, causing it to constantly "jump
>> around" between certain groups.
>>
>> This leads to intense contention between allocation and release, and
>> even between release events. In contrast, linear traversal always
>> moves forward without revisiting groups, resulting in less contention
>> between allocation and release.
>
> By "just received freed blocks", you mean blocks that got freed in the
> group, right?

Yes.

> I was under the impression that when we free blocks and the group's
> largest order / avg fragment changes, the group is added to the end of
> the free fragment list / order list, so it should be the last to be
> picked. Is that not the case?

Yes, we are indeed adding the group to the list tail. However, because
we traverse all ordered lists from low to high, a group might end up
"bouncing" repeatedly between order_0 and order_1 here.

For instance, if order_1 only contains group 1, linear traversal would
rarely revisit it after the initial pass. However, after a non-linear
allocation, this group is moved from the order_1 list to the order_0
list. When blocks are freed, it's moved back to the order_1 list, and
then non-linear traversal prioritizes allocation in this same group
again...

>>
>> However, because linear traversal involves more groups in allocation,
>> the journal becomes a bottleneck. If opt_scan first attempts to
>> traverse block groups to the right of the target group in all lists,
>> and then from index 0 to the left in all lists, competition between
>> block groups would be significantly reduced.
>>
>> To enable ordered traversal, we attempted to convert the list_head to
>> an ordered xarray. This ordering prevents "bouncing" during retries.
>> Additionally, traversing all right-side groups before left-side groups
>> significantly reduced contention. Performance improved from 10430 to
>> 17730.
>
> Do you maybe have some code you can share for this?

Yes, V2 will include those changes.
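Roughly, the traversal order is what the sketch below shows (an
illustration of the idea only, not the V2 code; try_group() would be a
try-lock-and-allocate helper like the one sketched earlier):

/*
 * Illustrative only: goal-relative, ordered two-pass scan. Scanning an
 * ordered index instead of an unordered list means a retry resumes at
 * the next group rather than bouncing back to the same contended one.
 */
static int scan_groups_ordered(int ngroups, int goal,
			       int (*try_group)(int g))
{
	int g;

	/* Pass 1: the goal group and everything to its right. */
	for (g = goal; g < ngroups; g++)
		if (try_group(g) == 0)
			return g;

	/* Pass 2: wrap around and scan from index 0 up to the goal. */
	for (g = 0; g < goal; g++)
		if (try_group(g) == 0)
			return g;

	return -1;	/* nothing allocatable in this list */
}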
>> However, xarray traversal introduces overhead; list_head group
>> selection was O(1), while xarray becomes O(n log n). This results in
>> a ~10% performance drop in single-process scenarios, and I'm not
>> entirely sure if this trade-off is worthwhile. 🤔
>>
>> Additionally, by attempting to merge before inserting in
>> ext4_mb_free_metadata(), we can eliminate contention on sbi->s_md_lock
>> during merges, resulting in roughly a 5% performance gain.
>>
>>> - Or maybe it is that the lazyinit thread has not yet initialized all
>>>   the buddies, which means we have fewer BGs in the freefrag list or
>>>   the order list used by faster CRs. Hence, if they are locked we are
>>>   falling more to CR_GOAL_LEN_SLOW. To check if this is the case, one
>>>   hack is to cat /proc/fs/ext4/<disk>/mb_groups (or something along
>>>   those lines) before the benchmark, which forces init of all the
>>>   group buddies, thus populating all the lists used by mb_opt_scan.
>>>   Maybe we can check if this gives better results.
>>
>> All groups are already initialized at the time of testing, so that's
>> not where the problem lies.
>>
>>> 3. Also, how much IO are we doing here? Are we filling the whole FS?
>>
>> In a single container, create a file, then repeatedly append 8KB using
>> fallocate until the file reaches 1MB. After that, truncate the file to
>> 0 and continue appending 8KB with fallocate. The 64 containers will
>> occupy a maximum of 64MB of disk space in total, so they won't fill
>> the entire file system.
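Each container's loop is essentially the following (illustrative only;
the real runs drive this through fio, and the file path here is made
up):

/*
 * Illustrative only: one worker of the append-via-fallocate workload.
 * Append 8KB with fallocate() until the file reaches 1MB, truncate it
 * back to 0, and repeat; every iteration exercises the block allocator.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define CHUNK	(8 * 1024)
#define LIMIT	(1024 * 1024)

int main(void)
{
	/* hypothetical path; the real test uses one file per container */
	int fd = open("/mnt/test/falloc-worker", O_CREAT | O_RDWR, 0644);
	off_t size = 0;

	if (fd < 0)
		return 1;

	for (;;) {	/* run until killed, like the benchmark loop */
		if (fallocate(fd, 0, size, CHUNK) != 0)
			return 1;
		size += CHUNK;
		if (size >= LIMIT) {
			if (ftruncate(fd, 0) != 0)
				return 1;
			size = 0;
		}
	}
}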
> Also, as per your theory, if we do similar experiments on a fragmented
> FS we should see opt_scan perform better, right? I'd like to test this
> as well. I'll try to play with the scripts you have shared.

Yes, mb_optimize_scan performs well when free space fragmentation is
severe. We have a 600TB disk array where the write bandwidth is
approximately 5 GB/s when empty. When space utilization reaches 95%,
linear traversal drops bandwidth to 1 GB/s, whereas enabling
mb_optimize_scan restores it to 4.2 GB/s.

Cheers,
Baokun