Message-ID: <5oqysbekjn7vazkzrh4lgtg25vqqxgrugvld6av7r2nx7dbghr@kk4yidjw735c>
Date: Mon, 2 Jun 2025 17:44:30 +0200
From: Jan Kara <jack@...e.cz>
To: Baokun Li <libaokun1@...wei.com>
Cc: Jan Kara <jack@...e.cz>, linux-ext4@...r.kernel.org, tytso@....edu,
	adilger.kernel@...ger.ca, linux-kernel@...r.kernel.org,
	yi.zhang@...wei.com, yangerkun@...wei.com, libaokun@...weicloud.com
Subject: Re: [PATCH 2/4] ext4: move mb_last_[group|start] to ext4_inode_info

Hello!

On Fri 30-05-25 17:31:48, Baokun Li wrote:
> On 2025/5/29 20:56, Jan Kara wrote:
> > On Fri 23-05-25 16:58:19, libaokun@...weicloud.com wrote:
> > > From: Baokun Li <libaokun1@...wei.com>
> > >
> > > After we optimized the block group lock, we found another lock
> > > contention issue when running will-it-scale/fallocate2 with multiple
> > > processes. The fallocate's block allocation and the truncate's block
> > > release were fighting over s_md_lock. The problem is, this lock
> > > protects totally different things in those two processes: the list of
> > > freed data blocks (s_freed_data_list) when releasing, and where to start
> > > looking for new blocks (mb_last_[group|start]) when allocating.
> > >
> > > Moreover, when allocating data blocks, if the first try (goal allocation)
> > > fails and stream allocation is on, it tries a global goal starting from
> > > the last group we used (s_mb_last_group). This can make things faster by
> > > writing blocks close together on the disk. But when many processes are
> > > allocating, they all fight over s_md_lock and might even try to use the
> > > same group. This makes it harder to merge extents and can make files
> > > more fragmented. If different processes allocate chunks of very
> > > different sizes, the free space on the disk can also get fragmented.
> > > A small allocation might fit in a partially full group, but a big
> > > allocation might have skipped it, leading to the small IO ending up in
> > > an emptier group.
> > >
> > > So, we're changing stream allocation to work per inode. First, it tries
> > > the goal, then the last group where that inode successfully allocated a
> > > block. This keeps an inode's data closer together. Plus, after moving
> > > mb_last_[group|start] to ext4_inode_info, we don't need s_md_lock during
> > > block allocation anymore because we already have the write lock on
> > > i_data_sem. This gets rid of the contention between allocating and
> > > releasing blocks, which gives a huge performance boost to fallocate2.
> > >
> > > Performance test data follows:
> > >
> > > CPU: HUAWEI Kunpeng 920
> > > Memory: 480GB
> > > Disk: 480GB SSD SATA 3.2
> > > Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
> > > Observation: Average fallocate operations per container per second.
> > >
> > >                        base     patched
> > > mb_optimize_scan=0     6755     23280 (+244.6%)
> > > mb_optimize_scan=1     4302     10430 (+142.4%)
> > >
> > > Signed-off-by: Baokun Li <libaokun1@...wei.com>
> > Good spotting with the s_md_lock contention here. But your changes don't
> > quite make sense to me. The idea of streaming allocation in mballoc is to
> > have an area of the filesystem for large files to reduce fragmentation.
> > When you switch to per-inode goals, this effect of packing large files
> > together goes away. Furthermore, for each inode all allocations will very
> > likely be either streaming or non-streaming (the logic uses file size),
> > so your per-inode target will either be unused or just another constantly
> > used copy of the goal value.
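The per-inode hint the quoted patch describes can be illustrated with a small, self-contained userspace C model. Everything below is illustrative only: the struct and function names are invented, and a pthread rwlock merely stands in for i_data_sem; it is not the actual ext4 code.

/*
 * Minimal userspace model of the per-inode stream-allocation hint from the
 * quoted patch. All names are invented for illustration; the pthread rwlock
 * merely stands in for i_data_sem. This is not the ext4 implementation.
 */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

struct model_inode {
	pthread_rwlock_t data_sem;	/* stands in for i_data_sem */
	uint32_t mb_last_group;		/* last group this inode allocated from */
	uint32_t mb_last_start;		/* offset of that allocation in the group */
};

/*
 * Used only after the goal-based attempt failed; the caller already holds
 * data_sem for writing, so no extra (global) lock is needed to read the hint.
 */
static uint32_t stream_start_group(const struct model_inode *inode)
{
	return inode->mb_last_group;
}

/* Record where this inode last allocated, again under the caller's data_sem. */
static void record_allocation(struct model_inode *inode,
			      uint32_t group, uint32_t start)
{
	inode->mb_last_group = group;
	inode->mb_last_start = start;
}

int main(void)
{
	struct model_inode inode = { .data_sem = PTHREAD_RWLOCK_INITIALIZER };

	pthread_rwlock_wrlock(&inode.data_sem);
	record_allocation(&inode, 7, 128);
	printf("stream allocation would resume at group %u\n",
	       stream_start_group(&inode));
	pthread_rwlock_unlock(&inode.data_sem);
	return 0;
}

The point of the model is simply that the hint's lifetime and locking follow the inode, so the allocation path no longer touches s_md_lock.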
> Sorry, I didn't intend to break streaming allocation's semantics.
> I could not find a precise definition of streaming allocation in the
> existing code, documentation, or historical records, so my previous
> understanding of it was somewhat inaccurate.
>
> I previously thought it was used to optimize the efficiency of linear
> traversal. For instance, if 500 inodes are created in group 0 and each
> file is sequentially filled to 1GB, each file's goal, since the file
> starts out empty, would be group 1 (the second group in the inode's
> flex_bg).
>
> Without a global goal and in the absence of non-linear traversal,
> after the first inode is filled, the second inode would need to traverse
> groups 1 through 8 to find its first free block.
>
> This inefficiency escalates, eventually requiring the 500th inode to
> potentially traverse almost 4000 block groups to find its first available
> block.

I see. But doesn't ext4_mb_choose_next_group() usually select a group from
which the allocation can succeed instead of linearly scanning through all
the groups? The linear scan is just a last resort as far as I remember.

Anyway, I'm not 100% sure what the original motivation for the streaming
goal was. Maybe Andreas would remember since he was involved in the design.
What I wrote is mostly derived from a general understanding of mballoc's
operating principles, but I could be wrong.

> I initially believed it could be split to the inode level to reduce
> traversal time and file fragmentation. However, as you pointed out,
> its purpose is to cluster large files, not data blocks within a file.
> Given this, splitting it to the inode level no longer makes sense.
> > So I can see two sensible solutions here:
> > a) Drop streaming allocations support altogether.
> As mentioned above, it can also greatly improve the efficiency of linear
> traversal, so we can't simply remove it.
> >
> > b) Enhance streaming allocation support to avoid contention between
> > processes allocating in parallel and freeing. Frankly, there's no strong
> > reason why reads & writes of the streaming allocation goal need to use a
> > spinlock AFAICS.
> Yes, since it's just a hint, we don't need a lock at all; we don't even
> need fe_start, we just need the last fe_group.
>
> > We could just store a physical block number and use
> > atomic64 accessors for it? Also, having a single goal value is just
> > causing more contention on group locks for parallel writers that end up
> > using it (that's the problem I suspect you were hitting the most).
> Spot on! We did try a single, lockless atomic64 variable, and just as
> you pointed out, all processes started traversing from the very same
> group, which just cranked up the contention, dropping OPS to just 6745.
>
> > So perhaps we
> > can keep multiple streaming goal slots in the superblock (scale the count
> > based on CPU count & filesystem group count) and just pick the slot based
> > on inode number hash to reduce contention?
> >
> > 								Honza
> That's a brilliant idea, actually!
>
> Since most containers are CPU-pinned, this would naturally cluster a single
> container's data blocks together. I believe we can also apply this to inode
> allocation, so a container's inodes and data are all in a single region,
> significantly reducing interference between containers.
>
> Thank you for the valuable suggestion!
> I'm going to try out the CPU bucketing approach.

Cool, let's see how it works out :).

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR
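Jan's suggestion above (multiple streaming-goal slots in the superblock, sized from the CPU and group counts, picked by an inode-number hash, read and written locklessly as 64-bit physical block numbers) can be sketched in a few lines of userspace C. The sketch below is purely illustrative: the names, the Fibonacci-style hash, and C11 atomics standing in for the kernel's atomic64_t are all assumptions, not the eventual ext4 implementation.

/*
 * Userspace sketch of the "multiple streaming goal slots" idea from this
 * thread: an array of lock-free 64-bit goals, slot chosen by hashing the
 * inode number. Illustrative only; not the ext4 implementation.
 */
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct stream_goals {
	unsigned int nr_slots;		/* kept a power of two for cheap masking */
	_Atomic uint64_t *slots;	/* last allocated physical block, 0 = unset */
};

static unsigned int roundup_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

static int stream_goals_init(struct stream_goals *sg, long ncpus)
{
	unsigned int i;

	/* Scale the slot count with the CPU count (a real implementation
	 * would also cap it by the number of block groups). */
	sg->nr_slots = roundup_pow2(ncpus > 0 ? (unsigned int)ncpus : 1);
	sg->slots = malloc(sg->nr_slots * sizeof(*sg->slots));
	if (!sg->slots)
		return -1;
	for (i = 0; i < sg->nr_slots; i++)
		atomic_init(&sg->slots[i], 0);
	return 0;
}

static unsigned int goal_slot(const struct stream_goals *sg, uint64_t ino)
{
	/* Fibonacci-style hash of the inode number, masked down to a slot. */
	return (unsigned int)((ino * 0x9E3779B97F4A7C15ULL) >> 32) &
	       (sg->nr_slots - 1);
}

static uint64_t stream_goal_load(const struct stream_goals *sg, uint64_t ino)
{
	/* A stale read is fine: the value is only a hint. */
	return atomic_load_explicit(&sg->slots[goal_slot(sg, ino)],
				    memory_order_relaxed);
}

static void stream_goal_store(struct stream_goals *sg, uint64_t ino,
			      uint64_t pblock)
{
	atomic_store_explicit(&sg->slots[goal_slot(sg, ino)], pblock,
			      memory_order_relaxed);
}

int main(void)
{
	struct stream_goals sg;

	if (stream_goals_init(&sg, sysconf(_SC_NPROCESSORS_ONLN)))
		return 1;
	stream_goal_store(&sg, 12345, 0xABCDEF);	/* remember last allocation */
	printf("stream goal for inode 12345: block %#llx\n",
	       (unsigned long long)stream_goal_load(&sg, 12345));
	free(sg.slots);
	return 0;
}

Baokun's "CPU bucketing" variant would presumably differ only in how the slot index is chosen (per CPU rather than by inode-number hash); the lock-free slot array itself stays the same.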