Message-ID: <5oqysbekjn7vazkzrh4lgtg25vqqxgrugvld6av7r2nx7dbghr@kk4yidjw735c>
Date: Mon, 2 Jun 2025 17:44:30 +0200
From: Jan Kara <jack@...e.cz>
To: Baokun Li <libaokun1@...wei.com>
Cc: Jan Kara <jack@...e.cz>, linux-ext4@...r.kernel.org, tytso@....edu, 
	adilger.kernel@...ger.ca, linux-kernel@...r.kernel.org, yi.zhang@...wei.com, 
	yangerkun@...wei.com, libaokun@...weicloud.com
Subject: Re: [PATCH 2/4] ext4: move mb_last_[group|start] to ext4_inode_info

Hello!

On Fri 30-05-25 17:31:48, Baokun Li wrote:
> On 2025/5/29 20:56, Jan Kara wrote:
> > On Fri 23-05-25 16:58:19, libaokun@...weicloud.com wrote:
> > > From: Baokun Li <libaokun1@...wei.com>
> > > 
> > > After we optimized the block group lock, we found another lock
> > > contention issue when running will-it-scale/fallocate2 with multiple
> > > processes. fallocate's block allocation and truncate's block release
> > > were fighting over s_md_lock. The problem is that this lock protects
> > > totally different things in those two paths: the list of freed data
> > > blocks (s_freed_data_list) when releasing, and where to start looking
> > > for new blocks (mb_last_[group|start]) when allocating.
> > > 
> > > Moreover, when allocating data blocks, if the first try (goal
> > > allocation) fails and stream allocation is on, it tries a global goal
> > > starting from the last group we used (s_mb_last_group). This can make
> > > things faster by writing blocks close together on the disk. But when
> > > many processes are allocating, they all fight over s_md_lock and might
> > > even try to use the same group. This makes it harder to merge extents
> > > and can make files more fragmented. If different processes allocate
> > > chunks of very different sizes, the free space on the disk can also get
> > > fragmented: a small allocation might have fit in a partially full group,
> > > but a big allocation skipped that group and moved the shared goal past
> > > it, so the small IO ends up in an emptier group instead.
> > > 
> > > So, we're changing stream allocation to work per inode. First, it tries
> > > the goal, then the last group where that inode successfully allocated a
> > > block. This keeps an inode's data closer together. Plus, after moving
> > > mb_last_[group|start] to ext4_inode_info, we don't need s_md_lock during
> > > block allocation anymore because we already have the write lock on
> > > i_data_sem. This gets rid of the contention between allocating and
> > > releasing blocks, which gives a huge performance boost to fallocate2.
> > > 
> > > Performance test data follows:
> > > 
> > > CPU: HUAWEI Kunpeng 920
> > > Memory: 480GB
> > > Disk: 480GB SSD SATA 3.2
> > > Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
> > > Observation: Average fallocate operations per container per second.
> > > 
> > >                        base     patched
> > > mb_optimize_scan=0    6755     23280 (+244.6%)
> > > mb_optimize_scan=1    4302     10430 (+142.4%)
> > > 
> > > Signed-off-by: Baokun Li <libaokun1@...wei.com>
> > Good spotting with the s_md_lock contention here. But your changes don't
> > quite make sense to me. The idea of streaming allocation in mballoc is to
> > have an area of the filesystem for large files to reduce fragmentation.
> > When you switch to per-inode, this effect of packing large files together
> > goes away. Furthermore, for each inode either all allocations will very
> > likely be streaming or none will be (the logic uses the file size), so
> > your per-inode target will either be unused or just be another constantly
> > used copy of the goal value.
> Sorry, I didn't intend to break streaming allocation's semantics.
> I couldn't find a precise definition of streaming allocation in the
> existing code, documentation, or historical records, so my previous
> understanding of it was somewhat inaccurate.
> 
> I previously thought it was used to optimize the efficiency of linear
> traversal. For instance, if 500 inodes are created in group 0 and each
> file is sequentially filled to 1GB, then while a file is still empty its
> goal would be group 1 (the second group in the inode's flex_bg).
> 
> Without a global goal and in the absence of non-linear traversal,
> after the first inode is filled, the second inode would need to traverse
> groups 1 through 8 to find its first free block.
> 
> This inefficiency escalates, eventually requiring the 500th inode to
> potentially traverse almost 4000 block groups to find its first available
> block.

I see. But doesn't ext4_mb_choose_next_group() usually select a group from
which the allocation can succeed instead of linearly scanning through all
the groups? The linear scan is just a last resort as far as I remember.
Anyway, I'm not 100% sure what the original motivation for the streaming
goal was. Maybe Andreas would remember since he was involved in the design.
What I wrote is mostly derived from my general understanding of mballoc's
operating principles, but I could be wrong.

> I initially believed it could be split to the inode level to reduce
> traversal time and file fragmentation. However, as you pointed out,
> its purpose is to cluster large files, not data blocks within a file.
> Given this, splitting it to the inode level no longer makes sense.
> > So I can see two sensible solutions here:
> > a) Drop streaming allocations support altogether.
> As mentioned above, it can also greatly improve the efficiency of linear
> traversal, so we can't simply remove it.
> > 
> > b) Enhance streaming allocation support to avoid contention between
> > processes allocating in parallel and those freeing. Frankly, there's no
> > strong reason why reads & writes of the streaming allocation goal need
> > to use a spinlock AFAICS.
> Yes, since it's just a hint, we don't need a lock at all; we don't even
> need fe_start, just the last fe_group.
> > We could just store a physical block number and use
> > atomic64 accessors for it? Also, having a single goal value just causes
> > more contention on group locks for parallel writers that end up using it
> > (that's the problem I suspect you were hitting the most).
> Spot on! We did try a single, lockless atomic64 variable, and just as
> you pointed out, all processes started traversing from the very same
> group, which only cranked up the contention, dropping OPS to just 6745.
> >   So perhaps we
> > can keep multiple streaming goal slots in the superblock (scale the count
> > based on CPU count & filesystem group count) and just pick the slot based
> > on inode number hash to reduce contention?
> > 
> > 								Honza
> That's a brilliant idea, actually!
> 
> Since most containers are CPU-pinned, this would naturally cluster a single
> container's data blocks together. I believe we can also apply this to inode
> allocation, so a container's inodes and data are all in a single region,
> significantly reducing interference between containers.
> 
> Thank you for the valuable suggestion!
> I'm going to try out the CPU bucketing approach.

Cool, let's see how it works out :).
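
Just to make sure we're talking about the same thing, below is a rough and
completely untested sketch of what I mean. The new field and helper names
(s_mb_nr_global_goals, s_mb_last_groups, ext4_mb_goal_slot(), ...) and the
slot-count heuristic are made up for illustration; and since, as you say,
only the group number is needed, plain READ_ONCE()/WRITE_ONCE() on a
per-slot ext4_group_t should do instead of a full atomic64 physical block:

#include <linux/hash.h>

/* in struct ext4_sb_info */
	unsigned int	 s_mb_nr_global_goals;	/* number of goal slots */
	ext4_group_t	*s_mb_last_groups;	/* last used group per slot */

/* at mount time: scale the slot count with CPUs and group count */
static int ext4_mb_init_global_goals(struct super_block *sb)
{
	struct ext4_sb_info *sbi = EXT4_SB(sb);

	sbi->s_mb_nr_global_goals = min_t(unsigned int, num_possible_cpus(),
				max_t(unsigned int, 1,
				      ext4_get_groups_count(sb) / 4));
	sbi->s_mb_last_groups = kcalloc(sbi->s_mb_nr_global_goals,
					sizeof(ext4_group_t), GFP_KERNEL);
	if (!sbi->s_mb_last_groups)
		return -ENOMEM;
	return 0;
}

/* pick a slot from the inode number so unrelated inodes rarely collide */
static inline unsigned int ext4_mb_goal_slot(struct ext4_allocation_context *ac)
{
	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);

	return hash_long(ac->ac_inode->i_ino, 32) % sbi->s_mb_nr_global_goals;
}

/* lockless read of the streaming goal - it is only a hint anyway */
static void ext4_mb_use_global_goal(struct ext4_allocation_context *ac)
{
	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);

	ac->ac_g_ex.fe_group =
		READ_ONCE(sbi->s_mb_last_groups[ext4_mb_goal_slot(ac)]);
	ac->ac_g_ex.fe_start = 0;	/* no within-group start hint */
}

/* lockless update after a successful stream allocation */
static void ext4_mb_store_global_goal(struct ext4_allocation_context *ac)
{
	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);

	WRITE_ONCE(sbi->s_mb_last_groups[ext4_mb_goal_slot(ac)],
		   ac->ac_f_ex.fe_group);
}

Keying the slot on smp_processor_id() instead would give you the per-CPU
bucketing you mention; which of the two fragments less is something the
numbers will have to tell.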

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR
