Message-ID: <77077598-45d6-43dd-90a0-f3668a27ca15@huawei.com>
Date: Mon, 30 Jun 2025 14:50:30 +0800
From: Baokun Li <libaokun1@...wei.com>
To: Jan Kara <jack@...e.cz>
CC: <linux-ext4@...r.kernel.org>, <tytso@....edu>, <adilger.kernel@...ger.ca>,
	<ojaswin@...ux.ibm.com>, <linux-kernel@...r.kernel.org>,
	<yi.zhang@...wei.com>, <yangerkun@...wei.com>, Baokun Li
	<libaokun1@...wei.com>
Subject: Re: [PATCH v2 04/16] ext4: utilize multiple global goals to reduce
 contention

On 2025/6/28 2:31, Jan Kara wrote:
> On Mon 23-06-25 15:32:52, Baokun Li wrote:
>> When allocating data blocks, if the first try (goal allocation) fails and
>> stream allocation is on, it tries a global goal starting from the last
>> group we used (s_mb_last_group). This helps cluster large files together
>> to reduce free space fragmentation, and the data block contiguity also
>> accelerates write-back to disk.
>>
>> However, when multiple processes allocate blocks, having just one global
>> goal means they all fight over the same group. This drastically lowers
>> the chances of extents merging and leads to much worse file fragmentation.
>>
>> To mitigate this multi-process contention, we now employ multiple global
>> goals, with the number of goals being the CPU count rounded up to the
>> nearest power of 2. To ensure a consistent goal for each inode, we select
>> the corresponding goal by taking the inode number modulo the total number
>> of goals.
>>
>> Performance test data follows:
>>
>> Test: Running will-it-scale/fallocate2 on CPU-bound containers.
>> Observation: Average fallocate operations per container per second.
>>
>>                     | Kunpeng 920 / 512GB -P80|  AMD 9654 / 1536GB -P96 |
>>   Disk: 960GB SSD   |-------------------------|-------------------------|
>>                     | base  |    patched      | base  |    patched      |
>> -------------------|-------|-----------------|-------|-----------------|
>> mb_optimize_scan=0 | 7612  | 19699 (+158%)   | 21647 | 53093 (+145%)   |
>> mb_optimize_scan=1 | 7568  | 9862  (+30.3%)  | 9117  | 14401 (+57.9%)  |
>>
>> Signed-off-by: Baokun Li <libaokun1@...wei.com>
> ...
>
>> +/*
>> + * Number of mb last groups
>> + */
>> +#ifdef CONFIG_SMP
>> +#define MB_LAST_GROUPS roundup_pow_of_two(nr_cpu_ids)
>> +#else
>> +#define MB_LAST_GROUPS 1
>> +#endif
>> +
> I think this is too aggressive. nr_cpu_ids is easily 4096 or similar for
> distribution kernels (it is just a theoretical maximum for the number of
> CPUs the kernel can support)

nr_cpu_ids is generally equal to num_possible_cpus(). Only when
CONFIG_FORCE_NR_CPUS is enabled will nr_cpu_ids be set to NR_CPUS,
which represents the maximum number of supported CPUs.

> which seems like far too much for small
> filesystems with say 100 block groups.

That does make sense.

> I'd rather pick the array size like:
>
> min(num_possible_cpus(), sbi->s_groups_count/4)
>
> to
>
> a) don't have too many slots so we still concentrate big allocations in
> somewhat limited area of the filesystem (a quarter of block groups here).
>
> b) have at most one slot per CPU the machine hardware can in principle
> support.
>
> 								Honza

You're right, we should consider the number of block groups when setting
the number of global goals.

However, a server's rootfs can often be quite small, perhaps only tens of
GBs, while the machine has many CPUs. In such cases, sbi->s_groups_count / 4
might still limit the filesystem's scalability. Furthermore, once LBS
(large block size) is supported, the number of block groups will drop
sharply.

How about we use sbi->s_groups_count directly instead, so the effective
cap becomes min(num_possible_cpus(), sbi->s_groups_count)? This would
also avoid a zero slot count.


Cheers,
Baokun
