Message-Id: <20250523085821.1329392-1-libaokun@huaweicloud.com>
Date: Fri, 23 May 2025 16:58:17 +0800
From: libaokun@...weicloud.com
To: linux-ext4@...r.kernel.org
Cc: tytso@....edu, adilger.kernel@...ger.ca, jack@...e.cz,
	linux-kernel@...r.kernel.org, yi.zhang@...wei.com, yangerkun@...wei.com,
	libaokun1@...wei.com, libaokun@...weicloud.com
Subject: [PATCH 0/4] ext4: better scalability for ext4 block allocation

From: Baokun Li <libaokun1@...wei.com>

Since servers have more and more CPUs, and we're running more containers
on them, we've been using will-it-scale to test how well ext4 scales. The
fallocate2 test (append 8KB to 1MB, truncate to 0, repeat) run concurrently
on 64 containers revealed significant contention in block allocation/free,
leading to much lower aggregate fallocate OPS compared to a single
container (see below).

   1   |   2    |   4    |   8    |   16   |   32   |   64
-------|--------|--------|--------|--------|--------|-------
295287 | 70665  | 33865  | 19387  | 10104  |  5588  |  3588

The main bottleneck was the ext4_lock_group(), which both block allocation
and free fought over. While the block group for block free is fixed and
unoptimizable, the block group for allocation is selectable. Consequently,
the ext4_try_lock_group() helper function was added to avoid contention on
busy groups, and you can see more in Patch 1.

After we fixed the ext4_lock_group bottleneck, another one showed up:
s_md_lock. This lock protects different data when allocating and freeing
blocks. We got rid of the s_md_lock call in block allocation by making
stream allocation work per inode instead of globally. You can find more
details in Patch 2.

Patches 3 and 4 are just some minor cleanups.

Performance test data follows:

CPU: HUAWEI Kunpeng 920
Memory: 480GB
Disk: 480GB SSD SATA 3.2
Test: Running will-it-scale/fallocate2 on 64 CPU-bound containers.
Observation: Average fallocate operations per container per second.

|--------|--------|--------|--------|--------|--------|--------|--------|
|    -   |    1   |    2   |    4   |    8   |   16   |   32   |   64   |
|--------|--------|--------|--------|--------|--------|--------|--------|
|  base  | 295287 | 70665  | 33865  | 19387  | 10104  |  5588  |  3588  |
|--------|--------|--------|--------|--------|--------|--------|--------|
| linear | 286328 | 123102 | 119542 | 90653  | 60344  | 35302  | 23280  |
|        | -3.0%  | 74.20% | 252.9% | 367.5% | 497.2% | 531.6% | 548.7% |
|--------|--------|--------|--------|--------|--------|--------|--------|
|mb_optim| 292498 | 133305 | 103069 | 61727  | 29702  | 16845  | 10430  |
|ize_scan| -0.9%  | 88.64% | 204.3% | 218.3% | 193.9% | 201.4% | 190.6% |
|--------|--------|--------|--------|--------|--------|--------|--------|

Running "kvm-xfstests -c ext4/all -g auto" showed that 1k/generic/347 often
fails. The test seems to think that a dm-thin device with a virtual size of
5000M but a real size of 500M, after being formatted as ext4, would have
500M free.

But it doesn't – we run out of space after making about 430 1M files. Since
the block size is 1k, making so many files turns on dir_index, and dm-thin
waits a minute, sees no free space, and then throws IO error. This can cause
a directory index block to fail to write and abort journal. What's worse is
that _dmthin_check_fs doesn't replay the journal, so fsck finds
inconsistencies and the test failed. I think the problem is with the test
itself, and I'll send a patch to fix it soon.

Comments and questions are, as always, welcome.

Thanks,
Baokun

Baokun Li (4):
  ext4: add ext4_try_lock_group() to skip busy groups
  ext4: move mb_last_[group|start] to ext4_inode_info
  ext4: get rid of some obsolete EXT4_MB_HINT flags
  ext4: fix typo in CR_GOAL_LEN_SLOW comment

 fs/ext4/ext4.h              | 38 ++++++++++++++++++-------------------
 fs/ext4/mballoc.c           | 34 +++++++++++++++++++--------------
 fs/ext4/super.c             |  2 ++
 include/trace/events/ext4.h |  3 ---
 4 files changed, 41 insertions(+), 36 deletions(-)

-- 
2.46.1
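
As a rough userspace illustration of the idea described for Patch 1 (the
allocation group is selectable, so a group whose lock is contended can be
skipped), here is a minimal sketch using POSIX mutexes. It is not the kernel
code from the series: NGROUPS, group_lock, groups_init() and
lock_some_group() are hypothetical names, and the blocking fallback on the
goal group is only an assumption about how a caller could still guarantee
progress when every group is busy.

```c
#include <assert.h>
#include <pthread.h>

#define NGROUPS 4   /* hypothetical number of block groups */

/* One lock per group, standing in for the per-group lock. */
pthread_mutex_t group_lock[NGROUPS];

void groups_init(void)
{
    for (int i = 0; i < NGROUPS; i++)
        pthread_mutex_init(&group_lock[i], NULL);
}

/*
 * Scan the groups starting at `goal` and take the first lock that is not
 * currently held. If every group is busy, block on the goal group so the
 * caller always ends up holding exactly one lock.
 * Returns the index of the group whose lock is now held.
 */
int lock_some_group(int goal)
{
    assert(goal >= 0 && goal < NGROUPS);

    for (int i = 0; i < NGROUPS; i++) {
        int g = (goal + i) % NGROUPS;

        /* trylock returns 0 on success and does not wait if busy. */
        if (pthread_mutex_trylock(&group_lock[g]) == 0)
            return g;
    }

    /* Assumed fallback: wait for the goal group rather than fail. */
    pthread_mutex_lock(&group_lock[goal]);
    return goal;
}
```

A caller would run groups_init() once, call lock_some_group(goal), allocate
from the returned group, and then release it with
pthread_mutex_unlock(&group_lock[g]). The freeing path cannot use this trick,
because the group of the blocks being freed is fixed.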