linux-ext4 - Parallel fsck performance degradation case discussion

Message-ID: <20220301025706.e5vxlanadb2ppwvv@riteshh-domain>
Date:   Tue, 1 Mar 2022 08:27:28 +0530
From:   Ritesh Harjani <riteshh@...ux.ibm.com>
To:     Andreas Dilger <adilger@...ger.ca>
Cc:     "Theodore Ts'o" <tytso@....edu>, Wang Shilong <wshilong@....com>,
        Harshad Shirwadkar <harshadshirwadkar@...il.com>,
        Jan Kara <jack@...e.cz>,
        linux-ext4 <linux-ext4@...r.kernel.org>
Subject: Parallel fsck performance degradation case discussion

Hello,

I am working to help merge ext4's parallel fsck into upstream e2fsprogs.
Ted has provided some details here[1] on the work needed to get it
accepted/merged upstream.

However, in this email I mostly want to discuss some performance (perf) observations
and to check whether multi-threaded fsck has already been tested on such cases.

I have been testing different FS layouts on different disk types to measure the
performance benefit. Here are some of my observations. I would like to know
whether they are in line with yours.
I mainly want to discuss Case-4, to see if it is already a known limitation.

Case-1: Huge no. of 0 byte sized inodes (22M inodes)
We do see performance benefits with pfsck in this use case (I saw around 3x improvement with ramfs).
This is also true for all disk/device setups i.e. ramfs based ext4 FS using loop device,
on HDD and on NVMes (perf improvements can vary based on disk types too).

Case-2: Huge no. of 4KB-32KB sized inodes/directories (22M inodes)
We do see performance benefits with pfsck in this use case as well (again around 3x improvement with ramfs).
This is also true for all disk/device setups i.e. ramfs based ext4 FS using loop device,
on HDD and on NVMes (perf improvements can vary based on disk types).

Case-3: Large directories (with many 0 byte files within these directories)
In this case, mostly pass-2 takes significant time, but again we do see performance
improvements with pass-1 for all different disk/device setups.

Case-4: Files with heavy fragmentation i.e. lots of extents.
(creating this FS layout roughly by running script1.sh followed by script2.sh mentioned at the end of this email)
In this case we start seeing performance degradation if the I/O device is fast enough.
1. On a single HDD, we see a significant perf reduction of > ~30% (with pfsck compared to non-pfsck).
2. With single nvme, similar perf reduction or more.
3. ramfs based single loop device setup - ~100% perf reduction.
4. ramfs based 4 loop devices with dm_delay on top and with SW raid0 config (md0) (i.e. with 4 dm-delay devices of 50G each in raid0).
    a. With a delay of 0ms we see a performance degradation of ~100%, i.e. the runtime doubles (10s v/s 20s).
       Below is the perf profile where the performance degradation is seen (with pfsck -m 4)
           26.37%  e2fsck  e2fsck              [.] rb_insert_extent
           13.54%  e2fsck  e2fsck              [.] ext2fs_rb_next
            9.72%  e2fsck  libc-2.31.so        [.] _int_free
            7.83%  e2fsck  libc-2.31.so        [.] malloc
            7.45%  e2fsck  e2fsck              [.] rb_test_clear_bmap_extent
            6.46%  e2fsck  e2fsck              [.] rb_test_bmap
            4.60%  e2fsck  libpthread-2.31.so  [.] __pthread_rwlock_rdlock
            4.39%  e2fsck  libpthread-2.31.so  [.] __pthread_rwlock_unlock

    b. But with above disk setup (4 dm-delay with raid0), ~36% to 3x performance improvement is observed when the
	   delay is within the range of [1ms - 500ms] (for every read/write).
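
For reference, the dm-delay tables for setup 4 can be generated roughly as below.
This is only a sketch under assumptions: the loop device names (/dev/loop0..3), the
helper name delay_table and the 1ms delay are illustrative and not taken from my
actual setup scripts. It only prints the table lines and does not create any devices.

```shell
#!/bin/sh
# Sketch only: build the device-mapper table line for a "delay" target.
# Arguments (hypothetical example values): size in 512-byte sectors,
# backing device, and the delay in milliseconds.
delay_table() {
    sectors=$1; dev=$2; delay_ms=$3
    # Table format: <start> <length> delay <device> <offset> <delay_ms>
    # With a single device/offset/delay triple, the delay applies to both
    # reads and writes.
    echo "0 $sectors delay $dev 0 $delay_ms"
}

# 50G expressed in 512-byte sectors.
sectors=$((50 * 1024 * 1024 * 1024 / 512))

# Print the four table lines for a 1ms delay; nothing is created here.
for i in 0 1 2 3; do
    delay_table "$sectors" "/dev/loop$i" 1
done

# As root, the lines could then be applied roughly like this (not run here):
#   dmsetup create delay$i --table "$(delay_table "$sectors" /dev/loop$i 1)"
#   mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/mapper/delay[0-3]
```

Changing only the last argument of delay_table is enough to sweep the delay range
mentioned in 4.b.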

Now, I understand we might say that the benefits of parallel fsck mostly show up when there is parallel I/O,
because otherwise pfsck adds some extra overhead due to thread spawning, allocating per-thread
structures and the merge logic. But should that account for a significant perf degradation in such a fragmented-files use case?

From my observations so far, in case-4.a) most of the time is spent merging the block_found_map bitmap.
On measuring some stats while testing with -m 1 (i.e. only thread-0), I see e2fsck_pass1_merge_context() alone
taking 18sec out of 32sec (the total time for pass-1).

<stats log>
============
[Thread 0] Scanned group range [0, 1599), inodes 169076
e2fsck_pass1_merge_context [0]: bg range [0, 1599] elapsed time: 18.580 count=25573571
elapsed time: 32.863

"count" in the above stat is the total no. of extent entries found in thread_ctx->block_found_map
(measured by adding an rb_count_bmap() function). Since there is only one thread here, it is also the total no.
of extent entries. The above data is shown with "-m 1" just to show the exact entry count.
The performance is degraded with "-m 4" as well.
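
To put those numbers in proportion, here is a small sketch of the arithmetic using
the values from the stats log above. The helper names merge_share and ns_per_entry
are made up for illustration, and the per-entry figure is only an average over the
whole merge, not a measured per-insert cost.

```shell
#!/bin/sh
# Figures copied from the stats log above.
merge_secs=18.580
pass1_secs=32.863
entries=25573571

# Share of pass-1 wall time spent in e2fsck_pass1_merge_context(), in percent.
merge_share() {
    awk -v m="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * m / t }'
}

# Average merge cost per extent entry, in nanoseconds.
ns_per_entry() {
    awk -v m="$1" -v n="$2" 'BEGIN { printf "%.0f\n", 1e9 * m / n }'
}

echo "merge share of pass-1: $(merge_share "$merge_secs" "$pass1_secs")%"
echo "merge cost per entry:  $(ns_per_entry "$merge_secs" "$entries") ns"
```

So the merge alone is a bit more than half of pass-1, at well under a microsecond
per extent entry on average. The cost comes from the sheer number of entries.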

I have also tested this on raid0 using 2 HDDs, and perf degradation was observed there too.
(I don't have the exact data handy, but I can collect it again if needed.)
AFAIK, it was definitely a significant reduction in perf numbers.

So I was wondering if this is a known limitation around pfsck and if it has popped up in any of your tests too.
Also please do let me know if I have missed anything obvious here?

In some of my earlier testing, I had tested with Lustre e2fsprogs (master-pfsck branch) and had similar observations
to those mentioned above. But recently all my tests have been based on the following tree[2] (with patch[3] included).
I have these setups available, so if anything needs to be tested from my end, I can do that.

References
============
[1]: https://lore.kernel.org/all/YMN10sXgoTR%2FIPxr@mit.edu/
[2]: https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/log/?h=pfsck
[3]: https://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git/commit/?id=699di448eee4b991acafaae4e4f8222be332d6837


Thanks for your help!!
-ritesh

--

<script1.sh>
============
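# Assumes $MNT is set to the mount point of the ext4 FS under test.
fragmented_filesize=$((10 * 1024 * 1024 * 1024))
dir_cnt=0
while [ $dir_cnt -lt 8192 ]; do
   mkdir "$MNT/n$dir_cnt" || break
   inode_cnt=0
   while [ $inode_cnt -lt 8192 ]; do
       if [ $inode_cnt -eq 0 ]; then
           # First file in each directory is preallocated; script2.sh fragments it.
           xfs_io -fc "falloc 0 $fragmented_filesize" "$MNT/n$dir_cnt/n$inode_cnt"
       else
           # All remaining files are empty (0 byte) inodes.
           touch "$MNT/n$dir_cnt/n$inode_cnt" || break
       fi
       inode_cnt=$((inode_cnt+1))
   done
   dir_cnt=$((dir_cnt+1))
done
exit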

<script2.sh>
==============
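# Assumes $MNT is set as in script1.sh and $XFSTESTS_PATH points at a built xfstests tree.
dir_cnt=0
while [ $dir_cnt -lt 8192 ]; do
    # Only file n0 in each directory was preallocated by script1.sh.
    inode_cnt=0
    # Punch out alternating blocks so that the file ends up with lots of extents.
    "$XFSTESTS_PATH/src/punch-alternating" "$MNT/n$dir_cnt/n$inode_cnt"
    dir_cnt=$((dir_cnt+1))
done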
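
To sanity-check that the layout really is fragmented, the extent count of one of the
punched files can be read with filefrag, roughly as sketched below. The helper name
extents_from_summary and the sample input line are made up for illustration. The
parser assumes filefrag's usual one-line summary of the form
"<path>: <N> extents found".

```shell
#!/bin/sh
# Extract the extent count from one line of filefrag summary output,
# which looks like "<path>: <N> extents found" (or "1 extent found").
extents_from_summary() {
    awk '{ print $(NF - 2) }'
}

# Hypothetical usage on the first fragmented file created by script1.sh:
#   filefrag "$MNT/n0/n0" | extents_from_summary

# Illustrative input only (not real output from my setup):
echo "example/n0: 3 extents found" | extents_from_summary
```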