Message-ID: <MN2PR11MB45667C6E534F7944BFA77684DB550@MN2PR11MB4566.namprd11.prod.outlook.com>
Date:   Thu, 27 Aug 2020 12:09:21 +0000
From:   "James Scriven (jamscriv)" <jamscriv@...co.com>
To:     "linux-ext4@...r.kernel.org" <linux-ext4@...r.kernel.org>
Subject: Performance issue with recently_deleted() /no_journal with huge
 directories

Hi, I'm working on migrating a workload from kernel 2.6 to 4.18 (RHEL6 to RHEL8).

The use case is a build farm that has a basic workflow of:

1) rm -rf a large directory tree (about 2M files ~ 200GB) to free some space
2) download and extract a large tarball (about 2M files ~ 200GB)
3) perform a build in the extracted directory tree
Repeat...

We've been using an ext4 filesystem with no journal for maximum performance, with great success. We're not very concerned about losing data, but we do want some persistence, which is why we don't just use tmpfs for this. We keep a number of these large workspaces around as long as space permits, and delete the oldest (step 1) just before starting a new one (step 2).

After migrating to the newer kernel, we see a performance degradation when we extract the tarball, which I suspect is caused by inode allocation trying to find an unused inode that has not been deleted too recently. Since we have 2M inodes that *have* been recently deleted, every one of the 2M new inodes has to search through all 2M of the deleted ones (or something close to that; my understanding of the ext4 code is limited).
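To make the suspicion above concrete, here is a toy model in bash. It is not ext4 code and makes no claim about how the kernel actually scans its bitmaps. It only assumes that every allocation starts at the beginning of a table and has to check each "recently deleted" slot it passes. The function name `count_recent_checks` and the slot states are made up for illustration.

```shell
#!/usr/bin/env bash
# Toy model only, NOT ext4 code: count how many "recently deleted" slots
# are checked while allocating NEW slots from a table whose first RECENT
# slots are free but still marked as recently deleted.

# Usage: count_recent_checks RECENT NEW
count_recent_checks() {
    local recent=$1 new=$2
    local -a state=()      # r = recently deleted, u = in use, unset = free
    local checks=0 i slot

    for ((i = 0; i < recent; i++)); do
        state[i]=r
    done

    for ((i = 0; i < new; i++)); do
        slot=0
        # Walk from the start until a slot is found that is neither
        # in use nor recently deleted.
        while [[ -n ${state[slot]:-} ]]; do
            # Assumption of this model: only recently deleted slots
            # cost a check; slots in use are skipped for free.
            if [[ ${state[slot]} == r ]]; then
                checks=$((checks + 1))
            fi
            slot=$((slot + 1))
        done
        state[slot]=u
    done

    echo "$checks"
}
```

Under these assumptions the count is simply RECENT times NEW: `count_recent_checks 0 10` gives 0, while `count_recent_checks 20 10` gives 200. If the real allocator behaves anything like this, 2M creations against 2M recently deleted inodes would add up quickly.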

The simple test case below shows the issue. My question is: is this edge case already understood? Is there a good way to regain the lost performance? Adding a "sync + drop_caches", or a sufficiently long sleep, between steps 1 and 2 works around the issue, but is not ideal.
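For reference, the "sync + drop_caches" workaround between steps 1 and 2 could look roughly like the sketch below. The helper names `flush_caches` and `run_iteration` are invented for this example, and the `mkdir` batch merely stands in for the tar extraction.

```shell
#!/usr/bin/env bash
# Sketch of the workaround: flush between the delete and the create phase.
# flush_caches and run_iteration are hypothetical helper names.

flush_caches() {
    sync
    # Writing to drop_caches needs root; skip quietly when not permitted.
    if [ -w /proc/sys/vm/drop_caches ]; then
        { echo 3 > /proc/sys/vm/drop_caches; } 2>/dev/null || true
    fi
}

# Usage: run_iteration DIR COUNT
run_iteration() {
    local dir=$1 count=$2
    rm -rf "$dir"        # step 1: delete the old tree
    flush_caches         # workaround: sync and drop caches before reuse
    mkdir "$dir"         # step 2: recreate (stands in for the tar extract)
    # xargs batches the names so only a few mkdir processes are started.
    (cd "$dir" && seq 1 "$count" | xargs mkdir)
}
```

Calling `run_iteration dirtree 50000` repeatedly (as root, so that the drop_caches write takes effect) would mirror the loop in the test case with the workaround applied.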

# With each iteration of the test case, the number of recently deleted inodes increases and performance degrades.

$ uname -a
Linux sjc-asr-bm-470 4.18.0-147.3.1.el8_1.x86_64 #1 SMP Wed Nov 27 01:11:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ sync; echo 3 | sudo tee /proc/sys/vm/drop_caches; for x in {1..10}; do rm -rf dirtree; mkdir dirtree; time mkdir dirtree/{1..50000}; done
3

real    0m1.796s
user    0m0.041s
sys     0m1.528s

real    0m3.280s
user    0m0.035s
sys     0m3.235s

real    0m4.329s
user    0m0.035s
sys     0m4.279s

real    0m6.033s
user    0m0.032s
sys     0m5.988s

real    0m7.303s
user    0m0.041s
sys     0m7.246s

real    0m7.874s
user    0m0.032s
sys     0m7.826s

real    0m9.376s
user    0m0.036s
sys     0m9.323s

real    0m9.979s
user    0m0.052s
sys     0m9.910s

real    0m9.808s
user    0m0.037s
sys     0m9.749s

real    0m9.067s
user    0m0.038s
sys     0m9.011s

$ uname -a
Linux sjc-asr-bm-100 2.6.32-754.17.1.el6.x86_64 #1 SMP Thu Jun 20 11:47:12 EDT 2019 x86_64 x86_64 x86_64 GNU/Linux
$ sync; echo 3 | sudo tee /proc/sys/vm/drop_caches; for x in {1..10}; do rm -rf dirtree; mkdir dirtree; time mkdir dirtree/{1..50000}; done
3

real    0m0.724s
user    0m0.031s
sys     0m0.693s

real    0m0.762s
user    0m0.041s
sys     0m0.721s

real    0m0.717s
user    0m0.043s
sys     0m0.674s

real    0m0.712s
user    0m0.037s
sys     0m0.675s

real    0m0.749s
user    0m0.036s
sys     0m0.712s

real    0m0.710s
user    0m0.040s
sys     0m0.670s

real    0m0.746s
user    0m0.038s
sys     0m0.707s

real    0m0.715s
user    0m0.034s
sys     0m0.680s

real    0m0.747s
user    0m0.040s
sys     0m0.707s

real    0m0.732s
user    0m0.042s
sys     0m0.690s
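
To compare the two runs numerically, the `real` lines above can be reduced to plain seconds. The following is a small sketch; the function names `to_seconds` and `mean` are invented here, and it assumes the bash `time` format shown in the output (`0m1.796s`).

```shell
#!/usr/bin/env bash
# Convert the "real" lines of bash `time` output (read from stdin)
# into seconds, one value per line.
to_seconds() {
    awk '$1 == "real" {
        split($2, t, "m")          # "0m1.796s" -> t[1]="0", t[2]="1.796s"
        sub(/s$/, "", t[2])        # strip the trailing "s"
        printf "%.3f\n", t[1] * 60 + t[2]
    }'
}

# Average a column of numbers read from stdin.
mean() {
    awk '{ sum += $1 } END { if (NR > 0) printf "%.3f\n", sum / NR }'
}
```

With the output of a run saved to a file (say `run.log`, a name chosen only for this example), `to_seconds < run.log` lists the per-iteration times and `to_seconds < run.log | mean` averages them.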