Message-ID: <20250928210231.GB274922@mit.edu>
Date: Sun, 28 Sep 2025 17:02:31 -0400
From: "Theodore Ts'o" <tytso@....edu>
To: Julian Sun <sunjunchao2870@...il.com>
Cc: Harshad Shirwadkar <harshadshirwadkar@...il.com>, adilger@...ger.ca,
        jack@...e.cz, Ext4 Developers List <linux-ext4@...r.kernel.org>
Subject: Re: [PATCH v2 3/3] ext4: reimplement ext4_empty_dir() using
 is_dirent_block_empty

On Sun, Sep 28, 2025 at 02:51:09PM +0800, Julian Sun wrote:
> Emm. I checked the code and found that support for 3-level htrees was
> added in 2017 via commit e08ac99fa2a2 ("ext4: add largedir feature"),
> but this patch was submitted in 2020. Did I make a mistake somewhere?

I made that statement when I tried forward porting Harshad's patches
to 6.17-rc4, and one of the patch conflicts seemed to imply that
dx_probe predated the 3-level htree change.  But I could have been
mistaken about why the patch conflict exists.

> > There are
> > also some hardening against maliciously fuzzed file systems that will
> > prevent the patches from applying cleanly.
> 
> Is this included in the xfstests test suite?

For better or for worse, no.  The reason for this is that xfstests
tends to avoid using pre-generated file systems because they aren't
portable to different file system types.  And in order to test
deliberately corrupted file systems, you generally need to include
small corrupted file system images in xfstests.  You *could* try to
create them as an ext4-specific test using debugfs to introduce the
corruption, but that's quite painful, and in some cases the only way
to corrupt the file system in that very specific way is with a hex
editor.

That being said, it's generally pretty obvious when bullet-proofing
code has been added and it causes a merge conflict.  For example, I
skipped forward-porting the third patch, "ext4: reimplement
ext4_empty_dir() using is_dirent_block_empty", because there was some
bullet-proofing code added in ext4_empty_dir() which caused the patch
application to fail, I was too tired to deal with it, and strictly
speaking it wasn't necessary for the directory shrinking
functionality.  That being said, the bullet-proofing which was added
to ext4_empty_dir() *should* be added to the newly created
is_dirent_block_empty() function introduced in the 2/3 patch in this
series.

> > Then we'd need to run regression tests on a variety of different ext4
> > configurations to see if there are any regressions, and if so, they
> > would need to be fixed.
> 
> Is testing with xfstests sufficient? Or are there any other test
> suites that can be used to test this patch set?

Testing with xfstests is sufficient, but we need to test multiple
file system configurations, not just the default "ext4 using a 4k
blocksize".  So I tried to do a quick-and-dirty forward port of the
patches, and they compiled and passed the 15-20 minute smoke test
(running "kvm-xfstests smoke" or "gce-xfstests smoke").  But when I
ran the full set of tests, this is what I found.

ext4/4k: 594 tests, 4 failures, 65 skipped, 5196 seconds
  Failures: generic/426 generic/756
  Flaky: generic/041: 20% (1/5)   generic/049: 20% (1/5)
ext4/1k: 588 tests, 3 failures, 69 skipped, 6306 seconds
  Failures: generic/426 generic/756
  Flaky: generic/043: 20% (1/5)
ext4/ext3: 586 tests, 3 failures, 149 skipped, 4958 seconds
  Flaky: generic/041: 40% (2/5)   generic/426: 60% (3/5)   
    generic/756: 60% (3/5)
ext4/encrypt: 569 tests, 3 failures, 181 skipped, 3263 seconds
  Failures: generic/426 generic/756
  Flaky: generic/049: 40% (2/5)
ext4/nojournal: 586 tests, 3 failures, 137 skipped, 3918 seconds
  Failures: ext4/045 generic/426 generic/756
ext4/ext3conv: 591 tests, 4 failures, 67 skipped, 5408 seconds
  Failures: generic/426 generic/756
  Flaky: ext4/045: 40% (2/5)   generic/040: 60% (3/5)
ext4/adv: 587 tests, 10 failures, 73 skipped, 5487 seconds
  Failures: [generic/347] generic/426 generic/756 [generic/757] [generic/764]
    [generic/777]
  Flaky: generic/047: 40% (2/5)   generic/475: 20% (1/5)   
    [generic/482: 40% (2/5)]   generic/547: 20% (1/5)
ext4/dioread_nolock: 592 tests, 3 failures, 65 skipped, 4996 seconds
  Failures: generic/426 generic/756
  Flaky: generic/047: 40% (2/5)
ext4/data_journal: 587 tests, 6 failures, 141 skipped, 5151 seconds
  Failures: [generic/127] generic/426 generic/756
  Flaky: generic/049: 60% (3/5)   generic/475: 60% (3/5)   
    generic/645: 40% (2/5)
ext4/bigalloc_4k: 565 tests, 4 failures, 68 skipped, 5066 seconds
  Failures: ext4/045
  Flaky: generic/320: 60% (3/5)   generic/426: 60% (3/5)   
    generic/756: 60% (3/5)
ext4/bigalloc_1k: 566 tests, 3 failures, 79 skipped, 5096 seconds
  Failures: generic/426 generic/756
  Flaky: generic/049: 20% (1/5)
ext4/dax: 578 tests, 5 failures, 170 skipped, 3198 seconds
  Failures: [generic/344] [generic/363] generic/426 generic/756
  Flaky: generic/043: 20% (1/5)
Totals: 7193 tests, 1264 skipped, 190 failures, 0 errors, 53862s

(Tests in square brackets were failing with stock 6.17-rc4, so don't
represent regressions.  Ignore them for the purposes of trying to fix
up this patch set.)

Now, some of these may be failures caused by my making a mistake when
doing the forward port.  Basically, I bashed the patches until they
built; then I ran the smoke test; and then I kicked off the full
test run, which takes about 2.5 hours using a dozen VMs.

I don't have time to take this any further, so if you're OK with my
handing the finishing of this project off to you, I'll send the
patches as a reply to this message, and then "Your mission, should
you choose to accept it" (quoting from the "Mission Impossible" TV
show / movie) would be:

1) Investigate the failures detailed above, and fix them.  Again, some
of these failures may not be Harshad's; they could be mine from when I
did the forward port.  Either way, we need to fix them before we can
merge the changes upstream.

2) Do more cleanups on the patches as necessary.  While doing the
forward port, I noted the following issues that should really be
fixed.  You might find more things that need improvement; if so,
please go for it!

2a) Although we are potentially modifying more metadata blocks in
ext4_try_dir_shrink(), the number of journal credits requested by
callers of ext4_delete_entry() was not increased.  This could result
in the handle running out of credits, which will trigger a warning,
and if the transaction runs out of space, could trigger a journal
abort.  One way to fix this would simply be to increase the credits
passed to ext4_journal_start().
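
Very roughly, option one would look something like this at the
handle start in the unlink/rmdir paths (EXT4_DIR_SHRINK_CREDITS
doesn't exist; it's a placeholder for however many extra blocks the
shrink can dirty in the worst case):

	/*
	 * Sketch only: reserve extra credits up front so that a later
	 * call to ext4_try_dir_shrink() can modify additional
	 * directory blocks within the same handle.
	 */
	handle = ext4_journal_start(dir, EXT4_HT_DIR,
				    EXT4_DATA_TRANS_BLOCKS(dir->i_sb) +
				    EXT4_DIR_SHRINK_CREDITS);
	if (IS_ERR(handle))
		return PTR_ERR(handle);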

The other way would be to queue up the work and do the directory
shrinking in a workqueue, where the changes would be done in a
completely separate journal handle.  This has the advantage that it
avoids increasing the latency of the unlink() system call, since there
is no reason why the change needs to happen as part of the system
call, or in the same transaction as the unlink.
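
If we go the workqueue route, the shape would be something like the
following; the ext4_dir_shrink_work structure, ext4_try_dir_shrink(),
and the credit count are all assumptions on my part, just to
illustrate the separate-handle idea:

	struct ext4_dir_shrink_work {
		struct work_struct	work;
		struct inode		*dir;	/* pinned with ihold() */
	};

	static void ext4_dir_shrink_workfn(struct work_struct *work)
	{
		struct ext4_dir_shrink_work *dsw =
			container_of(work, struct ext4_dir_shrink_work, work);
		handle_t *handle;

		/* Separate handle, so unlink() latency is unaffected. */
		handle = ext4_journal_start(dsw->dir, EXT4_HT_DIR,
				EXT4_DATA_TRANS_BLOCKS(dsw->dir->i_sb));
		if (!IS_ERR(handle)) {
			ext4_try_dir_shrink(handle, dsw->dir);
			ext4_journal_stop(handle);
		}
		iput(dsw->dir);
		kfree(dsw);
	}

	/* Called from the unlink/rmdir path instead of shrinking inline. */
	static void ext4_queue_dir_shrink(struct inode *dir)
	{
		struct ext4_dir_shrink_work *dsw =
			kzalloc(sizeof(*dsw), GFP_NOFS);

		if (!dsw)
			return;	/* best effort; skipping the shrink is fine */
		ihold(dir);
		dsw->dir = dir;
		INIT_WORK(&dsw->work, ext4_dir_shrink_workfn);
		schedule_work(&dsw->work);
	}

A real version would also have to worry about unmount, freezing, and
the inode going away, but that's the basic idea.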

2b) Error checking is missing in some of the newly added code.  In
particular, the function calls in make_unindexed() are ignoring error
returns.  More generally, while we do the directory shrink, we need to
check for potential problems; and if there are any failures, we need
to leave the directory unchanged, possibly calling ext4_error() if
file system corruption is discovered, or just skipping the directory
shrink operation if it is some transient error --- for example, a
memory allocation failure.
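
In other words, the shrink path should end up with a shape roughly
like this; ext4_try_dir_shrink() and do_one_shrink_step() are
stand-ins, and only the error-handling pattern is the point:

	static int ext4_try_dir_shrink(handle_t *handle, struct inode *dir)
	{
		int err;

		err = do_one_shrink_step(handle, dir);	/* stand-in */
		if (err == -EFSCORRUPTED)
			ext4_error(dir->i_sb,
				   "corrupt directory (inode %lu) found while shrinking",
				   dir->i_ino);
		/*
		 * On *any* error --- corruption, ENOMEM, I/O --- just
		 * bail out and leave the directory as it was; shrinking
		 * is an optimization, not a correctness requirement.
		 */
		return err;
	}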

2c) As mentioned above, I skipped forward porting the 3rd patch in
the series, because upstream ext4_empty_dir() has gained code that
rigorously checks for inconsistent file system corruption (lest we
trigger a BUG or WARN or worse), and that code is what is causing the
patch application to fail.  That checking needs to be added to
is_dirent_block_empty(), and while you're at it, please check all of
the newly added code to make sure it won't misbehave if the file
system structures have been corrupted in a particularly inventive
way.
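
For (2c), the kind of checking I mean looks roughly like this.  The
signature of is_dirent_block_empty() here is a guess, and this
ignores the "." and ".." special cases that the real function has to
deal with, but it shows validating every entry before trusting its
rec_len:

	static bool is_dirent_block_empty(struct inode *dir,
					  struct buffer_head *bh)
	{
		unsigned int blocksize = dir->i_sb->s_blocksize;
		unsigned int offset = 0;
		struct ext4_dir_entry_2 *de;

		while (offset < blocksize) {
			de = (struct ext4_dir_entry_2 *)(bh->b_data + offset);
			/*
			 * Reject bogus rec_len/name_len/inode fields
			 * instead of looping forever or walking off the
			 * end of the buffer; ext4_check_dir_entry() also
			 * reports the corruption.
			 */
			if (ext4_check_dir_entry(dir, NULL, de, bh, bh->b_data,
						 blocksize, offset))
				return false;	/* corruption: not "empty" */
			if (le32_to_cpu(de->inode))
				return false;	/* live entry found */
			offset += ext4_rec_len_from_disk(de->rec_len,
							 blocksize);
		}
		return true;
	}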

2d) Once (2c) is done, forward port patch #3.

3) As discussed in this thread, once these patches have landed, there
is further work to merge leaf nodes, to remove empty htree nodes, and
to shorten the htree depth when necessary.

> > Also, please note that this first set of changes doesn't really make a
> > big difference for real-world use cases, since a directory block
> > won't get dropped when it is completely empty....
> 
> Yes, I think the biggest beneficiary is rm -rf-type workloads.

The thing about rm -rf workloads is that if you eventually rmdir() the
directory, all of the blocks will be released anyway.  That's probably
why this feature is one that hasn't been high priority.  If you delete
75% of the files in a directory, you *could* do something like "mkdir
foo.new ; mv foo/* foo.new ; rmdir foo ; mv foo.new foo", but of
course that won't work if some other process is actively accessing the
files when you are trying to do the "DIY directory shrink operation".

Cheers,

					- Ted

