linux-ext4 - Re: [PATCH v2 3/3] ext4: reimplement ext4_empty_dir() using is_dirent_block

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <CAHB1NagobAcbdJ48TwnPhMauq5bh_f9SFAAcmp0YrsFfi5D+vw@mail.gmail.com>
Date: Mon, 29 Sep 2025 10:33:16 +0800
From: Julian Sun <sunjunchao2870@...il.com>
To: "Theodore Ts'o" <tytso@....edu>
Cc: Harshad Shirwadkar <harshadshirwadkar@...il.com>, adilger@...ger.ca, jack@...e.cz, 
	Ext4 Developers List <linux-ext4@...r.kernel.org>
Subject: Re: [PATCH v2 3/3] ext4: reimplement ext4_empty_dir() using is_dirent_block_empty

Hi,

I really appreciate the detailed explanation—your explanation made the
current state of this patch set much clearer to me.
I’ll follow up with the work as you suggested.

On Mon, Sep 29, 2025 at 5:02 AM Theodore Ts'o <tytso@....edu> wrote:
>
> On Sun, Sep 28, 2025 at 02:51:09PM +0800, Julian Sun wrote:
> > Emm. I checked the code and found that support for 3-level htrees was
> > added in 2017 via commit e08ac99fa2a2 ("ext4: add largedir feature"),
> > but this patch was submitted in 2020. Did I make a mistake somewhere?
>
> I made that statement when I tried forward porting Harshad's patches
> to 6.17-rc4, and one of the patch conflicts seemed to implyyy that
> dx_probe predated the3-level htree change.  But I could have been
> mistaken about why the patch conflict eixsts.
>
> > > There are
> > > also some hardening against maliciously fuzzed file systems that will
> > > prevent the patches from applying cleanly.
> >
> > Is this included in the xfstests test suite?
>
> For better or for worse, no.  The reason for this is that xfstests
> tends to avoid using pre-generated file systems becase they aren't
> portable to different file system types.  And in order to test
> deliberately corrupted, you generally need to include small corrupted
> file systems into xfstests.  You *could* try to create them as an
> ext4-specific test using debugfs to introduce the corruption, but
> that's quite painful, and in some cases the only way to corrupt the
> file system in that very specific way would require a hex editor.
>
> That being said, it's generally pretty obvious when bullet-proofing
> code has been added and it causes a merge conflict.  For example, I
> skipped forward-porting the third patch, "ext4: reimplement
> ext4_empty_dir() using is_dirent_block_empty" because there was some
> bullet-proofing code added in ext4_empty_dir() which caused the patch
> application to fail, and I was too third try to deal with it, and it
> strictly speaking wasn't necessary for the patch shrinking
> functionality.  That being said, the bullet proofing which was added
> to ext4_empty_dir(), *should* be added to the newly created
> i_dirent_block_empty() function introduced in the 2/3 patch in this
> series.
>
> > > Then we'd need to run regression tests on a variety of different ext4
> > > configurations to see if there are any regressions, and if so, they
> > > would need to be fixed.
> >
> > Is testing with xfstests sufficient? Or are there any other test
> > suites that can be used to test this patch set?
>
> Testing with xfstests are sufficient, but we need to test multiple
> file system configurations, but just the default "ext4 using a 4k
> blocksize".  So I tried to do a quick-and-dirty forward port of the
> patches, and they compiled and passed the 15-20 minute smoke test
> (running "kvm-xfstests smoke", or "gce-xfstests smoke").  But when I
> ran the full set of tests, this is what I found.
>
> ext4/4k: 594 tests, 4 failures, 65 skipped, 5196 seconds
>   Failures: generic/426 generic/756
>   Flaky: generic/041: 20% (1/5)   generic/049: 20% (1/5)
> ext4/1k: 588 tests, 3 failures, 69 skipped, 6306 seconds
>   Failures: generic/426 generic/756
>   Flaky: generic/043: 20% (1/5)
> ext4/ext3: 586 tests, 3 failures, 149 skipped, 4958 seconds
>   Flaky: generic/041: 40% (2/5)   generic/426: 60% (3/5)
>     generic/756: 60% (3/5)
> ext4/encrypt: 569 tests, 3 failures, 181 skipped, 3263 seconds
>   Failures: generic/426 generic/756
>   Flaky: generic/049: 40% (2/5)
> ext4/nojournal: 586 tests, 3 failures, 137 skipped, 3918 seconds
>   Failures: ext4/045 generic/426 generic/756
> ext4/ext3conv: 591 tests, 4 failures, 67 skipped, 5408 seconds
>   Failures: generic/426 generic/756
>   Flaky: ext4/045: 40% (2/5)   generic/040: 60% (3/5)
> ext4/adv: 587 tests, 10 failures, 73 skipped, 5487 seconds
>   Failures: [generic/347] generic/426 generic/756 [generic/757] [generic/764]
>     [generic/777]
>   Flaky: generic/047: 40% (2/5)   generic/475: 20% (1/5)
>     [generic/482: 40% (2/5)]   generic/547: 20% (1/5)
> ext4/dioread_nolock: 592 tests, 3 failures, 65 skipped, 4996 seconds
>   Failures: generic/426 generic/756
>   Flaky: generic/047: 40% (2/5)
> ext4/data_journal: 587 tests, 6 failures, 141 skipped, 5151 seconds
>   Failures: [generic/127] generic/426 generic/756
>   Flaky: generic/049: 60% (3/5)   generic/475: 60% (3/5)
>     generic/645: 40% (2/5)
> ext4/bigalloc_4k: 565 tests, 4 failures, 68 skipped, 5066 seconds
>   Failures: ext4/045
>   Flaky: generic/320: 60% (3/5)   generic/426: 60% (3/5)
>     generic/756: 60% (3/5)
> ext4/bigalloc_1k: 566 tests, 3 failures, 79 skipped, 5096 seconds
>   Failures: generic/426 generic/756
>   Flaky: generic/049: 20% (1/5)
> ext4/dax: 578 tests, 5 failures, 170 skipped, 3198 seconds
>   Failures: [generic/344] [generic/363] generic/426 generic/756
>   Flaky: generic/043: 20% (1/5)
> Totals: 7193 tests, 1264 skipped, 190 failures, 0 errors, 53862s
>
> (Tests in square brackets were failing with stock 6.17-rc4, so don't
> represent regressions.  Ignore them for the purposes of trying to fix
> up this patch set.)
>
> Now, some of thsee may be failures caused by my making a mistake when
> doing the forward port.  Basically, I bashed the patches until they
> built; and then I ran the smoke test; and then I kicked of the full
> test run, which takes about 2.5 hours using a dozen VM's.
>
> I don't have time to take this any further, so with your permission,
> if you're OK with my handing fiishing off this project to you, I'll
> send the patches as a reply to this message, and then "Your mission,
> should you use to accept it" (quoting from the "Mission Impossible" TV
> Show / Movie) would be:
>
> 1) Investigate the failures deailed above, and fix them.  Again, some
> of these failures may not be Harshads; it could be mine when I did the
> forward port.  Either way, we need to fix them before we can merge the
> changes upstream.
>
> 2) Do more cleanups on the patches as necessary.  While doing the
> forward port, I noted the following issues that should really be
> fixed.  You might find more things that need improvement; if so,
> please go for it!
>
> 2a) Although we are potentially modifying more metadata blocks in
> ext4_try_dir_shrink(), the number of journal credits requested by
> callers to ext4_delete_entry() were not increased.  This could result
> in the handle running out of credits, which will trigger a warning,
> and if the transaction runs out of space, could trigger a journal
> abort.  One way would simply be to increase the credits passed to
> ext4_journal_start().
>
> The other way would be to queue up the work and do the directory
> shrinking in workqueue, where the changes would be done in a
> completely separate journal handle.  This has the advantage that it
> avoids increasing the latency to the unlink() system call, since there
> is no reason why the change needs to happen as part of the system
> call, or in the same transaction as the unlink.
>
> 2b) Error checking is missing in some of the newly added code.  In
> particular the function calls in make_unindexed() are ignoring error
> returns.  More generally, while we do the directory shrnk, we need to
> check for potential problems; and if there are any failures, we need
> to leave the directory unchanged, possibly calling ext4_error() if
> file system corruption is dicovered, or just skipping the directory
> shirnk operation if it is some transient error --- for example, a
> memory allocation failure.
>
> 2c) As mentioned above, I skipped forward porting the 3rd patch
> series, and there is code to rigorously check for inconsistent file
> system corruptions lest we trigger a BUG or WARN or worse which is
> causing the patch application to fail.  That checking needs to be
> added to is_dirent_block_empty(), and while you're at it, please check
> all of the newly added code to make sure it won't misbehave if the
> file system structures have been corrupted in a particualrly inventive
> way.
>
> 2d)  Once (2c) is done forward port patch #3.
>
> 3) As discussed in this thread, once these patches are landed, the
> further work to merge leaf notes; to remove empty htree nodes; and to
> shorten the htree depth when necessary.
>
> > > Also, please note that this first set of changes doesn't really make a
> > > big difference for real-world use casses, since a directory block
> > > won't get dropped when it is completely empty....
> >
> > Yes, I think the biggest beneficiary is rm -rf-type workloads.
>
> The thing about rm -rf workloads is that if you eventually rmdir() the
> directory, all of the blocks will be released anyway.  That's probably
> why this feature is one that hasn't been high priority.  If you delete
> 75% of the files in a directory, you *could* do something like "mkdir
> foo.new ; mv foo/* foo.new ; rmdir foo ; mv foo.new foo", but of
> course that won't work if some other process is actively accessing the
> files when you are trying to do the "DIY directory shrink operation".
>
> Cheers,
>
>                                         - Ted
>

Thanks,
-- 
Julian Sun <sunjunchao2870@...il.com>