linux-ext4 - Re: How capacious and well-indexed are ext4, xfs and btrfs directories?


Message-Id: <B70B57ED-6F11-45CC-B99F-86BBDE36ACA4@dilger.ca>
Date:   Tue, 25 May 2021 15:13:52 -0600
From:   Andreas Dilger <adilger@...ger.ca>
To:     Josh Triplett <josh@...htriplett.org>
Cc:     David Howells <dhowells@...hat.com>, Theodore Ts'o <tytso@....edu>,
        "Darrick J. Wong" <djwong@...nel.org>, Chris Mason <clm@...com>,
        Ext4 Developers List <linux-ext4@...r.kernel.org>,
        xfs <linux-xfs@...r.kernel.org>,
        linux-btrfs <linux-btrfs@...r.kernel.org>,
        linux-cachefs@...hat.com,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        NeilBrown <neilb@...e.com>
Subject: Re: How capacious and well-indexed are ext4, xfs and btrfs
 directories?

On May 22, 2021, at 11:51 PM, Josh Triplett <josh@...htriplett.org> wrote:
> 
> On Thu, May 20, 2021 at 11:13:28PM -0600, Andreas Dilger wrote:
>> On May 17, 2021, at 9:06 AM, David Howells <dhowells@...hat.com> wrote:
>>> With filesystems like ext4, xfs and btrfs, what are the limits on directory
>>> capacity, and how well are they indexed?
>>> 
>>> The reason I ask is that inside of cachefiles, I insert fanout directories
>>> inside index directories to divide up the space for ext2 to cope with the
>>> limits on directory sizes and that it did linear searches (IIRC).
>>> 
>>> For some applications, I need to be able to cache over 1M entries (render
>>> farm) and even a kernel tree has over 100k.
>>> 
>>> What I'd like to do is remove the fanout directories, so that for each logical
>>> "volume"[*] I have a single directory with all the files in it.  But that
>>> means sticking massive amounts of entries into a single directory and hoping
>>> it (a) isn't too slow and (b) doesn't hit the capacity limit.
>> 
>> Ext4 can comfortably handle ~12M entries in a single directory, if the
>> filenames are not too long (e.g. 32 bytes or so).  With the "large_dir"
>> feature (since 4.13, but not enabled by default) a single directory can
>> hold around 4B entries, basically all the inodes of a filesystem.
> 
> ext4 definitely seems to be able to handle it. I've seen bottlenecks in
> other parts of the storage stack, though.
> 
> With a normal NVMe drive, a dm-crypt volume containing ext4, and discard
> enabled (on both ext4 and dm-crypt), I've seen rm -r of a directory with
> a few million entries (each pointing to a ~4-8k file) take the better
> part of an hour, almost all of it system time in iowait. Also makes any
> other concurrent disk writes hang, even a simple "touch x". Turning off
> discard speeds it up by several orders of magnitude.
> 
> (I don't know if this is a known issue or not, so here are the details
> just in case it isn't. Also, if this is already fixed in a newer kernel,
> my apologies for the outdated report.)

"-o discard" is definitely known to have a measurable performance impact,
simply because it sends many more requests to the block device, and those
requests can be slow or block the queue, depending on how the underlying
storage behaves.
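
To illustrate why the request count matters, here is a toy model (not ext4
code) that compares one discard per freed extent with discards issued after
merging adjacent extents. The function name and the workload of 1000
back-to-back two-block files are assumptions made up for the example.

```python
def coalesce_extents(extents):
    """Merge (start, length) extents that touch or overlap."""
    merged = []
    for start, length in sorted(extents):
        if merged and start <= merged[-1][0] + merged[-1][1]:
            # Extent touches or overlaps the previous one: extend it.
            prev_start, prev_len = merged[-1]
            end = max(prev_start + prev_len, start + length)
            merged[-1] = (prev_start, end - prev_start)
        else:
            merged.append((start, length))
    return merged


# Hypothetical workload: 1000 small files, two blocks each, laid out
# back to back and all freed by one recursive delete.
freed = [(i * 2, 2) for i in range(1000)]

per_extent_requests = len(freed)                    # one discard per extent
batched_requests = len(coalesce_extents(freed))     # discards after merging

print(per_extent_requests, batched_requests)
```

Real allocations are rarely this contiguous, so the actual saving from
batching depends on the on-disk layout.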

A patch series that targets "-o discard" performance was posted recently:
https://patchwork.ozlabs.org/project/linux-ext4/list/?series=244091
It needs a bit more work, but it may be worth testing to see whether it
improves your workload, which would also help put some weight behind
landing it.

Another proposal was made to change "-o discard" from "track every freed
block and submit TRIM" to "(persistently) track modified block groups and
submit background TRIM like fstrim for the whole group".  One advantage
of tracking the whole block group is that block group state is already
maintained in the kernel and persistently on disk.  This also provides a
middle way between "immediate TRIM" that may not cover a whole erase block
when it is run, and "very lazy fstrim" that aggregates all free blocks in
a group but only happens when fstrim is run (from occasionally to never).
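
A minimal sketch of the "track modified block groups" idea, as a userspace
model rather than kernel code. The class and method names are hypothetical,
and 32768 blocks per group is only an assumed value for the example. A real
implementation would keep this state in the existing block group structures
and would trim only the free extents inside each flagged group.

```python
class GroupTrimTracker:
    """Remember which block groups had blocks freed since the last trim."""

    def __init__(self, blocks_per_group=32768):
        self.blocks_per_group = blocks_per_group
        self.dirty_groups = set()

    def mark_freed(self, block, count=1):
        # Flag every group touched by the freed range instead of
        # recording (and discarding) each freed block individually.
        first = block // self.blocks_per_group
        last = (block + count - 1) // self.blocks_per_group
        self.dirty_groups.update(range(first, last + 1))

    def background_trim(self):
        # Return the (start_block, length) range of each flagged group,
        # i.e. the ranges a background fstrim-like pass would scan,
        # then reset the tracking state.
        ranges = [
            (g * self.blocks_per_group, self.blocks_per_group)
            for g in sorted(self.dirty_groups)
        ]
        self.dirty_groups.clear()
        return ranges


tracker = GroupTrimTracker(blocks_per_group=100)
tracker.mark_freed(5)
tracker.mark_freed(7)          # same group as block 5, no extra work
tracker.mark_freed(250, 60)    # spans groups 2 and 3
pending = tracker.background_trim()
print(pending)
```

Many frees in the same group collapse into a single pending entry, which is
the point of tracking at group granularity.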

The in-kernel discard+fstrim handling could be smarter than "run every day
from cron" because it can know whether the filesystem is busy, how much
data has been written and freed, and when a block group has enough free
space that submitting a TRIM for the group is actually worthwhile.
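
One way such a policy could look, purely as an illustration: the inputs
mirror the three signals named above (filesystem busy, amount freed, free
space in the group), but the function, the field names and both thresholds
are assumptions and do not come from any posted patch.

```python
from dataclasses import dataclass


@dataclass
class GroupState:
    free_blocks: int        # blocks currently free in the group
    freed_since_trim: int   # blocks freed since the group was last trimmed


def should_trim(group, fs_busy, blocks_per_group=32768,
                min_free_fraction=0.25, min_freed_blocks=256):
    """Decide whether a background TRIM of this group is worthwhile."""
    if fs_busy:
        # Defer while the filesystem is busy with foreground I/O.
        return False
    if group.freed_since_trim < min_freed_blocks:
        # Too little was freed since the last trim to justify another.
        return False
    # Only trim groups with a significant amount of free space.
    return group.free_blocks >= min_free_fraction * blocks_per_group


mostly_free = GroupState(free_blocks=20000, freed_since_trim=5000)
nearly_full = GroupState(free_blocks=300, freed_since_trim=300)

print(should_trim(mostly_free, fs_busy=False))
print(should_trim(mostly_free, fs_busy=True))
print(should_trim(nearly_full, fs_busy=False))
```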

The start of that work was posted for discussion on linux-ext4:
https://marc.info/?l=linux-ext4&m=159283169109297&w=4
but the discussion ended up focused on the semantics of whether TRIM needs
to obey the requested boundaries for security reasons.

Cheers, Andreas
