Message-ID: <20120301143859.GX5054@shiny>
Date: Thu, 1 Mar 2012 09:38:59 -0500
From: Chris Mason <chris.mason@...cle.com>
To: Theodore Tso <tytso@....EDU>
Cc: Jacek Luczak <difrost.kernel@...il.com>,
linux-ext4@...r.kernel.org,
linux-fsdevel <linux-fsdevel@...r.kernel.org>,
LKML <linux-kernel@...r.kernel.org>, linux-btrfs@...r.kernel.org
Subject: Re: getdents - ext4 vs btrfs performance
On Wed, Feb 29, 2012 at 11:44:31PM -0500, Theodore Tso wrote:
> You might try sorting the entries returned by readdir by inode number before you stat them. This is a long-standing weakness in ext3/ext4, and it has to do with how we added hashed tree indexes to directories in (a) a backwards compatible way, that (b) was POSIX compliant with respect to adding and removing directory entries concurrently with reading all of the directory entries using readdir.
>
> You might try compiling spd_readdir from the e2fsprogs source tree (in the contrib directory):
>
> http://git.kernel.org/?p=fs/ext2/e2fsprogs.git;a=blob;f=contrib/spd_readdir.c;h=f89832cd7146a6f5313162255f057c5a754a4b84;hb=d9a5d37535794842358e1cfe4faa4a89804ed209
>
> … and then using that as a LD_PRELOAD, and see how that changes things.
>
> The short version is that we can't easily do this in the kernel since it's a problem that primarily shows up with very big directories, and using non-swappable kernel memory to store all of the directory entries and then sort them so they can be returned in inode number order just isn't practical. It is something which can be easily done in userspace, though, and a number of programs (including mutt for its Maildir support) do so, and it helps greatly for workloads where you are calling readdir() followed by something that needs to access the inode (e.g., stat or unlink).
>
For reading the files, the acp program I sent him tries to do something
similar. I had forgotten about spd_readdir, though. We should consider
hacking that into cp and tar.
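
For anyone who wants to try the idea without LD_PRELOAD, here is a
minimal sketch in Python of sorting directory entries by inode number
before stat'ing them. This is not spd_readdir itself, and the function
name stat_in_inode_order is made up for illustration. It assumes Linux,
where os.scandir() reports the inode number from the directory entry
without a separate stat call.

```python
import os


def stat_in_inode_order(path):
    """Return [(name, stat_result)] for the entries of `path`,
    calling stat on them in ascending inode-number order.

    Hypothetical helper for illustration only.
    """
    # Read the whole directory first. On Linux, entry.inode() comes
    # from the directory entry itself, so no stat happens yet.
    with os.scandir(path) as it:
        entries = [(entry.inode(), entry.name) for entry in it]

    # Sort by inode number so the stat calls walk the inode table
    # roughly sequentially instead of in hash (htree) order.
    entries.sort()

    results = []
    for _ino, name in entries:
        try:
            st = os.stat(os.path.join(path, name), follow_symlinks=False)
        except FileNotFoundError:
            # The entry was removed between readdir and stat; skip it.
            continue
        results.append((name, st))
    return results
```

The whole directory is held in memory before sorting, which is the cost
Ted mentions; in userspace that memory is swappable, so it is usually
acceptable.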
One interesting note is that the page cache used to help here. Picture two
tests:
A) time tar cf /dev/zero /home
and
cp -a /home /new_dir_in_new_fs
unmount / flush caches
B) time tar cf /dev/zero /new_dir_in_new_fs
On ext, the time for B used to be much lower than the time for A
because the files would get written back to disk in roughly htree order.
Based on Jacek's data, that isn't true anymore.
-chris