Message-ID: <alpine.LFD.2.00.1306271123000.1781@localhost.localdomain>
Date: Thu, 27 Jun 2013 11:24:19 +0200 (CEST)
From: Lukáš Czerner <lczerner@...hat.com>
To: Radek Pazdera <rpazdera@...hat.com>
cc: linux-ext4@...r.kernel.org, kasparek@....vutbr.cz
Subject: Re: [RFC 0/9] ext4: An Auxiliary Tree for the Directory Index
On Sat, 4 May 2013, Radek Pazdera wrote:
> Date: Sat, 4 May 2013 23:28:33 +0200
> From: Radek Pazdera <rpazdera@...hat.com>
> To: linux-ext4@...r.kernel.org
> Cc: lczerner@...hat.com, kasparek@....vutbr.cz,
> Radek Pazdera <rpazdera@...hat.com>
> Subject: [RFC 0/9] ext4: An Auxiliary Tree for the Directory Index
Hi Radek,
patches do not apply cleanly any more on the ext4 dev branch. Can
you rebase and resend the whole patch set?
Thanks!
-Lukas
>
> Hello everyone,
>
> I am a university student from Brno /CZE/. I decided to try to optimise
> the readdir/stat scenario in ext4 as my final project for school. I
> posted some test results I got a few months ago [1].
>
> I tried to implement an additional tree for ext4's directory index
> that would be sorted by inode numbers. The tree would then be used
> by ext4_readdir(), which should lead to a substantial increase in the
> performance of operations that manipulate a whole directory at once.
>
> The performance increase should be especially visible with large
> directories or under low memory or cache pressure.
>
> This patch series is what I've got so far. I must say, I originally
> thought it would be *much* simpler :).
>
> TLDR
> ====
>
> The series contains the implementation of the new tree and several
> rather small changes to the original directory index. I also added
> a new implementation of ext4_readdir() that uses this new tree
> instead of the original one.
>
> It applies on top of 3.9 and is basically working; however, it is
> highly experimental at the moment. It doesn't survive xfstests yet
> (I still need to work on that).
>
> DESIGN
> ======
>
> The tree comes with a new read-only compatible file system feature
> called "itree". It is created in a directory at the same time as the
> original dx tree -- when the directory file exceeds a single block in
> size. Its presence is indicated by an inode flag.
>
> The tree resides completely outside of the directory file. I am using
> 64bit block numbers (as suggested by Ted Ts'o), and the pointer to its
> root is stored at the end of the dx_root block.
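>
> In simplified form, the on-disk hookup looks roughly like this (an
> illustrative sketch only; the struct and field names here differ from
> the actual code in the patches):
>
>     /* Hypothetical tail of the dx_root block: the last few bytes
>      * carry the 64-bit physical block number of the itree root. */
>     struct dx_root_tail {
>             __le64  itree_root_blk;
>     };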
>
> I tried to keep the structure of the tree as close as possible to the
> original dx tree. It is a B+ tree with a 64-bit compound key. The
> key consists of an inode number and a hash.
>
> The hash is there because of hard links. On ext4 there can be as many
> as 65,000 names for a single file, which is, from the point of view of
> the inode tree, a collision. Creating over 60,000 names for a file
> within a single directory is a very unlikely scenario. But it would be
> a very serious problem for readdir() if someone tried to do that, so I
> decided to add the hash to the key as well. This reduces the collision
> problem to the same level as in the original hash tree.
>
> The depth of the tree is limited by a constant in the code to 3 levels,
> but it can be increased at any time.
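>
> In a simplified sketch (names differ from the actual code), the key
> and the depth limit look something like this, assuming the inode
> number goes into the high 32 bits so that entries sort primarily by
> inode number and only secondarily by the name hash:
>
>     #define ITREE_MAX_DEPTH 3
>
>     static inline __u64 itree_key(__u32 ino, __u32 hash)
>     {
>             return ((__u64)ino << 32) | hash;
>     }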
>
> I decided to keep the directory entries within the leaf nodes sorted,
> which might have been a bad idea (and it brought me quite a few
> sleepless nights of pointer debugging). However, it is more convenient
> for readdir and splits. And in the case of the inode tree, there
> shouldn't be much memmoving, because inodes are allocated linearly,
> so most of the time we're adding to the end anyway.
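>
> A rough sketch of the leaf insert (simplified to fixed-size entries;
> the real entries are variable-length and the names here differ from
> the actual code):
>
>     struct itree_leaf_entry {
>             __u64   key;    /* (ino << 32) | hash */
>             /* ... inode, rec_len, name in the real layout ... */
>     };
>
>     static void leaf_insert_sorted(struct itree_leaf_entry *e, int count,
>                                    const struct itree_leaf_entry *ent)
>     {
>             int pos = count;
>
>             /* Inodes are mostly allocated in increasing order, so this
>              * loop usually stops immediately and the insert is a plain
>              * append to the end of the leaf. */
>             while (pos > 0 && e[pos - 1].key > ent->key)
>                     pos--;
>
>             /* Make room and drop the new entry in. */
>             memmove(&e[pos + 1], &e[pos], (count - pos) * sizeof(*e));
>             e[pos] = *ent;
>     }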
>
> I also had to implement coalesce-on-delete. Unlike the hash values,
> the inode numbers are not random, so the tree would get fragmented
> really easily. For example, when a range of inodes is freed and
> reallocated somewhere else, that part of the tree would stay empty
> forever.
>
> In this implementation I used 8 bits for "node fullness". Neighbour
> nodes in the tree are merged as soon as their total fullness is smaller
> than maximum fullness. This way, coalescing happens too often. I plan
> to reduce it to only 2 bits (25% full, 50% full, 75% full, 100% full).
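>
> In sketch form (illustrative names, not the actual code), the current
> merge check is simply:
>
>     #define ITREE_FULLNESS_MAX 255  /* 8-bit fullness counter */
>
>     /* Merge two neighbouring nodes as soon as their combined contents
>      * would fit into a single node -- which, as noted above, triggers
>      * more often than it should. */
>     static bool should_coalesce(unsigned int left_full,
>                                 unsigned int right_full)
>     {
>             return left_full + right_full < ITREE_FULLNESS_MAX;
>     }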
>
> There are also some optimisations to increase the fullness of blocks
> within the tree, because if the file system adds a contiguous sequence
> of inodes to a directory and we split the nodes in half, there will be
> some tree nodes that will never get another entry and still be only 50%
> full. In the current implementation, an append is performed instead of
> a split when the new entry would be added to the end of a full
> block.
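>
> In sketch form, the check is roughly (illustrative, not the actual
> code):
>
>     /* If the new key sorts after everything in the full leaf, start a
>      * fresh leaf instead of splitting in half, so the old leaf stays
>      * 100% full when a directory is filled with sequential inodes. */
>     static bool append_instead_of_split(__u64 new_key, __u64 last_key)
>     {
>             return new_key > last_key;
>     }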
>
> BENCHMARKS
> ==========
>
> I did some benchmarks and compared the performance with ext4/htree,
> XFS, and btrfs with up to 5,000,000 files in a single directory. Not
> all of them are finished yet (they run for days).
>
> Probably the biggest improvement can be observed with copying files.
> I used two physical SATA disks for the test. For 5M files, the itree
> implementation is 14x faster. With metadata only (0B files), ext4 is
> even a tiny bit faster than XFS:
>
> http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/0B-files/cp.png
>
> With each file 4kB in size, the inode tree is 11x faster:
>
> http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/4096B-files/cp.png
>
> On the other hand, probably the biggest drawback of this implementation
> is that it slows down file creates, as we now have to insert the entry
> into both trees. The difference gets bigger as both trees grow (and their
> blocks get further apart). For 5M directory entries, the creation is
> roughly 40% slower. Hopefully, I'll be able to optimise it a bit in the
> future:
>
> http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/0B-files/create.png
> http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/4096B-files/create.png
>
> Deleting entries should also benefit greatly from this. However, the
> performance increase in my tests is far lower than expected. I think
> that is due to my implementation of coalesce-on-delete, which triggers
> too often, so during these massive deletes it coalesces all the time.
> I hope that when I fix that, it will get somewhere close to XFS as well:
>
> http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/0B-files/delete.png
> http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/4096B-files/delete.png
>
> All of the results have a graph and a table with precise values.
> You can always get the table by replacing the *.png suffix at the
> end of the link with *.results.
>
> Full results are available here:
> http://www.stud.fit.vutbr.cz/~xpazde00/soubory/ext4-5M/
>
>
> I also did some tests on an aged file system (using a simple scheme:
> 0.8 probability to create a file, 0.2 to delete one), where the
> results of ext4 with itree are much better even than XFS, which gets
> fragmented:
>
> http://www.stud.fit.vutbr.cz/~xpazde00/soubory/5M-dirty/cp.png
> http://www.stud.fit.vutbr.cz/~xpazde00/soubory/5M-dirty/readdir-stat.png
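>
> The aging loop was roughly the following (an illustrative sketch; the
> actual script and its parameters may differ):
>
>     #include <stdio.h>
>     #include <stdlib.h>
>     #include <unistd.h>
>
>     #define STEPS 1000000   /* number of create/delete steps (arbitrary here) */
>
>     int main(void)
>     {
>             char name[64];
>             long next = 0, i;
>
>             srand(42);
>             for (i = 0; i < STEPS; i++) {
>                     if (rand() % 10 < 8) {          /* 0.8: create a file */
>                             snprintf(name, sizeof(name), "f%ld", next++);
>                             FILE *f = fopen(name, "w");
>                             if (f)
>                                     fclose(f);
>                     } else if (next > 0) {          /* 0.2: delete a random one */
>                             snprintf(name, sizeof(name), "f%ld",
>                                      (long)(rand() % next));
>                             unlink(name);
>                     }
>             }
>             return 0;
>     }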
>
>
> There are still some things to be done: the checksums are not yet
> finished, and I certainly need to do a bit of cleaning up in some
> places. But as far as features go, it should all be there already.
>
> I'm working on this along with Lukas Czerner, who acts as my mentor
> and helps me with different things (big thanks!).
>
> Any feedback/ideas are greatly appreciated :)!
>
> Cheers,
> Radek
>
>
> Radek Pazdera (9):
> ext4: Adding itree feature and inode flags
> ext4: Allow sorting dx_map by inode as well
> ext4: Adding a link to itree to the dx_root struct
> ext4: Adding itree structures
> ext4: Adding itree implementation I - Core
> ext4: Adding itree implementation II - Inserting
> ext4: Adding itree implementation III - Deleting
> ext4: Make directory operations use itree
> ext4: Make ext4_readdir() use itree if available
>
> fs/ext4/dir.c | 177 +++-
> fs/ext4/ext4.h | 21 +-
> fs/ext4/namei.c | 2442 +++++++++++++++++++++++++++++++++++++++++++++++++++++--
> 3 files changed, 2565 insertions(+), 75 deletions(-)
>
>