linux-kernel - Re: [PATCH] Clustering indirect blocks in Ext3

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <d9885f0f0711170058p183b09b7g11ae93c1f11716db@mail.gmail.com>
Date:	Sat, 17 Nov 2007 00:58:21 -0800
From:	"Abhishek Rai" <abhishekrai@...gle.com>
To:	"Theodore Tso" <tytso@....edu>,
	"Abhishek Rai" <abhishekrai@...gle.com>,
	"Andrew Morton" <akpm@...ux-foundation.org>,
	"Andreas Dilger" <adilger@....com>, linux-kernel@...r.kernel.org,
	"Ken Chen" <kenchen@...gle.com>,
	"Mike Waychison" <mikew@...gle.com>
Subject: Re: [PATCH] Clustering indirect blocks in Ext3

Thanks for the comments.

On Nov 16, 2007 6:58 PM, Theodore Tso <tytso@....edu> wrote:
> On Fri, Nov 16, 2007 at 04:25:38PM -0800, Abhishek Rai wrote:
> > Ideally, this is how things should be done, but I feel in practice, it
> > will make little difference. To summarize, the difference between
> > my approach and above approach is that when out of free blocks in a
> > block group while allocating indirect block, the above approach repeats
> > the same allocation algorithm in the next block group, while I fully
> > fall back to old-style allocation meaning the indirect block gets
> > co-located with the data block in the next block group with a free
> > block.
>
> Well, also I suggested that if the metacluster region is full, that it
> attempt to find a block starting at end of the metacluster region and
> then wrap around, instead of starting at the beginning of the block
> group.  That way it's more likely that subsequent metadata block is
> nearer to the previous metadata blocks.

Ah ok. I think a generalization of this idea is that when we must mix
indirect blocks and data blocks, at least start the search for a free
block for them from different parts of the block group. Of course, an
added benefit of your suggestion is that the non-metaclustered
indirect blocks will likely be close to the metacluster.

I think this approach will help fsck but from a design point of view it
may not be good for IO performance. E.g., the reason sequential
read performance with metaclustering is same as regular is
because we prefetch co-located indirect blocks (I have verified this by
turning off prefetching). When allocating indirect
blocks outside the metacluster using the above scheme, co-location of
indirect blocks is still likely but less so. OTOH, by falling back to the
old-style allocation routines, we at least make sure that the indirect
block is close to its data block which is good for performance. I think
both approaches are pretty close. One thumb rule we've followed during
metaclustering design is: "when in doubt, favor IO performance over
fsck performance" so I tend to lean towards the latter approach.

> [More comments which I will incorporate in the code]
>
> > Allow me to propose a solution that will most likely address the above
> > issue and please ignore its complexity for a moment. Instead of a two
> > level partitioning in the block space between data blocks and
> > metacluster blocks, have a 3 or 4 level partitioning. E.g., a block
> > group with 'd' blocks can have d/32 blocks in metacluster level 1,
> > d/64 blocks in metacluster level 2, and d/128 blocks in metacluster
> > level 3 (define level 0 has having the remaining blocks = d - d/32 -
> > d/64 - d/128). Data block allocation starts looking for a free block
> > starting from the lowest possible level. If it is unable to find any
> > free blocks at that level in all block groups, it moves up a level and
> > so on. Indirect block allocation proceeds in the opposite direction
> > starting from higher levels. This approach has several benefits:
>
> That is clever.  Oh, one other thing.  You didn't mention what
> happened when the metacluster field was placed at the end of the block
> group.  I assume you tried that in your experiments; what were the
> results?  The obvious thing to do to avoid further fragmentation of
> the block group would be to put level 1 at the end of the block group,
> level 2 just before it, and level 3 before that, and then allocate the
> data blocks starting at the beginning of the block group, i.e:
>
> +----------------------------------+---------------+---------+-------+
> |     data                         | level 3       | level 2 | lvl 1 |
> +----------------------------------+---------------+---------+-------+
>

Thanks for this nice visualization :-)

I agree with your and Andreas' concern about fragmentation due to the
current scheme of putting metacluster in the middle of the block group.
Here are some stats concerning different metacluster locations:
- Placing metacluster at the end of the block group results in 2%
degradation in sequential reads from large files. Putting it at the
beginning improves sequential read performance by 0.5%.
- For random reads the beginning and ending configurations have
idential performance which is almost the same as regular ext3 performance
but 1% worse than the middle configuration.
- I haven't compared the different metacluster locations for sequential
reads from small files, but in general I've found the behavior to be
very similar to random reads from a large file.

So I think putting metacluster levels at the beginning of the block group
is an obvious choice.

>
> > In traditional metaclustering, once we run out of metacluster blocks
> > or data blocks, all bets are off. This forces us to keep small
> > metaclusters in order to avoid this situation altogether. But with small
> > metaclusters, we cannot optimize indirect block allocation on file
> > systems with many small files (>48KB).There is only one glitch in
> > implementing this. If a block group doesn't have any free blocks at a
> > given level, we should be able to find that out quickly instead of
> > having to scan its entire bitmap. gdp->bg_free_blocks_count is not good
> > enough for this.
>
> Ideally, true, but this was a defect with the original metacluster
> scheme as well.  We could steal some bits in the block_group
> descriptor structure to indicate whether a particular level is full,
> though.  This would be another data format change that would require
> e2fsprogs support, though.
>
> Regards,
>
>                                                 - Ted
>

Thanks,
Abhishekj
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/