Date:	Fri, 25 Feb 2011 18:40:02 -0500
From:	Ted Ts'o <tytso@....edu>
To:	linux-ext4@...r.kernel.org
Subject: Re: Proposed design for big allocation blocks for ext4

On Fri, Feb 25, 2011 at 01:59:25PM -0800, Joel Becker wrote:
> 
> 	Why not call it a 'cluster' like the rest of us do?  The term
> 'blocksize' is overloaded enough already.

Yes, good point.  Allocation cluster makes a lot more sense as a name.

> > 3) mballoc.c will need little or no changes, other than the
> > EXT4_BLOCKS_PER_GROUP()/EXT4_ALLOC_BLOCKS_PER_GROUP() audit discussed
> > in (1).
> 
> 	Be careful in your zeroing.  A new allocation block might have
> pages at its front that are not part of the write() or mmap().  You'll
> either need to keep track that they are uninitialized, or you will have
> to zero them in write_begin() (ocfs2 does the latter).  We've had quite
> a few tricky bugs in this area, because the standard pagecache code
> handles the pages covered by the write, but the filesystem has to handle
> the new pages outside the write.

We're going to keep track of what blocks are uninitialized or not on a
4k basis.  So that part of the ext4 code doesn't change.

That being said, one of my primary design mantras for ext4 is, "we're
not going to optimize for sparse files".  They should work for
correctness' sake, but if the file system isn't at its most performant
in the case of sparse files, I'm not going to shed any tears.

> 	It's a huge win for anything needing large files, like database
> files or VM images.  mkfs.ocfs2 has a vmimage mode just for this ;-)
> Even with good allocation code and proper extents, a long-lived
> filesystem with 4K clusters just gets fragmented.  This leads to later
> files being very discontiguous, which is slow to do I/O to.  I think this
> is much more important than the simple speed-of-allocation win.

Yes, very true.

> > Directories will also be allocated in chunks of the allocation block
> > size.  If this is especially large (such as 1 MiB), and there are a
> > large number of directories, this could be quite expensive.
> > Applications which use multi-level directory schemes to keep
> > directories small to optimize for ext2's very slow large directory
> > performance could be especially vulnerable.
> 
> 	Anecdotal evidence suggests that directories often benefit with
> clusters of 8-16K size, but suffer greatly after 128K for precisely the
> reasons you describe.  We usually don't recommend clusters greater than
> 32K for filesystems that aren't expressly for large things.

Yes.  I'm going to assume that file systems optimized for large files
are (in general) not going to have lots of directories, and even if
they do, chewing up a megabyte for a directory isn't that big of a
deal if you're talking about a 2-4TB disk.

We could add complexity to do suballocations for directories, but KISS
seems to be a much better idea for now.

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
