[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20081125185254.GB525@mit.edu>
Date: Tue, 25 Nov 2008 13:52:54 -0500
From: Theodore Tso <tytso@....edu>
To: Solofo.Ramangalahy@...l.net
Cc: linux-ext4@...r.kernel.org
Subject: Re: [RFC 0/2] ext4: zero uninitialized inode tables
On Tue, Nov 25, 2008 at 01:28:47PM +0100, Solofo.Ramangalahy@...l.net wrote:
> * This could probably be made into a module, because it is not often in use.
> and this sentence from the OLS'02 paper
> "Since the online resizing code is only used very rarely, it would
> be possible to put the bulk of this code into a separate module that
> is only loaded when a resize operation is done."
> Inode zeroing is done only once in a filesystem lifetime (and each
> time it is resized).
Sure, but (a) zeroing the inode table should not be much code;
especially since we need to zero contiguous block ranges in extents.c
to deal with uninitialized extents. Also (b) modules have their own
cost; they waste on average PAGE_SIZE/2 worth of memory per module due
to internal fragmentation, and cause extra entries in the TLB cache
(on an x86, the entire kernel uses a single TLB entry, but each 4k
page used by a module burns a separate TLB entry), and (c) loading
modules is slow and serializes the kernel at boot time. In addition,
(d) some users simply prohibit modules for policy reasons.
So I could imagine adding module support, but it has to work built
into the kernel is critical, and this will probably the primary way it
will be compiled. And you need to make sure the kernel tries to
automatically load the module when a resize or a new filesystem is
mounted, if it is compiled as a module.
> I was also thinking that there may be other places to do it. For
> example, zeroing the inode table where the inode bitmap is initialized
> (ext4_init_inode_bitmap() called only once in
> ext4_read_inode_bitmap()).
When we first allocate an inode in the inode table and when we need to
zero it is largely unrelated. The problem is that e2fsck doesn't want
to trust the inode bitmap as being accurate; nor can it necessarily
trust the bg_inodes_unused field --- note particularly that this is
not updated in the backup group descriptors. Hence the window of
vulnerability has nothing to do with whether or not we have started
using a particular part of the inode table, but because the inode
table has not been initialized.
So we want to get the inode table fully initialized as soon as
possible, although we have to balance this with not impacting the
system's performance.
> The reasoning would have been to zero as soon as it is known to be
> needed:
> . without deferring it to the threads,
> . decreasing the probability of zeroing competing with other code
A block group's inode table can be 2-4 megabytes; and zeroing out that
many disk blocks can take a noticeable amount of time, so doing it
synchronously with an inode creation doesn't seem like a great idea to
me.....
> > > . fix the resize inode case
> >
> > Not sure what problem you were having here?
>
> With resize inode, the obtained filesystem is corrupted, fsck says
> "Resize inode not valid. Recreate?"
> as well as:
> "Free blocks count wrong for group #0 (6788, counted=6789)."
Something really bogus must be happening; the resize inode is in block
group 0, which always has some inodes (and therefore would never be
touched by your patch, which only initializes completely empty inode
tables). So if it managed to corrupt the resize inode, that's
especially worrisome.
> Appart from the data structures change you mentionned, these changes
> were discussed:
> . a mount option to disable the threads when doing testing/performance
> benchmarking
Yes, this makes sense. The administrator can always remount the filesystem
with a mount option to re-enable itable initialization if the
sysadmin wants to zero the inode table later on.
> . a flag in s_flags field of struct ext4_super_block to indicate that
> the zeroing has been done on all the groups. Possibly reset with
> resize.
This doesn't make that much sense to me. It's not that hard to
iterate through all of the block groups descriptors checking for the
EXT4_BG_INODE_ZEROED flag.
- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Powered by blists - more mailing lists