Message-id: <20080925223731.GM10950@webber.adilger.int>
Date: Thu, 25 Sep 2008 16:37:31 -0600
From: Andreas Dilger <adilger@....com>
To: Alex Tomas <bzzz@....com>
Cc: ext4 development <linux-ext4@...r.kernel.org>
Subject: Re: [RFC] dynamic inodes
On Sep 24, 2008 15:46 +0400, Alex Tomas wrote:
> another idea how to achieve more (dynamic) inodes:
Actually, José proposed a _very_ simple idea that would allow dynamic
inodes with relatively low code complexity or risk due to dynamic
placement of inode tables.
The basic idea is to extend the FLEX_BG feature so that (essentially)
"blockless groups" can be added to the filesystem when the inodes are
all gone. The core idea of FLEX_BG is that the "group metadata" (inode
and block bitmaps, inode table) can be placed anywhere in the filesystem.
This implies that a "block group" is strictly just a contiguous range of
blocks, and somewhere in the filesystem is the metadata that describes its
usage.
If one adds a new group (ostensibly "at the end of the filesystem") that
has a flag which indicates there are no blocks available in the group,
then what we get is the inode bitmap and inode table, with a 1-block
"excess baggage" of the block bitmap and a new group descriptor. The
"baggage" is small considering any overhead needed to locate and describe
fully dynamic inode tables.
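
To put rough numbers on the "baggage", here is a small sketch of what one
blockless group would cost on disk. The 4096-byte block size, 256-byte
inode, 32-byte group descriptor and 8192 inodes per group are assumed
example values, and blockless_group_cost() is a hypothetical helper, not
anything from the kernel or e2fsprogs.

```python
def blockless_group_cost(inodes_per_group, inode_size=256,
                         block_size=4096, desc_size=32):
    """Return the on-disk cost of one blockless group (assumed geometry)."""
    # One bitmap block holds 8 * block_size bits, which caps inodes per group.
    assert inodes_per_group <= 8 * block_size
    itable_bytes = inodes_per_group * inode_size
    itable_blocks = -(-itable_bytes // block_size)  # ceiling division
    return {
        "itable_blocks": itable_blocks,
        "inode_bitmap_blocks": 1,
        "block_bitmap_blocks": 1,       # the 1-block "excess baggage"
        "descriptor_bytes": desc_size,  # the new group descriptor
    }


cost = blockless_group_cost(8192)
total_blocks = (cost["itable_blocks"] + cost["inode_bitmap_blocks"]
                + cost["block_bitmap_blocks"])
# Share of the new group's blocks that is pure baggage (unused block bitmap).
baggage_fraction = cost["block_bitmap_blocks"] / total_blocks
```

With these assumed values the inode table is 2MB and the unused block
bitmap is well under 1% of the blocks added.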
A big plus is that there are very few changes needed to the kernel or
e2fsck (the "dynamic inode table" is just a group which has no space
for data). Some quick checks on 10 filesystems (some local, some
server) show that there is enough contiguous space in the filesystems
to allocate a full inode table (between 1 and 4MB for most filesystems),
and mballoc can help with this. This makes sense because a shortage of
inodes also means there is an excess of space, and if the inodes were
under-provisioned it also (usually) means the itable is on the smaller
side.
Another important benefit is that the 32-bit inode space is used fully
before there is any need to grow to 64-bit inodes. This avoids the
compatibility issues with userspace to the maximum possible extent,
without any complex remapping of inode numbers.
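
As a rough illustration of why no remapping is needed: inode numbers keep
the usual "group * inodes_per_group + index + 1" layout, so a dynamic group
just extends the same arithmetic until the 32-bit space runs out. The
sketch below assumes an example filesystem with 8192 existing groups and
8192 inodes per group; the function names are made up for illustration.

```python
MAX_INODES = 2**32 - 1  # largest inode count that fits in 32 bits


def inode_location(ino, inodes_per_group):
    """Map a 1-based inode number to (group, index within the group)."""
    return (ino - 1) // inodes_per_group, (ino - 1) % inodes_per_group


def dynamic_group_headroom(existing_groups, inodes_per_group):
    """How many more groups fit before inode numbers overflow 32 bits."""
    max_groups = MAX_INODES // inodes_per_group
    return max(0, max_groups - existing_groups)


# Assumed example geometry: 8192 normal groups, 8192 inodes in each.
headroom = dynamic_group_headroom(8192, 8192)
# The first inode of the first dynamic group lands in group 8192, slot 0.
first_dynamic = inode_location(8192 * 8192 + 1, 8192)
```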
We could hedge our bets for finding large enough contiguous itable space
and allow the itable to be smaller than normal, and mark the end inodes
as in-use. e2fsck will in fact consider any blocks under the rest of
the inode table as "shared blocks" and do duplicate block processing to
remap the data blocks. We could also leverage online defrag to remap
the blocks before allocating the itable if there isn't enough space.
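
A minimal sketch of the "smaller than normal" itable case, assuming the
inode bitmap uses the usual little-endian bit order and that the tail
inodes with no backing itable blocks are simply marked in-use. The helper
name and the example numbers (300 contiguous blocks found, 8192 inodes per
group, 256-byte inodes, 4096-byte blocks) are assumptions for illustration.

```python
def short_itable_bitmap(inodes_per_group, avail_blocks,
                        inode_size=256, block_size=4096):
    """Return (usable inode count, inode bitmap) for a truncated itable."""
    inodes_per_block = block_size // inode_size
    usable = min(inodes_per_group, avail_blocks * inodes_per_block)
    bitmap = bytearray(-(-inodes_per_group // 8))  # one bit per inode
    # Inodes past the end of the short itable are marked in-use so the
    # allocator never hands them out.
    for i in range(usable, inodes_per_group):
        bitmap[i // 8] |= 1 << (i % 8)
    return usable, bitmap


usable, bitmap = short_itable_bitmap(8192, avail_blocks=300)
reserved = sum(bin(byte).count("1") for byte in bitmap)
```

Here usable + reserved always equals inodes_per_group, so the group still
looks full-sized to anything that only reads the bitmap.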
Another major benefit of this approach is that the "dynamic" inode table
is actually relatively static in location, and we don't need a tree to
find it. We would continue to use the "normal" group inodes first, and
only add dynamic groups if there are no free inodes. It would also be
possible to remove the last dynamic group if all its inodes are freed.
The itable location would be replicated to all of the group descriptor
backups for safety, though we would need to find a way for "META_BG"
to store a backup of the GDT in blocks that don't exist, in the case
where increasing the GDT size in-place isn't possible.
The drawback of the approach is relatively coarse-grained itable
allocation, which would fail if the filesystem is highly fragmented,
but we don't _have_ to succeed either. The coarse-grained approach is
also a benefit because we don't need complex data structures to find the
itable, it reduces seeking during e2fsck, and we can keep some hysteresis
in adding/removing dynamic groups to reduce overhead (updates of many
GDT backups).
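
The policy above (normal groups first, add a dynamic group only when no
inodes are free, drop the last dynamic group once it is empty, with some
hysteresis) could be modelled roughly as below. This is a toy simulation
that only tracks free counts per group. The class, the keep_free threshold
and the rule "only remove the empty last group once at least keep_free
inodes are free elsewhere" are assumptions made for illustration, not a
worked-out design.

```python
class InodeGroups:
    def __init__(self, static_groups, inodes_per_group, keep_free):
        self.ipg = inodes_per_group
        self.static_groups = static_groups
        self.keep_free = keep_free  # hysteresis threshold (assumed rule)
        self.free = [inodes_per_group] * static_groups  # free count per group

    def alloc(self):
        """Allocate one inode and return the group it came from."""
        # List order means normal groups are tried before dynamic ones.
        for group, nfree in enumerate(self.free):
            if nfree > 0:
                self.free[group] -= 1
                return group
        # No free inodes anywhere: add a dynamic (blockless) group.
        # A real implementation could fail here if no itable space is found.
        self.free.append(self.ipg - 1)
        return len(self.free) - 1

    def release(self, group):
        """Free one inode in the given group, then trim empty dynamic groups."""
        assert self.free[group] < self.ipg
        self.free[group] += 1
        # Remove the last dynamic group only if it is completely free AND
        # enough free inodes remain elsewhere, to avoid add/remove thrashing.
        while (len(self.free) > self.static_groups
               and self.free[-1] == self.ipg
               and sum(self.free[:-1]) >= self.keep_free):
            self.free.pop()
```

With keep_free set to 0 the last dynamic group is dropped as soon as it is
empty. A larger value trades some idle itable space for fewer GDT backup
updates.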
Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.