Date:	Thu, 25 Sep 2008 20:10:39 -0500
From:	"Jose R. Santos" <jrs@...ibm.com>
To:	Andreas Dilger <adilger@....com>
Cc:	Alex Tomas <bzzz@....com>,
	ext4 development <linux-ext4@...r.kernel.org>
Subject: Re: [RFC] dynamic inodes

On Thu, 25 Sep 2008 16:37:31 -0600
Andreas Dilger <adilger@....com> wrote:

> On Sep 24, 2008  15:46 +0400, Alex Tomas wrote:
> > another idea for how to achieve more (dynamic) inodes:
> 
> Actually, José proposed a _very_ simple idea that would allow dynamic
> inodes with relatively little of the code complexity or risk that
> comes from fully dynamic placement of inode tables.
> 
> The basic idea is to extend the FLEX_BG feature so that (essentially)
> "blockless groups" can be added to the filesystem when the inodes are
> all gone.  The core idea of FLEX_BG is that the "group metadata" (inode
> and block bitmaps, inode table) can be placed anywhere in the filesystem.
> This implies that a "block group" is strictly just a contiguous range of
> blocks, and somewhere in the filesystem is the metadata that describes its
> usage.
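> 
> To make this concrete: the group descriptor already records the
> metadata locations as plain block numbers, so nothing ties them to
> the group they describe.  A trimmed-down sketch (abbreviated field
> names, not the exact on-disk layout):
> 
> 	/* simplified view of struct ext4_group_desc */
> 	struct group_desc {
> 		__le32	bg_block_bitmap;	/* may point anywhere */
> 		__le32	bg_inode_bitmap;	/* may point anywhere */
> 		__le32	bg_inode_table;		/* first itable block */
> 		__le16	bg_free_blocks_count;
> 		__le16	bg_free_inodes_count;
> 		__le16	bg_used_dirs_count;
> 		__le16	bg_flags;		/* INODE_UNINIT, etc. */
> 	};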
> 
> If one adds a new group (ostensibly "at the end of the filesystem") that
> has a flag which indicates there are no blocks available in the group,
> then what we get is the inode bitmap and inode table, with a 1-block
> "excess baggage" of the block bitmap and a new group descriptor.  The
> "baggage" is small considering any overhead needed to locate and describe
> fully dynamic inode tables.
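> 
> One cheap way to express "this group has no blocks" would be a new
> bg_flags bit next to the existing ones; a hypothetical sketch (the
> NO_BLOCKS name is made up, untested):
> 
> 	#define EXT4_BG_INODE_UNINIT	0x0001	/* existing */
> 	#define EXT4_BG_BLOCK_UNINIT	0x0002	/* existing */
> 	#define EXT4_BG_NO_BLOCKS	0x0008	/* new: no data blocks at all */
> 
> 	/* the block allocator would then simply skip such groups */
> 	if (gdp->bg_flags & cpu_to_le16(EXT4_BG_NO_BLOCKS))
> 		continue;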
> 
> A big plus is that there are very few changes needed to the kernel or
> e2fsck (the "dynamic inode table" is just a group which has no space
> for data).  Some quick checks on 10 filesystems (some local, some
> server) shows that there is enough contiguous space in the filesystems
> to allocate a full inode table (between 1-4MB for most filesystems), and
> mballoc can help with this.  This makes sense because the cases where
> there is a shortage of inodes also means there is an excess of space,
> and if the inodes were under-provisioned it also (usually) means the
> itable is on the smaller side.
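> 
> For reference, a back-of-the-envelope with mke2fs defaults assumed
> (4KB blocks, 8192 bytes per inode, 256-byte inodes):
> 
> 	inodes_per_group = 32768 blocks * 4096 / 8192  = 16384
> 	itable_size      = 16384 * 256 bytes           = 4 MB
> 	                   (2 MB with 128-byte inodes)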
> 
> Another important benefit is that the 32-bit inode space is used fully
> before there is any need to grow to 64-bit inodes.  This avoids the
> compatibility issues with userspace to the maximum possible extent,
> without any complex remapping of inode numbers.
> 
> We could hedge our bets on finding large enough contiguous itable space
> by allowing the itable to be smaller than normal and marking the
> trailing inodes as in-use.  e2fsck will in fact consider any blocks
> under the rest of the inode table as "shared blocks" and do duplicate
> block processing to remap the data blocks.  We could also leverage
> online defrag to remap the blocks before allocating the itable if
> there isn't enough space.
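> 
> Recording a short itable could be as simple as pre-marking the tail
> of the inode bitmap so those inodes are never handed out; a rough
> sketch (assuming the table only covers 'usable' of the group's
> inodes):
> 
> 	/* inodes beyond the shortened table are permanently "in use" */
> 	for (i = usable; i < EXT4_INODES_PER_GROUP(sb); i++)
> 		ext4_set_bit(i, inode_bitmap);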
> 
> Another major benefit of this approach is that the "dynamic" inode table
> is actually relatively static in location, and we don't need a tree to
> find it.  We would continue to use the "normal" group inodes first, and
> only add dynamic groups if there are no free inodes.  It would also be
> possible to remove the last dynamic group if all its inodes are freed.
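> 
> In allocator terms this stays a simple two-step policy; pseudo-C
> (all function names invented):
> 
> 	ino = alloc_inode_from_static_groups(sb);
> 	if (!ino) {
> 		/* normal groups exhausted: append a blockless group */
> 		group = ext4_add_dynamic_group(sb);
> 		ino = alloc_inode_from_group(sb, group);
> 	}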
> 
> The itable location would be replicated to all of the group descriptor
> backups for safety, though we would need to find a way for "META_BG"
> to store a backup of the GDT in blocks that don't exist, in the case
> where increasing the GDT size in-place isn't possible.

One way to get around this is to implement the exact opposite of what I
proposed earlier and have block groups with no inode table.  If we use
a 1:1 inode-per-block ratio and don't allocate inode tables for a
series of block groups within a flexbg, we could later attempt to
allocate new inode tables when we run out of inodes.  If we leave
holes in the inode number space for the missing inode tables, adding
new inode tables to these block groups would not require any inode
renumbering.  This also does not break the current inode allocator,
which would be a good thing, and it should be even simpler to
implement than the previous proposal.  One drawback is that with a
1:1 inode-per-block ratio each block group's inode table is bigger,
so allocating a new one means finding a larger chunk of contiguous
blocks.  Since the current inode allocator tries to keep 10% of the
blocks in a flexbg free, finding contiguous blocks may not be a big
issue.  Another issue is 64-bit filesystems: with a 1:1 scheme, a
filesystem of more than 2^32 blocks would need more than 2^32 inode
numbers.
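
Since ext4 computes inode numbers from the filesystem geometry rather
than from which tables happen to exist, the holes fall out naturally.
Roughly (this is the existing mapping, unchanged by this scheme):

	ino   = group * inodes_per_group + index_in_group + 1;
	group = (ino - 1) / inodes_per_group;
	index = (ino - 1) % inodes_per_group;

A group whose itable was never allocated simply contributes no live
inodes in its slice of the number space; materializing the table
later makes those numbers usable without renumbering anything.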

This would work like uninitialized inode tables, with the added steps
of finding free blocks, allocating the new inode table, and zeroing
it.  Since we could choose to allocate a new inode table in the
flexbg with the most free blocks, this would keep filesystem
metadata and data consistently close together and maintain
predictable performance.  This option also adds no overhead compared
to the previous proposal.
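
In code terms this is only a little machinery on top of what
uninitialized block groups already need; a rough pseudo-C sketch
(all names invented, untested):

	/* pick the flexbg with the most free blocks, carve the new
	 * itable out of it, then zero it as for uninit itables */
	fbg  = flexbg_with_most_free_blocks(sb);
	pblk = alloc_contiguous_blocks(sb, fbg, itable_blocks);
	zero_itable_blocks(sb, pblk, itable_blocks);
	gdp->bg_inode_table = cpu_to_le32(pblk);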

> 
> The drawback of the approach is relatively coarse-grained itable
> allocation, which would fail if the filesystem is highly fragmented,
> but we don't _have_ to succeed either.  The coarse-grained approach is
> also a benefit because we don't need complex data structures to find the
> itable, it reduces seeking during e2fsck, and we can keep some hysteresis
> in adding/removing dynamic groups to reduce overhead (updates of many
> GDT backups).
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
> 

-JRS