linux-ext4 - 64bit inode number and dynamic inode table for ext4

Message-Id: <1174585546.16068.50.camel@localhost.localdomain>
Date:	Thu, 22 Mar 2007 09:45:46 -0800
From:	Mingming Cao <cmm@...ibm.com>
To:	Andreas Dilger <adilger@...sterfs.com>, tytso@....edu
Cc:	linux-ext4@...r.kernel.org
Subject: 64bit inode number and dynamic inode table for ext4

On Wed, 2007-03-21 at 19:53 -0700, Avantika Mathur wrote:
> Ext4 Developer Interlock Call: 03/21/2007 Meeting Minutes
...
> 64 bit Inode and Dynamic Inode Table Discussion:
> - Though this feature has been discussed for many years, there does not seem to be high demand currently for 64 bit inode numbers, but it is a problem which will eventually arise.

The benefit of a dynamic inode table is clear: not only could it scale up
the number of inodes (and so files) the fs can support, it could also help
speed up fsck, since only inodes that are in use are stored in the fs. fsck
scalability is in higher demand now that ext4 can support larger
filesystems.

> - If this incompat feature is implemented, there are many other changes that need to be considered.
> - Mingming and Ted suggested the inode number could be based on block number, with 48 bits for block number, and 5-7 bits for the offset; to directly point to the inode location.

Here is the basic idea about the dynamic inode table:

With the default 4k filesystem block size, an inode table block can store 16
256-byte inode structures (16*256 = 4k). To avoid fragmenting the inode table
blocks, we could allocate a cluster of contiguous blocks for inode
tables at run time, say 4 blocks (16*4 = 64 inodes) or 8 blocks
(16*8 = 128 inodes) at a time.
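
As a rough sketch of the arithmetic for 256-byte inodes in a 4k block
(the names and the 8-block cluster size are only assumptions for
illustration, not existing ext4 code):

```python
# Hypothetical constants for illustration only.
BLOCK_SIZE = 4096      # default 4k filesystem block
INODE_SIZE = 256       # on-disk inode structure size
CLUSTER_BLOCKS = 8     # assumed number of contiguous inode table blocks per allocation


def inodes_per_block(block_size=BLOCK_SIZE, inode_size=INODE_SIZE):
    """Number of inode structures that fit in one inode table block."""
    return block_size // inode_size


def inodes_per_cluster(cluster_blocks=CLUSTER_BLOCKS,
                       block_size=BLOCK_SIZE, inode_size=INODE_SIZE):
    """Number of inodes gained by allocating one cluster of table blocks."""
    return cluster_blocks * inodes_per_block(block_size, inode_size)


def offset_bits(block_size=BLOCK_SIZE, inode_size=INODE_SIZE):
    """Bits needed to index an inode slot within one inode table block."""
    return (inodes_per_block(block_size, inode_size) - 1).bit_length()
```

With 128-byte inodes the same block holds 32 inodes, which is where a
5 bit offset would come from.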

To efficiently allocate and deallocate inode structures, we could link
all free/used inode structures within the block group and store the
first free/used inode number in the block group descriptor. 
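
A minimal sketch of the free-list half of this idea, assuming each free
inode slot stores the number of the next free one and the group
descriptor holds the head. The class and field names are hypothetical,
and a dict stands in for the on-disk link field:

```python
class GroupDesc:
    """Toy block group descriptor holding the head of a free-inode list."""

    def __init__(self):
        self.first_free = None   # inode number of the first free slot, if any
        self.next_free = {}      # stand-in for the link stored in each free inode

    def release(self, ino):
        """Put an inode slot on the free list (also used to add new slots)."""
        self.next_free[ino] = self.first_free
        self.first_free = ino

    def alloc(self):
        """Take the first free inode slot, or return None if the list is empty."""
        ino = self.first_free
        if ino is None:
            return None          # caller would allocate a new cluster of table blocks
        self.first_free = self.next_free.pop(ino)
        return ino
```

Both operations only touch the descriptor and one inode, so allocation
and deallocation stay constant time regardless of how many inode table
blocks the group has.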

There are some safety concerns with dynamic inode table allocation in the
case of block group corruption.  This could be addressed by checksumming
the block group descriptor.
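
To illustrate what checksumming the descriptor buys us, here is a small
sketch. The descriptor layout (two 64 bit list heads) and the use of
CRC32 are assumptions made for the example, not a proposed on-disk
format:

```python
import struct
import zlib

# Hypothetical descriptor payload: first free ino, first used ino (little endian).
DESC_FMT = "<QQ"


def desc_checksum(first_free, first_used):
    """CRC32 over the packed descriptor fields (CRC32 is an assumed choice)."""
    payload = struct.pack(DESC_FMT, first_free, first_used)
    return zlib.crc32(payload) & 0xFFFFFFFF


def make_desc(first_free, first_used):
    """Build a descriptor dict with its checksum filled in."""
    return {
        "first_free": first_free,
        "first_used": first_used,
        "checksum": desc_checksum(first_free, first_used),
    }


def desc_is_valid(desc):
    """Recompute the checksum and compare it with the stored one."""
    return desc["checksum"] == desc_checksum(desc["first_free"],
                                             desc["first_used"])
```

A descriptor whose list heads were corrupted would fail desc_is_valid(),
so we would not follow a bad pointer into the inode lists.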

With a dynamic inode table, the block that stores an inode structure is
no longer at a fixed location. One idea to efficiently map the inode
number to the block storing the corresponding inode structure is to encode
the block number directly into the inode number. This implies using 64
bit inode numbers. The low 4-5 bits of the inode number store the offset
within the inode table block, and the remaining 59 bits are enough to
store the 48 bit block number, or a 32 bit block group number + the
relative block number within the group:

63               47               31          20             4      0
----------------|-----------------------------|--------------|------|
|               | 32bit group #               |   15 bit     | 5bit |
|               |                             |   blk #      |offset|
----------------|-----------------------------|--------------|------|
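
The layout in the diagram can be sketched as follows. The function names
are made up for illustration; the field widths are the ones from the
diagram (5 bit offset, 15 bit relative block number, 32 bit group
number):

```python
OFFSET_BITS = 5    # inode slot within the inode table block
BLK_BITS = 15      # block number relative to the start of the group
GROUP_BITS = 32    # block group number


def encode_ino(group, blk, offset):
    """Pack (group, relative block, offset) into one inode number."""
    if not 0 <= group < (1 << GROUP_BITS):
        raise ValueError("group out of range")
    if not 0 <= blk < (1 << BLK_BITS):
        raise ValueError("block out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset out of range")
    return (group << (BLK_BITS + OFFSET_BITS)) | (blk << OFFSET_BITS) | offset


def decode_ino(ino):
    """Recover (group, relative block, offset) from an inode number."""
    offset = ino & ((1 << OFFSET_BITS) - 1)
    blk = (ino >> OFFSET_BITS) & ((1 << BLK_BITS) - 1)
    group = ino >> (OFFSET_BITS + BLK_BITS)
    return group, blk, offset
```

Locating the inode on disk is then just a decode plus the group's start
block, with no inode table lookup needed.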

The bigger concern is possible inode number collision if we choose 64
bit inode numbers.  Although the Linux kernel VFS layer has by now been
fixed to handle 64 bit inode numbers, applications that still use the
32 bit stat() to access inode numbers could break.  It is unclear how
common this case is, and whether by now applications have been fixed to
use the 64 bit stat64().

One solution is to avoid generating inode numbers >= 2**32 on 32 bit
platforms. Since ext4 can only address a 16TB fs on 32 bit archs, the max
number of groups is 2**17 (2**17 * 2**15 blocks = 2**32 blocks = 16TB (on
4k blk)). If we force inode table blocks to be allocated only within the
first 2**10 blocks of a block group, the inode number looks like this:

63               47               31               15          4      0
----------------|----------------|------------------|---------|------|
|               | High 15 bit    |low 17bit grp #   |10 bit   | 5bit |
|               | grp #          |                  |blk #    |offset|
----------------|----------------|------------------|---------|------|


Then on 32 bit platforms, the inode number is always <2**32. So even if
the inode number on the fs is 64 bit, since its high 32 bits are always 0,
user applications using stat() will get a unique inode number.
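
A quick sketch to check the claim for this layout (5 bit offset, 10 bit
relative block number, group number split into low 17 and high 15 bits).
The helper names are hypothetical:

```python
OFFSET_BITS = 5      # inode slot within the inode table block
BLK_BITS = 10        # inode table blocks limited to the first 2**10 blocks of a group
GROUP_LO_BITS = 17   # low part of the group number, lands in bits 15..31
GROUP_HI_BITS = 15   # high part of the group number, lands in bits 32..46


def encode_ino(group, blk, offset):
    """Pack (group, relative block, offset) using the split-group layout."""
    if not 0 <= group < (1 << (GROUP_LO_BITS + GROUP_HI_BITS)):
        raise ValueError("group out of range")
    if not 0 <= blk < (1 << BLK_BITS):
        raise ValueError("block out of range")
    if not 0 <= offset < (1 << OFFSET_BITS):
        raise ValueError("offset out of range")
    group_lo = group & ((1 << GROUP_LO_BITS) - 1)
    group_hi = group >> GROUP_LO_BITS
    return ((group_hi << 32)
            | (group_lo << (BLK_BITS + OFFSET_BITS))
            | (blk << OFFSET_BITS)
            | offset)


def fits_in_32_bits(ino):
    """True if a 32 bit stat() could report this inode number unchanged."""
    return ino < (1 << 32)
```

The largest inode number on a 16TB fs (group 2**17 - 1, block 2**10 - 1,
offset 31) comes out as exactly 2**32 - 1; only groups >= 2**17 set any
of the high 32 bits.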

On 64 bit platforms, there should not be a collision issue for 64 bit
applications. For 32 bit applications running on 64 bit platforms,
hopefully they are fixed by now. Or we could force the inode table blocks
to be allocated within the first 16TB of the fs, since we need meta block
groups to support >256TB fs anyway, and that already keeps the inode
structures apart from the data blocks.


> 	- Andreas is concerned about inode relocation, it would take a lot of effort; because references to the inode would have to be updated.  

I am not clear about this concern. Andreas, are you worried about online
defrag? I thought online defrag only transfers the extent maps from the
temp inode to the original inode; we do not transfer the inode number or
structure.


> 	- Another option Andreas suggested is the inode number be an offset in an inode table.  The table could be virtually mapped around the filesystem, and also be defragmented.
> 	- Ted believes that this could be used as a faster way of dealing with the 32 bit stat problem, because the logical block numbers that the inode number represents could be used to see what the 32 bit inode number would be.
> - There are many issues to address before 64 bit inodes can be fully implemented, Andreas sees this feature as a very long term future plan.

I agree there are many ext4 features that could be done in the short term,
but think back to why we have ext4: it was initially started to address
scalability issues: fs size limits and large file performance (the 32 bit
block number issue and extents). It was cloned from ext3 mostly for
political reasons, but having a new fs also allows us to design ext4 with a
longer view.  Since we are already in ext4, and it is still called ext4dev,
why postpone this until later? Considering how long it took ext3 to go from
start to stable, and then for ext4 to start with extents and 48/64 bit block
numbers (10 years?), I think ext5 is at least 10 years away. There are
customers already using millions or billions of files today, or even asking
for trillions of files, so it could be an issue that hits us within a few
years.


Regards,

Mingming
