Date:   Tue, 24 Jul 2018 17:00:04 +0200
From:   Jaco Kroon <jaco@....co.za>
To:     Jan Kara <jack@...e.cz>, linux-ext4 <linux-ext4@...r.kernel.org>
Cc:     Theodore Ts'o <tytso@....edu>
Subject: allowing ext4 file systems that wrapped inode count to continue
 working

Hi,

Related to https://www.spinics.net/lists/linux-ext4/msg61075.html (and
possibly the cause of the work from Jan in that patch series).

I have a 64TB (exactly) filesystem.

Filesystem OS type:       Linux
Inode count:              4294967295
Block count:              17179869184
Reserved block count:     689862348
Free blocks:              16910075355
Free inodes:              4294966285
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
RAID stride:              128
RAID stripe width:        128
First meta block group:   1152
Flex block group size:    16

Note that in the above, Inode count == 2^32 - 1 instead of the expected 2^32.

This means the correct inode count is exactly 2^32 (524288 block groups *
8192 inodes per group), which overflows a 32-bit counter to 0.  A kernel
bug (since fixed by Jan) allowed this overflow in the first place.
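
Purely for illustration, the arithmetic as a standalone snippet (not
e2fsprogs code; the 32-bit width just mirrors the on-disk s_inodes_count
field):

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint64_t block_count      = 17179869184ULL; /* from dumpe2fs above */
        uint64_t blocks_per_group = 32768;
        uint64_t inodes_per_group = 8192;

        uint64_t groups = block_count / blocks_per_group;  /* 524288           */
        uint64_t inodes = groups * inodes_per_group;       /* 4294967296, 2^32 */

        /* s_inodes_count on disk is 32 bits wide, so 2^32 wraps to 0. */
        uint32_t wrapped = (uint32_t)inodes;

        printf("groups=%" PRIu64 " inodes=%" PRIu64 " wrapped=%u\n",
               groups, inodes, wrapped);
        return 0;
}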

I'm busy trying to write a patch for e2fsck that (on top of the referenced
series by Jan) would allow fsck to at least clear the filesystem of other
errors; currently, if I hack the inode count to ~0U, fsck, tune2fs and
friends fail.

With the attached patch (sorry, Thunderbird breaks my inlining of
patches), tune2fs operates as expected (-l at least), and fsck gets to
pass 5, where it segfaults with the following stack trace (compiled with -O0):

/dev/exp/exp contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

Program received signal SIGSEGV, Segmentation fault.
0x00005555555ac8d1 in ext2fs_bg_flags_test (fs=0x555555811e90, group=552320, bg_flag=1)
    at blknum.c:445
445             return gdp->bg_flags & bg_flag;
(gdb) bt
#0  0x00005555555ac8d1 in ext2fs_bg_flags_test (fs=0x555555811e90, group=552320, bg_flag=1)
    at blknum.c:445
#1  0x000055555558c343 in check_inode_bitmaps (ctx=0x5555558112b0) at pass5.c:759
#2  0x000055555558a251 in e2fsck_pass5 (ctx=0x5555558112b0) at pass5.c:57
#3  0x000055555556fb48 in e2fsck_run (ctx=0x5555558112b0) at e2fsck.c:249
#4  0x000055555556e849 in main (argc=5, argv=0x7fffffffdfe8) at unix.c:1859
(gdb) print *gdp
$1 = {bg_block_bitmap = 528400, bg_inode_bitmap = 0, bg_inode_table = 528456,
  bg_free_blocks_count = 0, bg_free_inodes_count = 0, bg_used_dirs_count = 4000,
  bg_flags = 8, bg_exclude_bitmap_lo = 0, bg_block_bitmap_csum_lo = 0,
  bg_inode_bitmap_csum_lo = 8, bg_itable_unused = 0, bg_checksum = 0,
  bg_block_bitmap_hi = 528344, bg_inode_bitmap_hi = 0, bg_inode_table_hi = 528512,
  bg_free_blocks_count_hi = 0, bg_free_inodes_count_hi = 0,
  bg_used_dirs_count_hi = 4280, bg_itable_unused_hi = 8, bg_exclude_bitmap_hi = 0,
  bg_block_bitmap_csum_hi = 0, bg_inode_bitmap_csum_hi = 0, bg_reserved = 0}

... so I'm not sure why it even segfaults.  gdb can retrieve a value of
8 for bg_flags, and yet when the code performs that same dereference it
segfaults.  I'm not sure what the discrepancy is there - probably a
misunderstanding on my part of what's going wrong - but the only thing I
can see that could segfault is the gdp dereference, and that seems to be
a valid pointer ...

I am not sure if this is a separate issue, or due to me tampering with
the inode counter in the way that I am (I have to assume the latter).
For testing I created a thin volume (1TB) in a separate environment,
created a 16TB filesystem on it initially, and then expanded that to
64TB, resulting in exactly the same symptoms we saw in the production
environment.  I created a thousand empty files in the root folder.  The
filesystem is currently consuming 100GB on-disk in the thin volume.
Note that group=552320 > 524288 (= 17179869184 / 32768, the number of
block groups).
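
For illustration only, a hedged sketch (self-contained, hypothetical
names, not actual e2fsprogs code) of the kind of bounds check that would
turn that wild descriptor read into a detectable error instead of a
segfault:

#include <stdint.h>
#include <stdio.h>

/* Hypothetical, minimal stand-in for a group descriptor; only the field
 * relevant to the crash is modelled. */
struct group_desc { uint16_t bg_flags; };

/* Sketch of the guard: refuse to look up a descriptor whose group index
 * is >= the number of groups, instead of dereferencing it blindly. */
static int bg_flags_test_checked(const struct group_desc *descs,
                                 uint64_t group_count,   /* 524288 here         */
                                 uint64_t group,         /* 552320 in the crash */
                                 uint16_t bg_flag)
{
        if (group >= group_count)
                return -1;   /* out of range: would have been a wild read */
        return (descs[group].bg_flags & bg_flag) != 0;
}

int main(void)
{
        struct group_desc descs[4] = { { .bg_flags = 8 } };

        printf("%d\n", bg_flags_test_checked(descs, 4, 0, 1));  /* in range */
        printf("%d\n", bg_flags_test_checked(descs, 4, 9, 1));  /* rejected */
        return 0;
}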

Regarding further expansion, I would appreciate some advice.  There are
two (three) possible options that I could come up with:

1.  Find a way to reduce the number of inodes per group (say to 4096,
which would require re-allocating all inodes >= 2^31 to inodes <2^31).

2.  Allow adding additional blocks to the filesystem, without adding
additional inodes.

(3. Find some free space, create a new filesystem, and iteratively move
data from the one to the other, shrinking and growing the filesystems as
the move progresses - I will never be able to move more data than what is
currently available on the system, around 4TB in my case, so this will
take a VERY long time).

I'm currently aiming for option 2 since that looks to be the simplest.
Simply allow the overflow to happen, but don't allocate additional inodes
if the inode count is already ~0U.
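
To make option 2 concrete, a rough sketch of the saturation I have in
mind (hypothetical names, not actual resize2fs/kernel code): when growing
the filesystem, clamp the inode count at ~0U instead of letting it wrap,
and stop allocating inodes for new groups once that cap is hit.

#include <stdint.h>
#include <stdio.h>

/* Sketch only, hypothetical helper: inode count for a grown filesystem,
 * saturating at ~0U instead of wrapping the 32-bit counter to 0. */
static uint32_t grown_inode_count(uint64_t group_count, uint64_t inodes_per_group)
{
        uint64_t want = group_count * inodes_per_group;

        if (want > UINT32_MAX)
                return UINT32_MAX;   /* cap at ~0U; allocate no further inodes */
        return (uint32_t)want;
}

int main(void)
{
        /* 524288 groups * 8192 inodes/group == 2^32: clamps to 4294967295 */
        printf("%u\n", grown_inode_count(524288, 8192));
        return 0;
}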

Kind Regards,
Jaco


View attachment "0001-Allow-opening-a-filesystem-with-maxed-out-inode-coun.patch" of type "text/x-patch" (2118 bytes)
