Date:   Fri, 12 Jul 2019 15:19:03 -0400
From:   Thomas Walker <Thomas.Walker@...sigma.com>
To:     Theodore Ts'o <tytso@....edu>
CC:     Geoffrey Thomas <Geoffrey.Thomas@...sigma.com>,
        'Jan Kara' <jack@...e.cz>,
        "'linux-ext4@...r.kernel.org'" <linux-ext4@...r.kernel.org>,
        "Darrick J. Wong" <darrick.wong@...cle.com>
Subject: Re: Phantom full ext4 root filesystems on 4.1 through 4.14 kernels

On Thu, Jul 11, 2019 at 01:10:46PM -0400, Theodore Ts'o wrote:
> Can you try using "df -i" when the file system looks full, and then
> reboot, and look at the results of "df -i" afterwards?

Inode usage doesn't change appreciably between the state with the "lost" space, the state after the remount workaround, and the state after a reboot.
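
For concreteness, the comparison amounts to roughly the following (device path as above; the exact sequence is a sketch rather than a transcript of what we ran):

    df -B1 /dev/sda3 ; df -i /dev/sda3   # 1) while the space is "lost"
    mount -o remount /                   # 2) the remount workaround mentioned above
    df -B1 /dev/sda3 ; df -i /dev/sda3
    # 3) the same two df invocations again after a full reboot

In every state the inode counts stay essentially flat; only the block usage moves.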

> Also interesting would be to grab a metadata-only snapshot of the file
> system when it is in its mysteriously full state, writing that
> snapshot on some other file system *other* than on /dev/sda3:
> 
>      e2image -r /dev/sda3 /mnt/sda3.e2i
> 
> Then run e2fsck on it:
> 
> e2fsck -fy /mnt/sda3.e2i
> 
> What I'm curious about is how many "orphaned inodes" are reported, and
> how much space they are taking up.  That will look like this:

<..>
Clearing orphaned inode 2360177 (uid=0, gid=0, mode=0100644, size=1035390)
Clearing orphaned inode 2360181 (uid=0, gid=0, mode=0100644, size=185522)
Clearing orphaned inode 2360176 (uid=0, gid=0, mode=0100644, size=1924512)
Clearing orphaned inode 2360180 (uid=0, gid=0, mode=0100644, size=3621978)
Clearing orphaned inode 1048838 (uid=0, gid=4, mode=0100640, size=39006841856)
release_inode_blocks: Corrupt extent header while calling ext2fs_block_iterate for inode 1048838
<..>

Of particular note, the size of ino 1048838 matches the amount of space that we "lost".
A few months ago I was poking at this with kprobes, trying to understand what was happening during the attempt to remount read-only, and noticed that it triggered hundreds of:

           <...>-78273 [000] .... 5186888.917840: ext4_remove_blocks: dev 8,3 ino 2889535 extent [0(11384832), 2048]from 0 to 2047 partial_cluster 0
           <...>-78273 [000] .... 5186888.917841: <stack trace>
 => ext4_ext_truncate
 => ext4_truncate
 => ext4_evict_inode
 => evict
 => iput
 => dentry_unlink_inode
 => __dentry_kill
 => dput.part.23
 => dput
 => SyS_rename
 => do_syscall_64
 => entry_SYSCALL_64_after_hwframe

All of them referenced the same inode numbers, whose sizes added up to the same amount of space that we had "lost" and got back; yet those inodes didn't map to any file or open file handle.
The inode numbers match here as well, and if I total up all of the ext4_remove_blocks lines, the block count matches both what fsck reports and what we "lost".
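
For reference, a capture like the one above can be reproduced with the stock ext4_remove_blocks tracepoint instead of kprobes, and the per-inode totals can be cross-checked with debugfs and lsof.  This is a sketch: the tracefs path, trace file location, and the remount trigger are illustrative rather than the exact commands from our capture.

    # enable the ext4_remove_blocks event with stack traces
    cd /sys/kernel/debug/tracing            # wherever tracefs is mounted
    echo 1 > events/ext4/ext4_remove_blocks/enable
    echo 1 > options/stacktrace
    mount -o remount,ro /                   # or whatever triggers the evictions
    cp trace /tmp/remove_blocks.trace

    # sum the extent lengths per inode from lines like
    #   "... ino 2889535 extent [0(11384832), 2048]from 0 to 2047 ..."
    awk '/ext4_remove_blocks:/ {
             for (i = 1; i <= NF; i++) {
                 if ($i == "ino")    ino = $(i + 1)
                 if ($i == "extent") { len = $(i + 2); sub(/\].*/, "", len) }
             }
             blocks[ino] += len
         }
         END { for (i in blocks) print "ino", i, blocks[i], "blocks" }' \
        /tmp/remove_blocks.trace

    # and to check whether such an inode still maps to a path or an open fd
    debugfs -R "ncheck 1048838" /dev/sda3   # pathname lookup by inode number
    lsof +L1                                # open but unlinked files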

> 
> ...
> 
> It's been theorized the bug is in overlayfs, where it's holding inodes
> open so the space isn't released.  IIRC someone had reported a
> similar problem with overlayfs on top of xfs.  (BTW, are you using
> overlayfs or aufs with your Docker setup?)
> 

Yes, we are using overlayfs and had also heard similar reports.  But the ext4 filesystem here is the root
filesystem, while all of the overlays live on a separate XFS partition.  The only interplay between our root fs
and the docker containers is the occasional read-only bind mount from the root fs into a container.
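
For what it's worth, that split can be confirmed along these lines (assumes the default /var/lib/docker location for the Docker root; adjust for your setup):

    docker info | grep -iE 'storage driver|backing filesystem'
    findmnt -T /var/lib/docker   # overlay upper/work dirs -> the XFS partition
    findmnt -T /                 # root fs -> ext4 on /dev/sda3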

Unfortunately, we've not been able to reproduce this outside of our production plant running real workloads with real data.  I did capture a metadata dump (with dirents scrambled) as Jan asked, but I suspect it will take some work to get it past our security team.  I can certainly do that if we think there is anything valuable in it, though.
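
For reference, a metadata-only dump with scrambled directory entries can be produced roughly like this (destination path is illustrative, and the -s option should be checked against the e2image man page for your e2fsprogs version):

    # metadata-only image with directory entries scrambled (-s); write it to a
    # filesystem other than /dev/sda3 itself, as noted above
    e2image -s /dev/sda3 /mnt/other/sda3.e2i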


Thanks,
Tom.
