linux-ext4 - Re: No data blocks at all in my ext4 journal


Message-ID: <20180320041231.GC21416@thunk.org>
Date:   Tue, 20 Mar 2018 00:12:31 -0400
From:   "Theodore Y. Ts'o" <tytso@....edu>
To:     Jidong Xiao <0121167@...an.edu.cn>
Cc:     Andreas Dilger <adilger@...ger.ca>, linux-ext4@...r.kernel.org,
        Jidong Xiao <jidong.xiao@...il.com>
Subject: Re: No data blocks at all in my ext4 journal

First of all, can you try upgrading to the very latest version of
e2fsprogs?  You are using a very ancient version of e2fsprogs
(1.42.13.wc5) which has also been patched for Lustre.  If you use
e2fsprogs 1.44.0, then at least we'll be testing on roughly the same
version of e2fsprogs, just in case the issue is caused by how debugfs
logdump works.
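
A quick way to confirm which e2fsprogs is actually installed, as a
minimal sketch: it assumes GNU sort (for -V) and that the first line
of "debugfs -V" looks like "debugfs 1.44.0 (7-Mar-2018)".  The
function names are made up for illustration.

```shell
# Succeeds if version $1 is >= version $2 (assumes GNU sort -V).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Print the version field from the first line of `debugfs -V`.
# Assumption: the banner looks like "debugfs 1.44.0 (7-Mar-2018)".
e2fsprogs_version() {
    debugfs -V 2>&1 | awk 'NR == 1 { print $2 }'
}

# Only run the check when debugfs is actually installed.
if command -v debugfs >/dev/null 2>&1; then
    have=$(e2fsprogs_version)
    if version_at_least "$have" 1.44.0; then
        echo "e2fsprogs $have: 1.44.0 or newer"
    else
        echo "e2fsprogs $have: older than 1.44.0"
    fi
fi
```

A vendor-suffixed string such as 1.42.13.wc5 sorts below 1.44.0 here,
so it is reported as older.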

Secondly, the file system is a really ancient one, with a very tiny
journal (32M).  These days we use a default of a much larger journal,
which is shown to provide much better performance.  (See section 4.1
of [1].)

[1] https://www.usenix.org/system/files/conference/fast17/fast17-aghayev.pdf

It looks like you are looking at a live file system, and it's possible
that a combination of a small journal, journal wrapping, and an old
version of debugfs/logdump is causing the confusion.

So the other thing I would ask is that you try to experiment on
something other than your live root file system, so you can run a more
controlled experiment.  To that end, please install kvm-xfstests or
gce-xfstests[2].  Quick start instructions for kvm-xfstests are
available at [3].

[2] https://thunk.org/gce-xfstests
[3] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md

This will allow you to run a controlled experiment, something like this:

% kvm-xfstests --kernel /build/ext4-4.4 shell
   ....
root@...-xfstests:~# mke2fs -Fq -t ext4 /dev/vdc 
/dev/vdc contains a ext4 file system
	last mounted on Mon Jan  1 10:52:47 2018
root@...-xfstests:~# mount /dev/vdc
root@...-xfstests:~# cp -r xfstests /vdc ; sync
root@...-xfstests:~# C-a x   <==== type control-A, followed by x to abort QEMU
QEMU: Terminated

% debugfs -R "logdump -ac" kvm-xfstests/disks/vdc  > /tmp/logdump.out
debugfs 1.44.0 (7-Mar-2018)
% less /tmp/logdump.out
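
Rather than paging through the whole dump, the output can be
summarized first.  This is only a sketch: it assumes logdump prints
lines of the form "Found expected sequence N, type T (descriptor
block) at block B" and "FS block N logged at journal block M", which
may differ between e2fsprogs versions.  The function name is made up.

```shell
# Count descriptor blocks, commit blocks, and logged FS blocks in
# debugfs logdump output (a file argument, or stdin if none given).
# Assumption: the line formats matched below are what logdump emits.
summarize_logdump() {
    awk '
        /\(descriptor block\)/       { desc++ }
        /\(commit block\)/           { commit++ }
        /FS block [0-9]+ logged at/  { logged++ }
        END {
            printf "descriptor=%d commit=%d logged=%d\n", desc, commit, logged
        }
    ' "$@"
}

# Summarize the dump from the session above, if it exists.
if [ -r /tmp/logdump.out ]; then
    summarize_logdump /tmp/logdump.out
fi
```

If every count comes back zero on a dump that is visibly non-empty, the
patterns do not match that version's output and need adjusting.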

This means you're using a standard test environment.  You can use a
kernel built from upstream sources (detailed instructions for doing
this can be found at [4]), and the kvm-xfstests environment uses a
standard Debian environment with a stock e2fsprogs (no random
uncontrolled patches from Red Hat Enterprise Linux, and no e2fsprogs
with random Lustre patches).  You'll also be looking at an aborted file
system, as opposed to a file system which is live and potentially
being modified in real time while you look at it with your tools.

[4] https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-xfstests.md

This will be much easier than my trying to figure out what's going on
on your system.  I am suspicious of the version of e2fsprogs, and I'm
suspicious of the fact that you are trying to examine a file system
while it is mounted and potentially being modified, etc.

I can tell you that using a standard upstream 4.4 kernel, and a
standard, unpatched, non-prehistoric version of e2fsprogs, probing a
file system which is aborted and not being modified while I look at
it, debugfs's logdump -ac shows me what I would expect.

And if a RHEL kernel had a journal with the results that you had, if
you pulled the power, and the journal was replayed, it would corrupt
the whole file system.  Since Red Hat Enterprise Linux users aren't
complaining of completely destroyed file systems after a power failure,
I *know* your results must be somehow suspect.  How, I'm not sure.
But instead of trying to debug your random environment, why don't you
try using a standard development/test environment?

Regards,

      	    	    		    	  	     - Ted