linux-ext4 - Re: [PATCH] ext4: fix interaction between i_size, fallocate, and delalloc after a crash

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [day] [month] [year] [list]
Message-ID: <CAOQ4uxhjY6yw2szqP0gn3uzUgDeDe6a0HUR-4oNFZuyLZ40R0w@mail.gmail.com>
Date:   Tue, 17 Oct 2017 10:15:14 +0300
From:   Amir Goldstein <amir73il@...il.com>
To:     Vijay Chidambaram <vvijay03@...il.com>
Cc:     "Theodore Ts'o" <tytso@....edu>,
        Ashlie Martinez <ashmrtn@...xas.edu>,
        Eryu Guan <eguan@...hat.com>,
        Ext4 <linux-ext4@...r.kernel.org>, Josef Bacik <jbacik@...com>,
        Xiao Yang <yangx.jy@...fujitsu.com>,
        fstests <fstests@...r.kernel.org>
Subject: Re: [PATCH] ext4: fix interaction between i_size, fallocate, and
 delalloc after a crash

On Tue, Oct 17, 2017 at 3:43 AM, Vijay Chidambaram <vvijay03@...il.com> wrote:
> Hi Ted,
>
> This was what we were suspecting as well. If the bug being exposed or not
> only depends on the number of transactions in the journal before the test
> workload starts, that’s something we could easily capture in CrashMonkey. It
> would be one more parameter to tweak, and we would potentially have to get
> multiple traces of the test workload with the journal being X percent full.
>
> It does expand our already-large search space, but our first order of
> business is making sure CrashMonkey can reproduce every crash-consistency
> bug reported in recent times (mostly by Amir :) ). So for now we were just
> analyzing the bug and trying to understand it, but it looks like the
> post-recovery image is not very useful for this.
>
> On Mon, Oct 16, 2017 at 7:09 PM Theodore Ts'o <tytso@....edu> wrote:
>>
>> On Tue, Oct 17, 2017 at 12:11:40AM +0300, Amir Goldstein wrote:
>> >
>> > The disk image SHOULD reflect a state on a disk after the power was
>> > cut in the middle of mounted fs. Then power came back on, filesystem
>> > was mounted, journal recovered, then filesystem was cleanly unmounted.
>> > At this stage, I don't expect there should be anything interesting in
>> > the
>> > journal.
>>
>> I suspect what Ashlie was hoping for was a file system image *before*
>> the file system was remounted and the journal replayed (and then
>> truncated).  That would allow for an analysis of image right after the
>> simulated power cut, so it could be seen what was in the journal.
>>
>> The only way to get that is to modify the test so that it aborts
>> before the file system is remounted.  I did some investigations where
>> I ran commands (such as "debugfs -R "logdump -ac /dev/vdc") before the
>> file system was remounted to gather debugging information.  That's how
>> I tracked down the problem.  Unfortunately I never bothered to grab
>> full file system snapshot, so I can't give Ashlie what she's hoping
>> for.
>>

I can provide the desired file system snapshot after crash, but I will do more
than that. I will provide an io recording of the test and will explain how to
how to replay the recording to any given point before the crash.

To produce the recording, I copied this hunk from test 456:
fsxops=$tmp.fsxops
cat <<EOF > $fsxops
write 0x137dd 0xdc69 0x0
fallocate 0xb531 0xb5ad 0x21446
collapse_range 0x1c000 0x4000 0x21446
write 0x3e5ec 0x1a14 0x21446
zero_range 0x20fac 0x6d9c 0x40000 keep_size
mapwrite 0x216ad 0x274f 0x40000
EOF
run_check $here/ltp/fsx -d --replay-ops $fsxops $SCRATCH_MNT/testfile

to test 455 before line NUM_FILES=4 and changed the line to
NUM_FILES=0 to suppress the randomized fsx runs.

Now test 455 fails and leaves us a recording of all io on the
$LOGWRITES_DEV device.

As we can see in 455.full, the consistency check fails on "pre umount"
checkpoint, which test 455 has marked in the log device as mark "last":

  checking pre umount ("_log_writes_replay_log last" in the test)
  _check_generic_filesystem: filesystem on /dev/mapper/ssd-scratch is
inconsistent

To export the recording from my log device I ran:
# ./src/log-writes/replay-log --log $LOGWRITES_DEV --find --end-mark last
17098@...852

Meaning that the "last" mark is record #17098 at sector offset 153852
on the log device, so I can export content of log up to and including
this offset:
# dd if=$LOGWRITES_DEV of=ext4-crash.log bs=512 count=153853

The exported log is 76M compressed to 34K and attached to this message.

To replay this log on your device and reproduce the fsck error, you need
a $SCRATCH_DEV that is at least 10G and run:

# bunzip2 ext4-crash.log.bz2
# ./src/log-writes/replay-log --log ext4-crash.log --replay
$SCRATCH_DEV --end-mark last

Now you can investigate state of file system right after the crash.
More importantly, you can use the --check [<number>|flush|fua] command line
options to replay-log to run an arbitrary check utility on replay device at any
numbers of checkpoints before the crash.
You can use --limit and --start-entry to fast forward to a given point
in recording
and replay with consistency checks from that point on, and so forth.

Finally, see that replaying the recording to "last" reproduces the problem:
# mount $SCRATCH_DEV $SCRATCH_MNT
# umount $SCRATCH_MNT
# e2fsck -nf $SCRATCH_DEV

Hope this helps.
I guess I could have saved Ted some work had I provided this guide
and log sooner, but I thought the bug may be trivial enough to understand
with just the dm-flakey reproducer -  I was wrong about that.

Cheers,
Amir.

Download attachment "ext4-crash.log.bz2" of type "application/x-bzip2" (34309 bytes)