lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  PHC 
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:   Sat, 2 Dec 2017 14:00:23 -0600
From:   Ashlie Martinez <>
To:     "Theodore Ts'o" <>
Cc:     Amir Goldstein <>,
        Vijay Chidambaram <>,
        Ext4 <>
Subject: Re: ext4 fix for interaction between i_size, fallocate, and delalloc
 after a crash

On Thu, Nov 30, 2017 at 9:40 AM, Ashlie Martinez <> wrote:
> On Thu, Nov 30, 2017 at 9:27 AM, Theodore Ts'o <> wrote:
>> On Thu, Nov 30, 2017 at 08:51:45AM -0600, Ashlie Martinez wrote:
>>> Even though CrashMonkey *records* all the disk operations, it doesn't
>>> have to replay all of them when generating crash states. For example,
>>> it could choose to fully replay (and preserve the ordering of) all
>>> operations before the 3rd barrier operation in a trace with 5
>>> different barrier operations in it (we dub each set of operations from
>>> just after the previous barrier operation up to and including the next
>>> barrier operation a "disk write epoch" or "disk epoch").
>> OK, but then it won't replay any operations after the 4th barrier
>> opration, correct?  And in the case where you are stopping somewhere
>> between the 3rd and the 4th replay, you will drop and reorder random
>> operations after the 3rd barrier op, but before the 4th, correct?
>> So what would be good is to understand where it stopped replaying
>> operations ---- and if you can get a strace -ttt of the workload, and
>> the fine-grained timestamps from block I/O trace, so we can understand
>> how far we had gotten in the workload.  And then, to also to include
>> the output of debugfs's "logdump -ac" command before the journal is
>> replayed.

It appears that the statements I added to provide extra debugging have
changed the timing on things and now I can't seem to get the same
output. I will keep poking at this later when I have more free time
and send you anything interesting that I find. Now that I know what
format to send material in, it should make things a little easier.

>> From a file system developer's perspective, what is most useful is
>> when we can see the minimal reproduction test case where Crashmonkey
>> is only rearranging block I/O's of the last full "disk write epock".
>> And getting the timestampped strace logs and block I/O logs will help
>> us do that.
>> Otherwise, we have no idea where to look for a potential problem, only
>> that it's one of the several I/O commands.
> These are good suggestions that I will work on adding into my current
> testing flow. Right now, I have CrashMonkey dump the recorded bios to
> a file in binary (so it can be hexdumped) and also record the bios
> that were used to generate each crash state it tests.
>> Anyway, looking back at your original question, is your question why
>> the first write isn't delay allocated?  That's because the
>> collapse_range operation needs to resolve any delayed allocation
>> writes on the portion of the extent tree which will be affected by the
>> collapse_range operation.  See the calls to
>> filemap_write_and_wait_range() in ext4_collapse_range().
> I also ran a few more tests on this, it appears that when fsx does
> mapwrite operations it calls msync after mmap (for obvious reasons).
> Adding the mmap/msync operations to the end of my CrashMonkey workload
> produced output similar to Amir's reported output, but with slightly
> different values for i_blocks. There may be another interaction there
> that we hadn't thought of based on this information. The top fsck
> output below is from CrashMonkey with mmap/msync at the end of the
> workload. The fsck output below that is from Amir's original post
> about the bug. Note that my test is running on a much smaller file
> system than Amir's, so the logical/physical blocks will differ.
> e2fsck 1.42.13 (17-May-2015)
> Pass 1: Checking inodes, blocks, and sizes
> Inode 12, end of extent exceeds allowed value
>   (logical block 131, physical block 8465, len 11)
> Clear? yes
> Inode 12, i_blocks is 168, should be 146.  Fix? yes
> e2fsck 1.42.13 (17-May-2015)
> Pass 1: Checking inodes, blocks, and sizes
> Inode 12, end of extent exceeds allowed value
>        (logical block 33, physical block 33441, len 7)
> Clear? no
> Inode 12, i_blocks is 184, should be 128.  Fix? no
>> Note that if you want to try to understand what is going on, there are
>> a large number of ext4 and jbd2 tracepoints.  Enabling these
>> tracepoints (you may need to omit some of the much more chatty jbd2
>> trace points from the ones that you enable) and dumping those
>> timestamps alongside the strace -ttt and block I/O timestamps should
>> be especially illuminating.
>> Cheers,
>>                                                 - Ted
>> P.S.  Another set of tracepoints that might be useful for
>> understanding when delayed write allocations are getting resolved are
>> the writeback tracepoints --- although you can probably infer those
>> from the ext4_writepages traces, since when the writeback is triggered
>> this will trigger calls to ext4_writepages.

Powered by blists - more mailing lists