linux-ext4 - Re: ext4 fix for interaction between i_size, fallocate, and delalloc after a crash


Message-ID: <20171130152713.aeyxhns57lhou4de@thunk.org>
Date:   Thu, 30 Nov 2017 10:27:13 -0500
From:   Theodore Ts'o <tytso@....edu>
To:     Ashlie Martinez <ashmrtn@...xas.edu>
Cc:     Amir Goldstein <amir73il@...il.com>,
        Vijay Chidambaram <vvijay03@...il.com>,
        Ext4 <linux-ext4@...r.kernel.org>
Subject: Re: ext4 fix for interaction between i_size, fallocate, and delalloc
 after a crash

On Thu, Nov 30, 2017 at 08:51:45AM -0600, Ashlie Martinez wrote:
> 
> Even though CrashMonkey *records* all the disk operations, it doesn't
> have to replay all of them when generating crash states. For example,
> it could choose to fully replay (and preserve the ordering of) all
> operations before the 3rd barrier operation in a trace with 5
> different barrier operations in it (we dub each set of operations from
> just after the previous barrier operation up to and including the next
> barrier operation a "disk write epoch" or "disk epoch").

OK, but then it won't replay any operations after the 4th barrier
operation, correct?  And in the case where you are stopping somewhere
between the 3rd and the 4th barrier, you will drop and reorder random
operations after the 3rd barrier op, but before the 4th, correct?
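
The replay model being asked about can be sketched roughly as follows.
This is only an illustration of the description in this thread, not
CrashMonkey's actual code: the Op record, split_epochs() and
crash_state() are hypothetical names, and a real trace would carry
sector numbers and data as well.

```python
import random
from typing import List, NamedTuple


class Op(NamedTuple):
    """One recorded block I/O (hypothetical, simplified record)."""
    seq: int          # position in the recorded trace
    is_barrier: bool  # True if this op acts as a barrier


def split_epochs(trace: List[Op]) -> List[List[Op]]:
    """Group ops into disk epochs; each epoch ends with its barrier op."""
    epochs, current = [], []
    for op in trace:
        current.append(op)
        if op.is_barrier:
            epochs.append(current)
            current = []
    if current:  # trailing ops that were never followed by a barrier
        epochs.append(current)
    return epochs


def crash_state(trace: List[Op], full_epochs: int,
                rng: random.Random) -> List[Op]:
    """Replay `full_epochs` epochs in order, then a randomly chosen,
    randomly ordered subset of the next epoch's non-barrier ops.
    Nothing at or after that epoch's barrier is replayed."""
    epochs = split_epochs(trace)
    replayed = [op for epoch in epochs[:full_epochs] for op in epoch]
    if full_epochs < len(epochs):
        in_flight = [op for op in epochs[full_epochs] if not op.is_barrier]
        # sample() both drops ops and returns the survivors in random order
        replayed.extend(rng.sample(in_flight, rng.randint(0, len(in_flight))))
    return replayed
```

With full_epochs=3, everything up to and including the 3rd barrier is
replayed in order, and only ops between the 3rd and 4th barriers can be
dropped or reordered.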

So what would be good is to understand where it stopped replaying
operations.  If you can get a strace -ttt of the workload, plus the
fine-grained timestamps from the block I/O trace, we can work out how
far we had gotten in the workload.  It would also help to include the
output of debugfs's "logdump -ac" command, captured before the journal
is replayed.

From a file system developer's perspective, what is most useful is
a minimal reproduction test case where CrashMonkey is only rearranging
the block I/Os of the last full "disk write epoch".  Getting the
timestamped strace logs and block I/O logs will help us do that.

Otherwise, we have no idea where to look for a potential problem, only
that it's one of the several I/O commands.
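
One way to line up the strace and block I/O logs is to merge them into
a single timeline.  The sketch below assumes plain "strace -ttt" output
(no -f pid prefix) and assumes the block I/O events have already been
parsed into (timestamp, source, text) tuples on the same clock as
strace; the tuple layout and function names are made up for
illustration, and converting block trace timestamps to wall-clock time
is not shown.

```python
import heapq
import re
from typing import Iterable, Iterator, Tuple

Event = Tuple[float, str, str]  # (timestamp in seconds, source, text)

# "strace -ttt" prefixes each line with seconds.microseconds since the epoch.
_STRACE_LINE = re.compile(r"^(\d+\.\d+)\s+(.*)$")


def parse_strace(lines: Iterable[str]) -> Iterator[Event]:
    """Yield one event per timestamped strace line; skip anything else."""
    for line in lines:
        match = _STRACE_LINE.match(line.rstrip("\n"))
        if match:
            yield (float(match.group(1)), "strace", match.group(2))


def merge_events(*streams: Iterable[Event]) -> Iterator[Event]:
    """Merge streams that are each already sorted by timestamp."""
    return heapq.merge(*streams, key=lambda event: event[0])


def events_up_to(events: Iterable[Event], cutoff: float) -> list:
    """Return the events at or before `cutoff`, e.g. the timestamp of
    the last block I/O that was replayed."""
    return [event for event in events if event[0] <= cutoff]
```

Feeding the merged timeline and the timestamp of the last replayed
block I/O to events_up_to() shows which system calls had been issued by
that point.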

Anyway, looking back at your original question, is your question why
the first write isn't delay allocated?  That's because the
collapse_range operation needs to resolve any delayed allocation
writes on the portion of the extent tree which will be affected by the
collapse_range operation.  See the calls to
filemap_write_and_wait_range() in ext4_collapse_range().

Note that if you want to try to understand what is going on, there are
a large number of ext4 and jbd2 tracepoints.  Enabling these
tracepoints (you may need to omit some of the chattier jbd2
tracepoints from the ones that you enable) and dumping those
timestamps alongside the strace -ttt and block I/O timestamps should
be especially illuminating.
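
As a rough sketch of enabling the tracepoints selectively: each event
under tracefs has its own "enable" file, so a small helper can turn on
every event in a subsystem except the ones in a skip list.  The
enable_events() helper is hypothetical, and which jbd2 events are too
chatty is left to the caller; no particular event names are assumed
here.

```python
import os
from typing import Iterable, List


def enable_events(tracefs_root: str, subsystem: str,
                  skip: Iterable[str] = ()) -> List[str]:
    """Write "1" to events/<subsystem>/<event>/enable for every event
    not named in `skip`; return the names of the events enabled."""
    skip = set(skip)
    events_dir = os.path.join(tracefs_root, "events", subsystem)
    enabled = []
    for name in sorted(os.listdir(events_dir)):
        enable_file = os.path.join(events_dir, name, "enable")
        # Plain files in the subsystem directory (such as its own
        # "enable" file) have no per-event enable file and are skipped.
        if name in skip or not os.path.isfile(enable_file):
            continue
        with open(enable_file, "w") as handle:
            handle.write("1")
        enabled.append(name)
    return enabled
```

On a typical system tracefs_root would be /sys/kernel/tracing (or
/sys/kernel/debug/tracing on older setups), and writing there requires
root.  Calling this once for "ext4" and once for "jbd2" with a skip
list covers the two subsystems mentioned above.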

Cheers,

						- Ted

P.S.  Another set of tracepoints that might be useful for
understanding when delayed write allocations are getting resolved are
the writeback tracepoints --- although you can probably infer those
from the ext4_writepages traces, since triggering writeback results in
calls to ext4_writepages.