linux-kernel - Re: [PATCH block/for-linus] writeback: fix syncing of I_DIRTY

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20150820230451.GT714@dastard>
Date:	Fri, 21 Aug 2015 09:04:51 +1000
From:	Dave Chinner <david@...morbit.com>
To:	Tejun Heo <tj@...nel.org>
Cc:	Eryu Guan <eguan@...hat.com>, Jens Axboe <axboe@...nel.dk>,
	Jan Kara <jack@...e.cz>, linux-kernel@...r.kernel.org,
	xfs@....sgi.com, axboe@...com, Jan Kara <jack@...e.com>,
	linux-fsdevel@...r.kernel.org, kernel-team@...com
Subject: Re: [PATCH block/for-linus] writeback: fix syncing of I_DIRTY_TIME
 inodes

On Thu, Aug 20, 2015 at 09:55:37AM -0700, Tejun Heo wrote:
> Hello, Eryu.  Thanks a lot for the trace.
> 
> So, this is from the end of the trace from the failed test.
> 
> ...
>     kworker/u8:1-1563  [002] 22016.987530: xfs_writepage:        dev 253:6 ino 0xef64fe pgoff 0x9ff000 size 0xa00000 offset 0 length 0 delalloc 1 unwritten 0
>      kworker/2:1-49    [002] 22017.373595: xfs_setfilesize:      dev 253:6 ino 0xef6504 isize 0xa00000 disize 0x0 offset 0x0 count 10481664
> ...
> 
> Maybe I'm misunderstanding the code but all xfs_writepage() calls are
> from unbound workqueues - the writeback workers - while
> xfs_setfilesize() are from bound workqueues, so I wondered why that
> was and looked at the code and the setsize functions are run off of a
> separate work item which is queued from the end_bio callback and I
> can't tell who would be waiting for them.  Dave, what am I missing?

xfs_setfilesize runs transactions, so it can't be run from IO
completion context as it needs to block (i.e. on log space or inode
locks). It also can't block log IO completion, nor metadata Io
completion, as only log IO completion can free log space, and the
inode lock might be waiting on metadata buffer IO completion (e.g.
during delayed allocation). Hence we have multiple IO completion
workqueues to keep these things separated and deadlock free. i.e.
they all get punted to a workqueue where they are then processed in
a context that can block safely.

>     kworker/u8:1-1563  [002] 22016.987530: xfs_writepage:        dev 253:6 ino 0xef64fe pgoff 0x9ff000 size 0xa00000 offset 0 length 0 delalloc 1 unwritten 0

There will be one of these per page that is submitted to XFS. There
won't be one per page, because XFS clusters writes itself. This
trace is telling us that the page at offset 0x9ff000 was submitted,
the in-memory size of the inode at this time is 0xa00000 (i.e. this
is the last dirty page in memory) and that the it is a delayed
allocation extent (i.e. hasn't been written before).

>      kworker/2:1-49    [002] 22017.373595: xfs_setfilesize:      dev 253:6 ino 0xef6504 isize 0xa00000 disize 0x0 offset 0x0 count 10481664

There will be one of these per IO completion that extents the inode
size. This one tells us the in-memory inode size is 0xa00000, the
current on-disk inode size is 0, and the IO being completed spans
the offsets 0 to 10481664 (0x9ff000). Which means it does not
include the page submitted by the above trace, and after the setsize
transaction, isize=0xa00000 and disize=0x9ff000.

Note that these two traces are from different inodes - you need to
match traces from "ino 0xef6504" with other traces from the same
inode.

Also, note that the trace is not complete - there are many, many
missing trace events in the output....

What is interesting from the trace is that all the file size updates
have this pattern:

    kworker/2:1-49    [002] 22017.377918: xfs_setfilesize:      dev 253:6 ino 0xef64fd isize 0xa00000 disize 0x0 offset 0x0 count 10481664
    kworker/2:1-49    [002] 22017.378438: xfs_setfilesize:	dev 253:6 ino 0xef64fd isize 0xa00000 disize 0x9ff000 offset 0x9ff000 count 4096

There are two IOs being done - one for everything but the last page,
and one for the last page. This is either a result of the writeback
context limiting the number of pages per writeback slice, or the
page clustering that XFS does in xfs_vm_writepage() not quite
getting everything right (maybe an off-by-one?).

However, this doesn't appear to be a contributing factor. The 9
files that have the wrong file size at the end of the test match up
exactly with the last 9 writepage submissions and IO completions;
they happen after all the IO completions occur for all the good
files.

This implies that the sync is either not submitting all the inodes
for IO correctly or it is not waiting for all the inodes it
submitted to be marked clean. We really need the writeback control
tracepoints in the output to determine exactly what the sync was
doing when it submitted these last inodes for writeback....

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/