[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAB0TPYGCOZmixbzrV80132X=V5TcyQwD6V7x-8PKg_BqCva8Og@mail.gmail.com>
Date: Fri, 22 May 2020 11:57:42 +0200
From: Martijn Coenen <maco@...roid.com>
To: Al Viro <viro@...iv.linux.org.uk>, Jan Kara <jack@...e.cz>,
Jens Axboe <axboe@...nel.dk>, miklos@...redi.hu, tj@...nel.org
Cc: linux-fsdevel@...r.kernel.org, LKML <linux-kernel@...r.kernel.org>,
android-storage-core@...gle.com, kernel-team@...roid.com
Subject: Writeback bug causing writeback stalls
Hi,
We've been working on implementing a FUSE filesystem in Android, and have
run into what appears to be a bug in the kernel writeback code. The problem
that we observed is that an inode in the filesystem is on the b_dirty_time list,
but its i_state field does have I_DIRTY_PAGES set, which I think means that
the inode is on the wrong list. This condition doesn't appear to fix itself even
as new pages are dirtied, because __mark_inode_dirty() has this check:
if ((inode->i_state & flags) != flags) {
before considering moving the inode to another list. Since the inode already
has I_DIRTY_PAGES set, we're not going to move it to the dirty list. I *think*
the only way the inode gets out of this condition is whenever we handle the
b_dirty_time list, which can take a while.
The reason we even noticed this bug in the first place is that FUSE has a very
low wb max_ratio by default (1), and so at some point processes get stuck in
balance_dirty_pages_ratelimited(), waiting for pages to be written. They hold
the inode's write lock, and when other processes try to acquire it, they get
stuck. We have a watchdog that reboots the device after ~10 mins of a task
being stuck in D state, and this triggered regularly in some tests.
After careful studying of the kernel code, I found a reliable repro scenario
for this condition, which is described in more detail below. But essentially
what I think is happening is this:
__mark_inode_dirty() has an early return condition for when a sync is in
progress, where it updates the inode i_state but not the writeback list:
inode->i_state |= flags;
/*
* If the inode is being synced, just update its dirty state.
* The unlocker will place the inode on the appropriate
* superblock list, based upon its state.
*/
if (inode->i_state & I_SYNC)
goto out_unlock_inode;
now this comment is true for the generic flusher threads, which run
writeback_sb_inodes(), which calls requeue_inode() to move the inode back to the
correct wb list when the sync is done. However, there is another
function that uses
I_SYNC: writeback_single_inode(). This function has some comments saying it
prefers not to touch writeback lists, and in fact only removes it if the inode
is clean:
/*
* Skip inode if it is clean and we have no outstanding writeback in
* WB_SYNC_ALL mode. We don't want to mess with writeback lists in this
* function since flusher thread may be doing for example sync in
* parallel and if we move the inode, it could get skipped. So here we
* make sure inode is on some writeback list and leave it there unless
* we have completely cleaned the inode.
*/
writeback_single_inode() is called from a few functions, in particular
write_inode_now(). write_inode_now() is called by FUSE's flush() f_ops.
So, the sequence of events is something like this. Let's assume the inode is
already on b_dirty_time for valid reasons. Then:
CPU1 CPU2
fuse_flush()
write_inode_now()
writeback_single_inode()
sets I_SYNC
__writeback_single_inode()
writes back data
clears inode dirty flags
unlocks inode
calls mark_inode_dirty_sync()
sets I_DIRTY_SYNC, but doesn't
update wb list because I_SYNC is
still set
write() // somebody else writes
mark_inode_dirty(I_DIRTY_PAGES)
sets I_DIRTY_PAGES on i_state
doesn't update wb list,
because I_SYNC set
locks inode again
sees inode is still dirty,
doesn't touch WB list
clears I_SYNC
So now we have an inode on b_dirty_time with I_DIRTY_PAGES | I_DIRTY_SYNC set,
and subsequent calls to mark_inode_dirty() with either I_DIRTY_PAGES or
I_DIRTY_SYNC will do nothing to change that. The flusher won't touch
the inode either,
because it's not on a b_dirty or b_io list.
The easiest way to fix this, I think, is to call requeue_inode() at the end of
writeback_single_inode(), much like it is called from writeback_sb_inodes().
However, requeue_inode() has the following ominous warning:
/*
* Find proper writeback list for the inode depending on its current state and
* possibly also change of its state while we were doing writeback. Here we
* handle things such as livelock prevention or fairness of writeback among
* inodes. This function can be called only by flusher thread - noone else
* processes all inodes in writeback lists and requeueing inodes behind flusher
* thread's back can have unexpected consequences.
*/
Obviously this is very critical code both from a correctness and a performance
point of view, so I wanted to run this by the maintainers and folks who have
contributed to this code first.
The way I got to reproduce this reliably was by using what is pretty much a
pass-through FUSE filesystem, and the following two commands run in parallel:
[1] fio --rw=write --size=5G -blocksize=80000 --name=test --directory=/sdcard/
[2] while true; do echo flushme >> /sdcard/test.0.0; sleep 0.1; done
I doubt the blocksize matters, it's just the blocksize that I observed being
used in one of our testruns that hit this. [2] essentially calls fuse_flush()
every 100ms, which is often enough to reproduce this problem within seconds;
FIO will stall and enter balance_dirty_pages_ratelimited(), and [2] will hang
because it needs the inode write lock.
Other filesystems may hit the same problem, though write_inode_now() is usually
only used when no more dirty pages are expected (eg in final iput()). There are
some other functions that call writeback_single_inode() that are more
widely used,
like sync_inode() and sync_inode_metadata().
Curious to hear your thoughts on this. I'm happy to provide more info
or traces if
needed.
Thanks,
Martijn
Powered by blists - more mailing lists