Date:	Tue, 2 Jul 2013 10:19:37 +0200
From:	Jan Kara <jack@...e.cz>
To:	Dave Chinner <david@...morbit.com>
Cc:	Jan Kara <jack@...e.cz>, Dave Jones <davej@...hat.com>,
	Oleg Nesterov <oleg@...hat.com>,
	"Paul E. McKenney" <paulmck@...ux.vnet.ibm.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>,
	Linus Torvalds <torvalds@...ux-foundation.org>,
	"Eric W. Biederman" <ebiederm@...ssion.com>,
	Andrey Vagin <avagin@...nvz.org>,
	Steven Rostedt <rostedt@...dmis.org>
Subject: Re: frequent softlockups with 3.10rc6.

On Tue 02-07-13 16:29:54, Dave Chinner wrote:
> > > We could, but we just end up in the same place with sync as we are
> > > now - with a long list of clean inodes with a few inodes hidden in
> > > it that are under IO. i.e. we still have to walk lots of clean
> > > inodes to find the dirty ones that we need to wait on....
> >   If the syncs are rare then yes. If they are relatively frequent, you
> > would win because the first sync will clean up the list and subsequent
> > ones will be fast.
> 
> I haven't done this yet, because I've found an interesting
> performance problem with our sync implementation. Basically, sync(2)
> on a filesystem that is being constantly dirtied blocks the flusher
> thread waiting for IO completion like so:
> 
> # echo w > /proc/sysrq-trigger 
> [ 1968.031001] SysRq : Show Blocked State
> [ 1968.032748]   task                        PC stack   pid father
> [ 1968.034534] kworker/u19:2   D ffff8800bed13140  3448  4830      2 0x00000000
> [ 1968.034534] Workqueue: writeback bdi_writeback_workfn (flush-253:32)
> [ 1968.034534]  ffff8800bdca3998 0000000000000046 ffff8800bd1cae20 ffff8800bdca3fd8
> [ 1968.034534]  ffff8800bdca3fd8 ffff8800bdca3fd8 ffff88003ea10000 ffff8800bd1cae20
> [ 1968.034534]  ffff8800bdca3968 ffff8800bd1cae20 ffff8800bed139a0 0000000000000002
> [ 1968.034534] Call Trace:
> [ 1968.034534]  [<ffffffff81bff7c9>] schedule+0x29/0x70
> [ 1968.034534]  [<ffffffff81bff89f>] io_schedule+0x8f/0xd0
> [ 1968.034534]  [<ffffffff8113263e>] sleep_on_page+0xe/0x20
> [ 1968.034534]  [<ffffffff81bfd030>] __wait_on_bit+0x60/0x90
> [ 1968.034534]  [<ffffffff81132770>] wait_on_page_bit+0x80/0x90
> [ 1968.034534]  [<ffffffff81132881>] filemap_fdatawait_range+0x101/0x190
> [ 1968.034534]  [<ffffffff81132937>] filemap_fdatawait+0x27/0x30
> [ 1968.034534]  [<ffffffff811a7f18>] __writeback_single_inode+0x1b8/0x220
> [ 1968.034534]  [<ffffffff811a88ab>] writeback_sb_inodes+0x27b/0x410
> [ 1968.034534]  [<ffffffff811a8c00>] wb_writeback+0xf0/0x2c0
> [ 1968.034534]  [<ffffffff811aa668>] wb_do_writeback+0xb8/0x210
> [ 1968.034534]  [<ffffffff811aa832>] bdi_writeback_workfn+0x72/0x160
> [ 1968.034534]  [<ffffffff8109e487>] process_one_work+0x177/0x400
> [ 1968.034534]  [<ffffffff8109eb82>] worker_thread+0x122/0x380
> [ 1968.034534]  [<ffffffff810a5508>] kthread+0xd8/0xe0
> [ 1968.034534]  [<ffffffff81c091ec>] ret_from_fork+0x7c/0xb0
> 
> i.e. this code:
> 
> static int
> __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
> {
>         struct address_space *mapping = inode->i_mapping;
>         long nr_to_write = wbc->nr_to_write;
>         unsigned dirty;
>         int ret;
> 
>         WARN_ON(!(inode->i_state & I_SYNC));
> 
>         trace_writeback_single_inode_start(inode, wbc, nr_to_write);
> 
>         ret = do_writepages(mapping, wbc);
> 
>         /*
>          * Make sure to wait on the data before writing out the metadata.
>          * This is important for filesystems that modify metadata on data
>          * I/O completion.
>          */
>         if (wbc->sync_mode == WB_SYNC_ALL) {
>                 int err = filemap_fdatawait(mapping);
>                 if (ret == 0)
>                         ret = err;
>         }
> ....
> 
> It's completely serialising IO dispatch during sync. We are not
> batching IO submission at all - we are dispatching it one inode at a
> time and then waiting for it to complete.  Guess where in the
> benchmark run I ran sync:
> 
> FSUse%        Count         Size    Files/sec     App Overhead
> .....
>      0       640000         4096      35154.6          1026984
>      0       720000         4096      36740.3          1023844
>      0       800000         4096      36184.6           916599
>      0       880000         4096       1282.7          1054367
>      0       960000         4096       3951.3           918773
>      0      1040000         4096      40646.2           996448
>      0      1120000         4096      43610.1           895647
>      0      1200000         4096      40333.1           921048
> 
> sync absolutely *murders* asynchronous IO performance right now
> because it stops background writeback completely and stalls all new
> writes in balance_dirty_pages like:
> 
> [ 1968.034534] fs_mark         D ffff88007ed13140  3680  9219   7127 0x00000000
> [ 1968.034534]  ffff88005a279a38 0000000000000046 ffff880040318000 ffff88005a279fd8
> [ 1968.034534]  ffff88005a279fd8 ffff88005a279fd8 ffff88003e9fdc40 ffff880040318000
> [ 1968.034534]  ffff88005a279a28 ffff88005a279a70 ffff88007e9e0000 0000000100065d20
> [ 1968.034534] Call Trace:
> [ 1968.034534]  [<ffffffff81bff7c9>] schedule+0x29/0x70
> [ 1968.034534]  [<ffffffff81bfcd3b>] schedule_timeout+0x10b/0x1f0
> [ 1968.034534]  [<ffffffff81bfe492>] io_schedule_timeout+0xa2/0x100
> [ 1968.034534]  [<ffffffff8113d6fb>] balance_dirty_pages_ratelimited+0x37b/0x7a0
> [ 1968.034534]  [<ffffffff811322e8>] generic_file_buffered_write+0x1b8/0x280
> [ 1968.034534]  [<ffffffff8144e649>] xfs_file_buffered_aio_write+0x109/0x1a0
> [ 1968.034534]  [<ffffffff8144e7ae>] xfs_file_aio_write+0xce/0x140
> [ 1968.034534]  [<ffffffff8117f4b0>] do_sync_write+0x80/0xb0
> [ 1968.034534]  [<ffffffff811801c1>] vfs_write+0xc1/0x1c0
> [ 1968.034534]  [<ffffffff81180642>] SyS_write+0x52/0xa0
> [ 1968.034534]  [<ffffffff81c09299>] system_call_fastpath+0x16/0x1b
> 
> IOWs, blocking the flusher thread for IO completion on WB_SYNC_ALL
> writeback is very harmful. Given that we rely on ->sync_fs to
> guarantee all inode metadata is written back - the async pass up
> front doesn't do any waiting so any inode metadata updates done
> after IO completion have to be caught by ->sync_fs - why are we
> doing IO completion waiting here for sync(2) writeback?
  So I did a bit of digging in history and the wait in
__writeback_single_inode() (at that time it was just
writeback_single_inode()) was introduced by Christoph in commit
26821ed40. It is there for calls like sync_inode() or write_inode_now(),
where it really is necessary.

You are right that for syncing the whole filesystem, like sync(2) does, the
wait in __writeback_single_inode() isn't necessary. After all, the inode
data might have been written back asynchronously, so we may never even see
the inode in __writeback_single_inode() with sync_mode == WB_SYNC_ALL, and
each filesystem has to make sure inode metadata is correctly on disk
anyway. So removing that wait for global sync isn't going to introduce any
new problem.

Now how to implement that practically: __writeback_single_inode() is
called from two places. From writeback_single_inode(), which is the
function for writing only a single inode, and from writeback_sb_inodes(),
which is used by the flusher threads. I'd be inclined to move the
do_writepages() call from __writeback_single_inode() to the call sites and
move the wait into writeback_single_inode() only. But that would also mean
moving the tracepoints, and it starts to get a bit ugly. Maybe we could
instead create a new flag in struct writeback_control to indicate that we
are doing a global sync - a for_sync flag? Then we could use that flag in
__writeback_single_inode() to decide whether to wait or not.

As a bonus, filesystems could also optimize their write_inode() methods
when they know ->sync_fs() is going to happen in the future. E.g. ext4
wouldn't have to do the stupid ext4_force_commit() after each written
inode in WB_SYNC_ALL mode.
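Concretely, ext4_write_inode() could then check the new flag before
forcing a journal commit; a rough, untested sketch (the for_sync field is
still hypothetical here, and the elided parts stand for the existing
function body):

```c
/* Untested sketch, assuming the hypothetical wbc->for_sync flag:
 * only force a per-inode journal commit for single-inode WB_SYNC_ALL
 * callers; while sync(2) is running, ->sync_fs() will commit the
 * journal once at the end anyway. */
static int ext4_write_inode(struct inode *inode,
			    struct writeback_control *wbc)
{
	...
	if (wbc->sync_mode == WB_SYNC_ALL && !wbc->for_sync)
		return ext4_force_commit(inode->i_sb);
	...
}
```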

								Honza
-- 
Jan Kara <jack@...e.cz>
SUSE Labs, CR