linux-ext4 - Re: Lockup in wait_transaction_locked under memory pressure

Message-ID: <20150630225851.GK7943@dastard>
Date:	Wed, 1 Jul 2015 08:58:51 +1000
From:	Dave Chinner <david@...morbit.com>
To:	Michal Hocko <mhocko@...e.cz>
Cc:	Nikolay Borisov <kernel@...p.com>, Theodore Ts'o <tytso@....edu>,
	linux-ext4@...r.kernel.org, Marian Marinov <mm@...com>
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Tue, Jun 30, 2015 at 04:31:58PM +0200, Michal Hocko wrote:
> On Tue 30-06-15 14:30:33, Michal Hocko wrote:
> > On Tue 30-06-15 11:52:06, Dave Chinner wrote:
> > > On Mon, Jun 29, 2015 at 11:36:40AM +0200, Michal Hocko wrote:
> > > > On Mon 29-06-15 12:01:49, Nikolay Borisov wrote:
> > > > > Today I observed the issue again, this time on a different server. What
> > > > > is particularly strange is the fact that the OOM wasn't triggered for
> > > > > the cgroup, whose tasks have entered D state. There were a couple of
> > > > > SSHD processes and an RSYNC performing some backup tasks. Here is what
> > > > > the stacktrace for the rsync looks like:
> > > > > 
> > > > > crash> set 18308
> > > > >     PID: 18308
> > > > > COMMAND: "rsync"
> > > > >    TASK: ffff883d7c9b0a30  [THREAD_INFO: ffff881773748000]
> > > > >     CPU: 1
> > > > >   STATE: TASK_UNINTERRUPTIBLE
> > > > > crash> bt
> > > > > PID: 18308  TASK: ffff883d7c9b0a30  CPU: 1   COMMAND: "rsync"
> > > > >  #0 [ffff88177374ac60] __schedule at ffffffff815ab152
> > > > >  #1 [ffff88177374acb0] schedule at ffffffff815ab76e
> > > > >  #2 [ffff88177374acd0] schedule_timeout at ffffffff815ae5e5
> > > > >  #3 [ffff88177374ad70] io_schedule_timeout at ffffffff815aad6a
> > > > >  #4 [ffff88177374ada0] bit_wait_io at ffffffff815abfc6
> > > > >  #5 [ffff88177374adb0] __wait_on_bit at ffffffff815abda5
> > > > >  #6 [ffff88177374ae00] wait_on_page_bit at ffffffff8111fd4f
> > > > >  #7 [ffff88177374ae50] shrink_page_list at ffffffff81135445
> > > > 
> > > > This is most probably wait_on_page_writeback because the reclaim has
> > > > encountered a dirty page which is under writeback currently.
> > > 
> > > Yes, and look at the caller path....
> > > 
> > > > >  #8 [ffff88177374af50] shrink_inactive_list at ffffffff81135845
> > > > >  #9 [ffff88177374b060] shrink_lruvec at ffffffff81135ead
> > > > > #10 [ffff88177374b150] shrink_zone at ffffffff811360c3
> > > > > #11 [ffff88177374b220] shrink_zones at ffffffff81136eff
> > > > > #12 [ffff88177374b2a0] do_try_to_free_pages at ffffffff8113712f
> > > > > #13 [ffff88177374b300] try_to_free_mem_cgroup_pages at ffffffff811372be
> > > > > #14 [ffff88177374b380] try_charge at ffffffff81189423
> > > > > #15 [ffff88177374b430] mem_cgroup_try_charge at ffffffff8118c6f5
> > > > > #16 [ffff88177374b470] __add_to_page_cache_locked at ffffffff8112137d
> > > > > #17 [ffff88177374b4e0] add_to_page_cache_lru at ffffffff81121618
> > > > > #18 [ffff88177374b510] pagecache_get_page at ffffffff8112170b
> > > > > #19 [ffff88177374b560] grow_dev_page at ffffffff811c8297
> > > > > #20 [ffff88177374b5c0] __getblk_slow at ffffffff811c91d6
> > > > > #21 [ffff88177374b600] __getblk_gfp at ffffffff811c92c1
> > > > > #22 [ffff88177374b630] ext4_ext_grow_indepth at ffffffff8124565c
> > > > > #23 [ffff88177374b690] ext4_ext_create_new_leaf at ffffffff81246ca8
> > > > > #24 [ffff88177374b6e0] ext4_ext_insert_extent at ffffffff81246f09
> > > > > #25 [ffff88177374b750] ext4_ext_map_blocks at ffffffff8124a848
> > > > > #26 [ffff88177374b870] ext4_map_blocks at ffffffff8121a5b7
> > > > > #27 [ffff88177374b910] mpage_map_one_extent at ffffffff8121b1fa
> > > > > #28 [ffff88177374b950] mpage_map_and_submit_extent at ffffffff8121f07b
> > > > > #29 [ffff88177374b9b0] ext4_writepages at ffffffff8121f6d5
> > > > > #30 [ffff88177374bb20] do_writepages at ffffffff8112c490
> > > > > #31 [ffff88177374bb30] __filemap_fdatawrite_range at ffffffff81120199
> > > > > #32 [ffff88177374bb80] filemap_flush at ffffffff8112041c
> > > 
> > > That's a potential self deadlocking path, isn't it? i.e. the
> > > writeback path has been entered, may hold pages locked in the
> > > current bio being built (waiting for submission), then memory
> > > reclaim has been entered while trying to map more contiguous blocks
> > > to submit, and that waits on page IO to complete on a page in a bio
> > > that ext4 hasn't yet submitted?
> > 
> > I am not sure I understand. Pages are marked writeback in
> > ext4_bio_write_page after all of this has been done already and then
> > the IO is submitted and the reclaim shouldn't block it. Or am I missing
> > something?
> 
> Thanks to Jan Kara for the off-list clarification. I misunderstood the
> code. You are right, ext4 is really deadlock prone. The heuristic in the
> reclaim code assumes that waiting on page_writeback is guaranteed to
> make progress (from memcg POV) and that is not true for ext4 as it

*blink*

/me re-reads again

That assumption is fundamentally broken. Filesystems use GFP_NOFS
because the filesystem holds resources that can prevent memory
reclaim from making forward progress if it re-enters the filesystem or
blocks on anything filesystem related. memcg does not change that,
and I'm kinda scared to learn that memcg plays fast and loose like
this.

For example: IO completion might require unwritten extent conversion,
which executes filesystem transactions and GFP_NOFS allocations. The
writeback flag on the pages cannot be cleared until unwritten
extent conversion completes. Hence memory reclaim cannot wait on
page writeback to complete in GFP_NOFS context because it is not
safe to do so, memcg reclaim or otherwise.
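
The rule argued above can be sketched as a small user-space model. This
is an illustration only, not kernel code: the MODEL_* flag names, their
bit values and model_handle_writeback_page() are hypothetical and exist
only for this example. The one thing it encodes is that reclaim may
block on a page under writeback only when the allocation context allows
re-entering the filesystem.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits for this model only; not the kernel's values. */
enum {
	MODEL_GFP_IO = 1u << 0,	/* caller lets reclaim start block IO */
	MODEL_GFP_FS = 1u << 1,	/* caller lets reclaim re-enter the filesystem */
};

/* A NOFS-style context permits IO but not filesystem re-entry. */
#define MODEL_GFP_NOFS   (MODEL_GFP_IO)
#define MODEL_GFP_KERNEL (MODEL_GFP_IO | MODEL_GFP_FS)

enum writeback_action { SKIP_PAGE, WAIT_FOR_WRITEBACK };

/*
 * Decide what reclaim does with a page that is under writeback.
 * Clearing the writeback flag may need filesystem work (for example a
 * transaction at IO completion), so a caller that forbids filesystem
 * re-entry must skip the page instead of sleeping on it.
 */
static enum writeback_action
model_handle_writeback_page(unsigned int gfp_mask)
{
	bool may_enter_fs = (gfp_mask & MODEL_GFP_FS) != 0;

	return may_enter_fs ? WAIT_FOR_WRITEBACK : SKIP_PAGE;
}
```

In this model a NOFS-style caller always gets SKIP_PAGE, regardless of
whether the reclaim was triggered by a memcg limit or globally.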

> really charges after set_page_writeback (called from ext4_bio_write_page)
> and before the page is really submitted (when the bio is full or
> explicitly via ext4_io_submit). I thought that io_submit_add_bh submits
> the page but it doesn't do that necessarily.

XFS does exactly the same thing - the underlying algorithm ext4 uses
to build large bios efficiently was copied from XFS. And FWIW XFS has
been using this algorithm since 2.6.15....
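
The window being discussed can be illustrated with a toy user-space
model. Everything here is an assumption made for the example: the
model_* names, the batch capacity and the in_flight field are
hypothetical and do not correspond to ext4 or XFS structures. It only
shows the ordering described in this thread: a page is marked writeback
when it is added to a pending batch, and the batch is submitted later,
when it fills up or is flushed explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BIO_MAX_PAGES 4	/* arbitrary capacity for illustration */

struct model_page {
	bool writeback;	/* set when the page is queued for IO */
	bool in_flight;	/* set only once the containing batch is submitted */
};

struct model_io_batch {
	struct model_page *pages[MODEL_BIO_MAX_PAGES];
	size_t nr_pages;
};

/* Submit the pending batch: only now can the IO start and later complete. */
static void model_submit(struct model_io_batch *b)
{
	for (size_t i = 0; i < b->nr_pages; i++)
		b->pages[i]->in_flight = true;
	b->nr_pages = 0;
}

/* Queue a page; the writeback flag is set before any submission happens. */
static void model_add_page(struct model_io_batch *b, struct model_page *p)
{
	if (b->nr_pages == MODEL_BIO_MAX_PAGES)
		model_submit(b);	/* batch full: push it out first */
	p->writeback = true;
	b->pages[b->nr_pages++] = p;
}

/* A waiter on the writeback flag can only be woken if IO was submitted. */
static bool model_wait_can_complete(const struct model_page *p)
{
	return !p->writeback || p->in_flight;
}
```

If the task that owns the batch blocks in reclaim between
model_add_page() and model_submit(), and reclaim waits on one of the
queued pages, model_wait_can_complete() stays false forever: that is
the self-deadlock in miniature.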

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com