Date:	Mon, 8 Jul 2013 14:53:52 +0200
From:	Michal Hocko <mhocko@...e.cz>
To:	Dave Chinner <david@...morbit.com>
Cc:	Glauber Costa <glommer@...il.com>,
	Andrew Morton <akpm@...ux-foundation.org>, linux-mm@...ck.org,
	LKML <linux-kernel@...r.kernel.org>
Subject: Re: linux-next: slab shrinkers: BUG at mm/list_lru.c:92

On Thu 04-07-13 18:36:43, Michal Hocko wrote:
> On Wed 03-07-13 21:24:03, Dave Chinner wrote:
> > On Tue, Jul 02, 2013 at 02:44:27PM +0200, Michal Hocko wrote:
> > > On Tue 02-07-13 22:19:47, Dave Chinner wrote:
> > > [...]
> > > > Ok, so it's been leaked from a dispose list somehow. Thanks for the
> > > > info, Michal, it's time to go look at the code....
> > > 
> > > OK, just in case we will need it, I am keeping the machine in this state
> > > for now. So we still can play with crash and check all the juicy
> > > internals.
> > 
> > My current suspect is the LRU_RETRY code. I don't think what it is
> > doing is at all valid - list_for_each_safe() is not safe if you drop
> > the lock that protects the list. i.e. there is nothing that protects
> > the stored next pointer from being removed from the list by someone
> > else. Hence what I think is occurring is this:
> > 
> > 
> > thread 1			thread 2
> > lock(lru)
> > list_for_each_safe(lru)		lock(lru)
> >   isolate			......
> >     lock(i_lock)
> >     has buffers
> >       __iget
> >       unlock(i_lock)
> >       unlock(lru)
> >       .....			(gets lru lock)
> >       				list_for_each_safe(lru)
> > 				  walks all the inodes
> > 				  finds inode being isolated by other thread
> > 				  isolate
> > 				    i_count > 0
> > 				      list_del_init(i_lru)
> > 				      return LRU_REMOVED;
> > 				   moves to next inode, inode that
> > 				   other thread has stored as next
> > 				   isolate
> > 				     i_state |= I_FREEING
> > 				     list_move(dispose_list)
> > 				     return LRU_REMOVED
> > 				 ....
> > 				 unlock(lru)
> >       lock(lru)
> >       return LRU_RETRY;
> >   if (!first_pass)
> >     ....
> >   --nr_to_scan
> >   (loop again using next, which has already been removed from the
> >   LRU by the other thread!)
> >   isolate
> >     lock(i_lock)
> >     if (i_state & ~I_REFERENCED)
> >       list_del_init(i_lru)	<<<<< inode is on dispose list!
> > 				<<<<< inode is now isolated, with I_FREEING set
> >       return LRU_REMOVED;
> > 
> > That fits the corpse left on your machine, Michal. One thread has
> > moved the inode to a dispose list, the other thread thinks it is
> > still on the LRU and should be removed, and removes it.
> > 
> > This also explains the lru item count going negative - the same item
> > is being removed from the lru twice. So it seems like all the
> > problems you've been seeing are caused by this one problem....
> > 
> > Patch below that should fix this.
> 
> Good news! The test was running since morning and it didn't hang nor
> crashed. So this really looks like the right fix. It will run also
> during weekend to be 100% sure. But I guess it is safe to say

Hmm, it seems I was too optimistic, or we have yet another issue here (I
guess the latter is more probable).

The weekend testing got stuck as well. 

The dmesg shows there were some hung tasks:
[275284.264312] start.sh (11025): dropped kernel caches: 3
[276962.652076] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
[276962.652087] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[276962.652093] xfs-data/sda9   D ffff88001ffb9cc8     0   930      2 0x00000000
[276962.652102]  ffff88003794d198 0000000000000046 ffff8800325f4480 0000000000000000
[276962.652113]  ffff88003794c010 0000000000012dc0 0000000000012dc0 0000000000012dc0
[276962.652121]  0000000000012dc0 ffff88003794dfd8 ffff88003794dfd8 0000000000012dc0
[276962.652128] Call Trace:
[276962.652151]  [<ffffffff812a2c22>] ? __blk_run_queue+0x32/0x40
[276962.652160]  [<ffffffff812a31f8>] ? queue_unplugged+0x78/0xb0
[276962.652171]  [<ffffffff815793a4>] schedule+0x24/0x70
[276962.652178]  [<ffffffff8157948c>] io_schedule+0x9c/0xf0
[276962.652187]  [<ffffffff811011a9>] sleep_on_page+0x9/0x10
[276962.652194]  [<ffffffff815778ca>] __wait_on_bit+0x5a/0x90
[276962.652200]  [<ffffffff811011a0>] ? __lock_page+0x70/0x70
[276962.652206]  [<ffffffff8110150f>] wait_on_page_bit+0x6f/0x80
[276962.652215]  [<ffffffff81067190>] ? autoremove_wake_function+0x40/0x40
[276962.652224]  [<ffffffff81112ee1>] ? page_evictable+0x11/0x50
[276962.652231]  [<ffffffff81114e43>] shrink_page_list+0x503/0x790
[276962.652239]  [<ffffffff8111570b>] shrink_inactive_list+0x1bb/0x570
[276962.652246]  [<ffffffff81115d5f>] ? shrink_active_list+0x29f/0x340
[276962.652254]  [<ffffffff81115ef9>] shrink_lruvec+0xf9/0x330
[276962.652262]  [<ffffffff8111660a>] mem_cgroup_shrink_node_zone+0xda/0x140
[276962.652274]  [<ffffffff81160c28>] ? mem_cgroup_reclaimable+0x108/0x150
[276962.652282]  [<ffffffff81163382>] mem_cgroup_soft_reclaim+0xb2/0x140
[276962.652291]  [<ffffffff811634af>] mem_cgroup_soft_limit_reclaim+0x9f/0x270
[276962.652298]  [<ffffffff81116418>] shrink_zones+0x108/0x220
[276962.652305]  [<ffffffff8111776a>] do_try_to_free_pages+0x8a/0x360
[276962.652313]  [<ffffffff81117d90>] try_to_free_pages+0x130/0x180
[276962.652323]  [<ffffffff8110a2fe>] __alloc_pages_slowpath+0x39e/0x790
[276962.652332]  [<ffffffff8110a8ea>] __alloc_pages_nodemask+0x1fa/0x210
[276962.652343]  [<ffffffff81151c72>] kmem_getpages+0x62/0x1d0
[276962.652351]  [<ffffffff81153869>] fallback_alloc+0x189/0x250
[276962.652359]  [<ffffffff8115360d>] ____cache_alloc_node+0x8d/0x160
[276962.652367]  [<ffffffff81153e51>] __kmalloc+0x281/0x290
[276962.652490]  [<ffffffffa02c6e97>] ? kmem_alloc+0x77/0xe0 [xfs]
[276962.652540]  [<ffffffffa02c6e97>] kmem_alloc+0x77/0xe0 [xfs]
[276962.652588]  [<ffffffffa02c6e97>] ? kmem_alloc+0x77/0xe0 [xfs]
[276962.652653]  [<ffffffffa030a334>] xfs_inode_item_format_extents+0x54/0x100 [xfs]
[276962.652714]  [<ffffffffa030a63a>] xfs_inode_item_format+0x25a/0x4f0 [xfs]
[276962.652774]  [<ffffffffa03081a0>] xlog_cil_prepare_log_vecs+0xa0/0x170 [xfs]
[276962.652834]  [<ffffffffa03082a8>] xfs_log_commit_cil+0x38/0x1c0 [xfs]
[276962.652894]  [<ffffffffa0303304>] xfs_trans_commit+0x74/0x260 [xfs]
[276962.652935]  [<ffffffffa02ac70b>] xfs_setfilesize+0x12b/0x130 [xfs]
[276962.652947]  [<ffffffff81076bd0>] ? __migrate_task+0x150/0x150
[276962.652988]  [<ffffffffa02ac985>] xfs_end_io+0x75/0xc0 [xfs]
[276962.652997]  [<ffffffff8105e934>] process_one_work+0x1b4/0x380
[276962.653004]  [<ffffffff8105f294>] rescuer_thread+0x234/0x320
[276962.653011]  [<ffffffff8105f060>] ? free_pwqs+0x30/0x30
[276962.653017]  [<ffffffff81066a86>] kthread+0xc6/0xd0
[276962.653025]  [<ffffffff810669c0>] ? kthread_freezable_should_stop+0x70/0x70
[276962.653034]  [<ffffffff8158303c>] ret_from_fork+0x7c/0xb0
[276962.653041]  [<ffffffff810669c0>] ? kthread_freezable_should_stop+0x70/0x70

$ dmesg | grep "blocked for more than"
[276962.652076] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
[276962.653097] INFO: task kworker/2:2:17823 blocked for more than 480 seconds.
[276962.653940] INFO: task ld:14442 blocked for more than 480 seconds.
[276962.654297] INFO: task ld:14962 blocked for more than 480 seconds.
[277442.652123] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
[277442.653153] INFO: task kworker/2:2:17823 blocked for more than 480 seconds.
[277442.653997] INFO: task ld:14442 blocked for more than 480 seconds.
[277442.654353] INFO: task ld:14962 blocked for more than 480 seconds.
[277922.652069] INFO: task xfs-data/sda9:930 blocked for more than 480 seconds.
[277922.653089] INFO: task kworker/2:2:17823 blocked for more than 480 seconds.

All of them are sitting in io_schedule, triggered from the memcg soft
reclaim, waiting for a wake up (full dmesg is attached). I guess this has
nothing to do with the slab shrinkers directly. It is probably the
priority-0 reclaim which is done in the soft reclaim path.

$ uptime
 13:32pm  up 4 days  2:54,  2 users,  load average: 25.00, 24.97, 24.66

so the current timestamp should be 352854, which means that all of them
happened quite some time ago and the system has obviously recovered from
this state.

What is more important, though, is that we still have the following
tasks stuck in D state for hours:
14442 [<ffffffff811011a9>] sleep_on_page+0x9/0x10
[<ffffffff8110150f>] wait_on_page_bit+0x6f/0x80
[<ffffffff81114e43>] shrink_page_list+0x503/0x790
[<ffffffff8111570b>] shrink_inactive_list+0x1bb/0x570
[<ffffffff81115ef9>] shrink_lruvec+0xf9/0x330
[<ffffffff8111660a>] mem_cgroup_shrink_node_zone+0xda/0x140
[<ffffffff81163382>] mem_cgroup_soft_reclaim+0xb2/0x140
[<ffffffff811634af>] mem_cgroup_soft_limit_reclaim+0x9f/0x270
[<ffffffff81116418>] shrink_zones+0x108/0x220
[<ffffffff8111776a>] do_try_to_free_pages+0x8a/0x360
[<ffffffff81117d90>] try_to_free_pages+0x130/0x180
[<ffffffff8110a2fe>] __alloc_pages_slowpath+0x39e/0x790
[<ffffffff8110a8ea>] __alloc_pages_nodemask+0x1fa/0x210
[<ffffffff8114d1b0>] alloc_pages_vma+0xa0/0x120
[<ffffffff8113fe93>] read_swap_cache_async+0x113/0x160
[<ffffffff8113ffe1>] swapin_readahead+0x101/0x190
[<ffffffff8112e93f>] do_swap_page+0xef/0x5e0
[<ffffffff8112f94d>] handle_pte_fault+0x1bd/0x240
[<ffffffff8112fcbf>] handle_mm_fault+0x2ef/0x400
[<ffffffff8157e927>] __do_page_fault+0x237/0x4f0
[<ffffffff8157ebe9>] do_page_fault+0x9/0x10
[<ffffffff8157b348>] page_fault+0x28/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
14962 [<ffffffff811011a9>] sleep_on_page+0x9/0x10
[<ffffffff8110150f>] wait_on_page_bit+0x6f/0x80
[<ffffffff81114e43>] shrink_page_list+0x503/0x790
[<ffffffff8111570b>] shrink_inactive_list+0x1bb/0x570
[<ffffffff81115ef9>] shrink_lruvec+0xf9/0x330
[<ffffffff8111660a>] mem_cgroup_shrink_node_zone+0xda/0x140
[<ffffffff81163382>] mem_cgroup_soft_reclaim+0xb2/0x140
[<ffffffff811634af>] mem_cgroup_soft_limit_reclaim+0x9f/0x270
[<ffffffff81116418>] shrink_zones+0x108/0x220
[<ffffffff8111776a>] do_try_to_free_pages+0x8a/0x360
[<ffffffff81117d90>] try_to_free_pages+0x130/0x180
[<ffffffff8110a2fe>] __alloc_pages_slowpath+0x39e/0x790
[<ffffffff8110a8ea>] __alloc_pages_nodemask+0x1fa/0x210
[<ffffffff8114d1b0>] alloc_pages_vma+0xa0/0x120
[<ffffffff81129ebb>] do_anonymous_page+0x16b/0x350
[<ffffffff8112f9c5>] handle_pte_fault+0x235/0x240
[<ffffffff8112fcbf>] handle_mm_fault+0x2ef/0x400
[<ffffffff8157e927>] __do_page_fault+0x237/0x4f0
[<ffffffff8157ebe9>] do_page_fault+0x9/0x10
[<ffffffff8157b348>] page_fault+0x28/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
20757 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
[<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs]
[<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs]
[<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs]
[<ffffffffa02c62d0>] xfs_free_eofblocks+0x180/0x250 [xfs]
[<ffffffffa02c68e6>] xfs_release+0x106/0x1d0 [xfs]
[<ffffffffa02b3b20>] xfs_file_release+0x10/0x20 [xfs]
[<ffffffff8116c86d>] __fput+0xbd/0x240
[<ffffffff8116ca49>] ____fput+0x9/0x10
[<ffffffff81063221>] task_work_run+0xb1/0xe0
[<ffffffff810029e0>] do_notify_resume+0x90/0x1d0
[<ffffffff815833a2>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff
20758 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
[<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs]
[<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs]
[<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs]
[<ffffffffa02c5999>] xfs_create+0x1a9/0x5c0 [xfs]
[<ffffffffa02bccca>] xfs_vn_mknod+0x8a/0x1a0 [xfs]
[<ffffffffa02bce0e>] xfs_vn_create+0xe/0x10 [xfs]
[<ffffffff811763dd>] vfs_create+0xad/0xd0
[<ffffffff81177e68>] lookup_open+0x1b8/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff
20761 [<ffffffffa0305fdd>] xlog_grant_head_wait+0xdd/0x1a0 [xfs]
[<ffffffffa0306166>] xlog_grant_head_check+0xc6/0xe0 [xfs]
[<ffffffffa030627f>] xfs_log_reserve+0xff/0x240 [xfs]
[<ffffffffa0302ac4>] xfs_trans_reserve+0x234/0x240 [xfs]
[<ffffffffa02c5999>] xfs_create+0x1a9/0x5c0 [xfs]
[<ffffffffa02bccca>] xfs_vn_mknod+0x8a/0x1a0 [xfs]
[<ffffffffa02bce0e>] xfs_vn_create+0xe/0x10 [xfs]
[<ffffffff811763dd>] vfs_create+0xad/0xd0
[<ffffffff81177e68>] lookup_open+0x1b8/0x1d0
[<ffffffff8117815e>] do_last+0x2de/0x780
[<ffffffff8117ae9a>] path_openat+0xda/0x400
[<ffffffff8117b303>] do_filp_open+0x43/0xa0
[<ffffffff81168ee0>] do_sys_open+0x160/0x1e0
[<ffffffff81168f9c>] sys_open+0x1c/0x20
[<ffffffff815830e9>] system_call_fastpath+0x16/0x1b
[<ffffffffffffffff>] 0xffffffffffffffff

Hohmm, now that I am looking at the pids of the stuck processes, two of
them, 14442 and 14962, are mentioned in the soft lockup warnings. It is
really weird that the lockups stopped quite some time ago (~20h ago).

I am keeping the system in this state in case you want to examine
details via crash again.

Let me know whether you need any further details.

Thanks!
-- 
Michal Hocko
SUSE Labs

View attachment "demon.dmesg" of type "text/plain" (126793 bytes)
