linux-kernel - Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive

Message-Id: <201407022140.BFJ13092.QVOSJtFMFHLOFO@I-love.SAKURA.ne.jp>
Date:	Wed, 2 Jul 2014 21:40:52 +0900
From:	Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
To:	david@...morbit.com
Cc:	rientjes@...gle.com, Motohiro.Kosaki@...fujitsu.com,
	riel@...hat.com, kosaki.motohiro@...fujitsu.com,
	fengguang.wu@...el.com, kamezawa.hiroyu@...fujitsu.com,
	akpm@...ux-foundation.org, hch@...radead.org,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	fernando_b1@....ntt.co.jp
Subject: Re: [PATCH] mm/vmscan: Do not block forever at shrink_inactive_list().

Tetsuo Handa wrote:
> Here is a demo patch. If you can join analysis of why memory allocation
> function cannot return for more than 15 minutes under severe memory pressure,
> I'll invite you to private discussion in order to share steps for reproducing
> such memory pressure. A quick test says that memory reclaiming functions are
> too optimistic about reclaiming memory; they are needlessly called again and
> again and again with an assumption that some memory will be reclaimed within
> a few seconds. If I insert some delay, CPU usage during stalls can be reduced.

Here is a formal patch. It includes test results from today's linux.git
tree with https://lkml.org/lkml/2014/5/29/673 applied, gathered in order to
find which deadlock occurs next. The blocking delay on the mutex inside the
ttm shrinker is gone, but a kernel worker thread trying to perform block I/O
in GFP_NOIO context is blocked for more than 10 minutes. I think this is not
good.

---------- Start of patch ----------

From c5274057bd71832fcf0baef64d43a49c20f29dbf Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
Date: Wed, 2 Jul 2014 09:34:51 +0900
Subject: [PATCH] mm: Remember ongoing memory allocation status.

When a stall caused by a memory allocation problem occurs, it is useful
to print how long each thread has been blocked in memory allocation.

This patch records the jiffies value at which the ongoing
__alloc_pages_nodemask() call started, so that the elapsed time can be
read by printing a backtrace or by analyzing a kdump.

Two examples are shown below. You can see that the GFP flags passed to
memory shrinker functions can be GFP_NOIO or GFP_NOFS. Therefore, when
writing memory shrinker functions, please be careful about dependencies
inside the shrinker. For example, unconditional use of GFP_KERNEL may
lead to deadlock. As another example, unconditional use of blocking
lock operations (e.g. mutex_lock()) on a lock that can be reached from
several different GFP contexts may lead to deadlock.

     kworker/2:2     R  running task        0   189      2 0x00000000
     MemAlloc: 624869 jiffies on 0x10
     Workqueue: events_freezable_power_ disk_events_workfn
      ffff880036eacfe0 000000004486d7e5 ffff88007fc83c48 ffffffff81090a3f
      ffff880036eacfe0 0000000000000000 ffff88007fc83c80 ffffffff81090b35
      ffff880036ead210 000000004486d7e5 ffffffff817bada0 0000000000000074
     Call Trace:
      [<ffffffff8158401f>] ? _raw_spin_lock+0x2f/0x50
      [<ffffffff81126b99>] list_lru_count_node+0x19/0x60
      [<ffffffff81171e10>] super_cache_count+0x50/0xd0
      [<ffffffff8111460a>] shrink_slab_node+0x3a/0x1b0
      [<ffffffff811683fc>] ? vmpressure+0x1c/0x80
      [<ffffffff811153f3>] shrink_slab+0x83/0x150
      [<ffffffff81118499>] do_try_to_free_pages+0x2f9/0x530
      [<ffffffff81118768>] try_to_free_pages+0x98/0xd0
      [<ffffffff8110e3f3>] __alloc_pages_nodemask+0x6e3/0xad0
      [<ffffffff8114b2b3>] alloc_pages_current+0xa3/0x170
      [<ffffffff81244d87>] bio_copy_user_iov+0x1c7/0x370
      [<ffffffff81244fc9>] bio_copy_kern+0x49/0xe0
      [<ffffffff8124ed4f>] blk_rq_map_kern+0x6f/0x130
      [<ffffffff81249273>] ? blk_get_request+0x83/0x140
      [<ffffffff81393381>] scsi_execute+0x131/0x160
      [<ffffffff81393484>] scsi_execute_req_flags+0x84/0xf0
      [<ffffffffa01b987c>] sr_check_events+0xbc/0x2d0 [sr_mod]
      [<ffffffffa018f173>] cdrom_check_events+0x13/0x30 [cdrom]
      [<ffffffffa01b9ced>] sr_block_check_events+0x2d/0x30 [sr_mod]
      [<ffffffff81258c75>] disk_check_events+0x55/0x1e0
      [<ffffffff81580e65>] ? _cond_resched+0x35/0x60
      [<ffffffff81258e11>] disk_events_workfn+0x11/0x20
      [<ffffffff8107d64f>] process_one_work+0x15f/0x3d0
      [<ffffffff8107de19>] worker_thread+0x119/0x620
      [<ffffffff8107dd00>] ? rescuer_thread+0x440/0x440
      [<ffffffff8108439c>] kthread+0xdc/0x100
      [<ffffffff810842c0>] ? kthread_create_on_node+0x1a0/0x1a0
      [<ffffffff8158483c>] ret_from_fork+0x7c/0xb0
      [<ffffffff810842c0>] ? kthread_create_on_node+0x1a0/0x1a0
    
     kworker/u16:2   R  running task        0 14009  13723 0x00000080
     MemAlloc: 624951 jiffies on 0x250
      0000000000000000 0000000000000100 0000000000000000 28f5c28f5c28f5c3
      0000000000001705 0000000000000060 0000000000000064 0000000000000064
      ffff880036dfea40 ffffffffffffff10 ffffffff8158401a 0000000000000010
     Call Trace:
      [<ffffffff8158401a>] ? _raw_spin_lock+0x2a/0x50
      [<ffffffff81126b99>] ? list_lru_count_node+0x19/0x60
      [<ffffffff81171e10>] ? super_cache_count+0x50/0xd0
      [<ffffffff8111460a>] ? shrink_slab_node+0x3a/0x1b0
      [<ffffffff811683fc>] ? vmpressure+0x1c/0x80
      [<ffffffff811153f3>] ? shrink_slab+0x83/0x150
      [<ffffffff81118499>] ? do_try_to_free_pages+0x2f9/0x530
      [<ffffffff81118768>] ? try_to_free_pages+0x98/0xd0
      [<ffffffff8110e3f3>] ? __alloc_pages_nodemask+0x6e3/0xad0
      [<ffffffff8114b2b3>] ? alloc_pages_current+0xa3/0x170
      [<ffffffffa0232755>] ? xfs_buf_allocate_memory+0x168/0x245 [xfs]
      [<ffffffffa01cc382>] ? xfs_buf_get_map+0xd2/0x130 [xfs]
      [<ffffffffa01cc964>] ? xfs_buf_read_map+0x24/0xc0 [xfs]
      [<ffffffffa0228609>] ? xfs_trans_read_buf_map+0xa9/0x330 [xfs]
      [<ffffffffa0217999>] ? xfs_imap_to_bp+0x69/0xf0 [xfs]
      [<ffffffffa0217e89>] ? xfs_iread+0x79/0x410 [xfs]
      [<ffffffffa01e35df>] ? kmem_zone_alloc+0x6f/0xf0 [xfs]
      [<ffffffffa01d3be3>] ? xfs_iget+0x1a3/0x510 [xfs]
      [<ffffffffa02121de>] ? xfs_lookup+0xbe/0xf0 [xfs]
      [<ffffffffa01d9023>] ? xfs_vn_lookup+0x73/0xc0 [xfs]
      [<ffffffff81178f88>] ? lookup_real+0x18/0x50
      [<ffffffff8117dced>] ? do_last+0x8bd/0xe90
      [<ffffffff8117adde>] ? link_path_walk+0x27e/0x8e0
      [<ffffffff8117e388>] ? path_openat+0xc8/0x6a0
      [<ffffffff8109700c>] ? select_task_rq_fair+0x3dc/0x7e0
      [<ffffffff8117fc18>] ? do_filp_open+0x48/0xb0
      [<ffffffff81154799>] ? kmem_cache_alloc+0x109/0x170
      [<ffffffff81208b51>] ? security_prepare_creds+0x11/0x20
      [<ffffffff811751ad>] ? do_open_exec+0x1d/0xe0
      [<ffffffff8117704d>] ? do_execve_common.isra.26+0x1bd/0x620
      [<ffffffff81154700>] ? kmem_cache_alloc+0x70/0x170
      [<ffffffff811774c3>] ? do_execve+0x13/0x20
      [<ffffffff81079ae7>] ? ____call_usermodehelper+0x117/0x1b0
      [<ffffffff81079b80>] ? ____call_usermodehelper+0x1b0/0x1b0
      [<ffffffff81079b99>] ? call_helper+0x19/0x20
      [<ffffffff8158483c>] ? ret_from_fork+0x7c/0xb0
      [<ffffffff81079b80>] ? ____call_usermodehelper+0x1b0/0x1b0

Signed-off-by: Tetsuo Handa <penguin-kernel@...ove.SAKURA.ne.jp>
---
 include/linux/sched.h |  2 ++
 kernel/sched/core.c   | 11 +++++++++++
 mm/page_alloc.c       | 19 +++++++++++++++++--
 3 files changed, 30 insertions(+), 2 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 306f4f0..8b5edc7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1655,6 +1655,8 @@ struct task_struct {
 	unsigned int	sequential_io;
 	unsigned int	sequential_io_avg;
 #endif
+	unsigned long memory_allocation_start_jiffies;
+	gfp_t memory_allocation_flags;
 };
 
 /* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3bdf01b..0d1eb3e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4443,6 +4443,16 @@ out_unlock:
 	return retval;
 }
 
+static void print_memalloc_info(const struct task_struct *p)
+{
+	const unsigned long stamp = p->memory_allocation_start_jiffies;
+
+	if (likely(!stamp))
+		return;
+	printk(KERN_INFO "MemAlloc: %lu jiffies on 0x%x\n", jiffies - stamp,
+	       p->memory_allocation_flags);
+}
+
 static const char stat_nam[] = TASK_STATE_TO_CHAR_STR;
 
 void sched_show_task(struct task_struct *p)
@@ -4475,6 +4485,7 @@ void sched_show_task(struct task_struct *p)
 		task_pid_nr(p), ppid,
 		(unsigned long)task_thread_info(p)->flags);
 
+	print_memalloc_info(p);
 	print_worker_info(KERN_INFO, p);
 	show_stack(p, NULL);
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 20d17f8..cac0d32 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2721,6 +2721,17 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	unsigned int cpuset_mems_cookie;
 	int alloc_flags = ALLOC_WMARK_LOW|ALLOC_CPUSET|ALLOC_FAIR;
 	int classzone_idx;
+	bool memory_allocation_recursion = false;
+	unsigned long *stamp = &current->memory_allocation_start_jiffies;
+
+	if (likely(!*stamp)) {
+		*stamp = jiffies;
+		if (unlikely(!*stamp))
+			(*stamp)--;
+		current->memory_allocation_flags = gfp_mask;
+	} else {
+		memory_allocation_recursion = true;
+	}
 
 	gfp_mask &= gfp_allowed_mask;
 
@@ -2729,7 +2740,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	might_sleep_if(gfp_mask & __GFP_WAIT);
 
 	if (should_fail_alloc_page(gfp_mask, order))
-		return NULL;
+		goto nopage;
 
 	/*
 	 * Check the zones suitable for the gfp_mask contain at least one
@@ -2737,7 +2748,7 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
 	 * of GFP_THISNODE and a memoryless node
 	 */
 	if (unlikely(!zonelist->_zonerefs->zone))
-		return NULL;
+		goto nopage;
 
 retry_cpuset:
 	cpuset_mems_cookie = read_mems_allowed_begin();
@@ -2799,6 +2810,10 @@ out:
 	if (unlikely(!page && read_mems_allowed_retry(cpuset_mems_cookie)))
 		goto retry_cpuset;
 
+nopage:
+	if (likely(!memory_allocation_recursion))
+		current->memory_allocation_start_jiffies = 0;
+
 	return page;
 }
 EXPORT_SYMBOL(__alloc_pages_nodemask);
-- 
1.8.3.1

---------- End of patch ----------
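
The bookkeeping that the patch adds to __alloc_pages_nodemask() can be
modeled in userspace roughly as follows. This is a minimal sketch, not
kernel code: struct task_model and the alloc_enter(), alloc_exit() and
alloc_elapsed() helpers are hypothetical names, and the clock is passed in
as a plain number instead of being read from jiffies. It shows why a stamp
of 0 is reserved to mean "not allocating" and why a nested call leaves the
outer stamp alone.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two fields the patch adds to task_struct. */
struct task_model {
	unsigned long start_stamp;	/* 0 means "no allocation in progress" */
	unsigned int flags;		/* gfp mask of the outermost allocation */
};

/*
 * Record the start of an allocation. Returns true when this call is nested
 * inside an allocation that is already being tracked.
 */
static bool alloc_enter(struct task_model *t, unsigned long now,
			unsigned int gfp)
{
	if (t->start_stamp != 0)
		return true;		/* nested call: keep the outer stamp */
	t->start_stamp = now;
	if (t->start_stamp == 0)
		t->start_stamp--;	/* clock reads 0: wrap to the maximum value */
	t->flags = gfp;
	return false;
}

/* Clear the stamp only when leaving the outermost allocation. */
static void alloc_exit(struct task_model *t, bool nested)
{
	if (!nested)
		t->start_stamp = 0;
}

/*
 * Ticks spent so far, or 0 when no allocation is in progress.
 * Unsigned subtraction keeps the result sane across counter wraparound.
 */
static unsigned long alloc_elapsed(const struct task_model *t,
				   unsigned long now)
{
	return t->start_stamp ? now - t->start_stamp : 0;
}
```

When the clock happens to read 0 at entry, the stored stamp becomes the
largest unsigned long, so the reported time is off by one tick. That is the
price of keeping 0 free as the "idle" marker.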

The results above are from a machine with 2GB RAM and no swap space, while
the results below are from 2GB RAM and 4GB swap space.

If swap space is available, CPU usage during stalls tends to range from 100%
down to 0%, because waiting for disk I/O and congestion_wait() put processes
to sleep.

I managed to reproduce (so far only once) a 0% CPU stall that lasted for more
than 20 minutes, with threads blocked at congestion_wait() inside
shrink_inactive_list().

    kthreadd        D ffff88007fcd31c0     0     2      0 0x00000000
    MemAlloc: 1374202 jiffies on 0x2000d0
     ffff88007c41f7c0 0000000000000046 ffff88007c4088e0 00000000000131c0
     ffff88007c41ffd8 00000000000131c0 ffff88007c46f360 0000000000000286
     ffff88007a14fc02 ffff88007c41f730 ffffffff81168040 ffffffff819cc300
    Call Trace:
     [<ffffffff81168040>] ? swap_cgroup_record+0x50/0x80
     [<ffffffff8106f766>] ? lock_timer_base.isra.26+0x26/0x50
     [<ffffffff81580b64>] schedule+0x24/0x70
     [<ffffffff8157ff26>] schedule_timeout+0x126/0x1c0
     [<ffffffff810be5d3>] ? ktime_get_ts+0x43/0xe0
     [<ffffffff8106f2c0>] ? add_timer_on+0xa0/0xa0
     [<ffffffff815810e6>] io_schedule_timeout+0x96/0xf0
     [<ffffffff8112123d>] congestion_wait+0x7d/0xd0
     [<ffffffff810a3da0>] ? __wake_up_sync+0x10/0x10
     [<ffffffff81116f0d>] shrink_inactive_list+0x37d/0x550
     [<ffffffff81117abb>] shrink_lruvec+0x52b/0x730
     [<ffffffff81117d54>] shrink_zone+0x94/0x1e0
     [<ffffffff811182c8>] do_try_to_free_pages+0x128/0x530
     [<ffffffff81118768>] try_to_free_pages+0x98/0xd0
     [<ffffffff8110e3f3>] __alloc_pages_nodemask+0x6e3/0xad0
     [<ffffffff8110e914>] alloc_kmem_pages_node+0x74/0x160
     [<ffffffff810617d5>] ? copy_process.part.32+0x125/0x1bb0
     [<ffffffff810617f6>] copy_process.part.32+0x146/0x1bb0
     [<ffffffff815805db>] ? __schedule+0x29b/0x800
     [<ffffffff810842c0>] ? kthread_create_on_node+0x1a0/0x1a0
     [<ffffffff81063427>] do_fork+0xd7/0x340
     [<ffffffff81091696>] ? set_cpus_allowed_ptr+0x76/0x120
     [<ffffffff810636b1>] kernel_thread+0x21/0x30
     [<ffffffff81084daa>] kthreadd+0x16a/0x1d0
     [<ffffffff81084c40>] ? kthread_create_on_cpu+0x60/0x60
     [<ffffffff8158483c>] ret_from_fork+0x7c/0xb0
     [<ffffffff81084c40>] ? kthread_create_on_cpu+0x60/0x60

    kswapd0         S ffff88007fc931c0     0    53      2 0x00000000
     ffff880079f17e10 0000000000000046 ffff88007c119aa0 00000000000131c0
     ffff880079f17fd8 00000000000131c0 ffff88007c46ea80 ffff880079f17dc0
     ffff880079f17dc0 0000000000000286 ffff88007c50c000 ffff880079f17d88
    Call Trace:
     [<ffffffff8106fc62>] ? try_to_del_timer_sync+0x52/0x80
     [<ffffffff8110a22c>] ? zone_watermark_ok_safe+0xac/0xc0
     [<ffffffff811150e9>] ? zone_balanced+0x19/0x50
     [<ffffffff811151ef>] ? pgdat_balanced+0xcf/0xf0
     [<ffffffff81580b64>] schedule+0x24/0x70
     [<ffffffff81119329>] kswapd+0x2f9/0x3c0
     [<ffffffff810a3da0>] ? __wake_up_sync+0x10/0x10
     [<ffffffff81119030>] ? balance_pgdat+0x640/0x640
     [<ffffffff8108439c>] kthread+0xdc/0x100
     [<ffffffff810842c0>] ? kthread_create_on_node+0x1a0/0x1a0
     [<ffffffff8158483c>] ret_from_fork+0x7c/0xb0
     [<ffffffff810842c0>] ? kthread_create_on_node+0x1a0/0x1a0

    kworker/u16:1   D ffff88007c11e1a0     0    65      2 0x00000000
    MemAlloc: 1455121 jiffies on 0x2000d0
    Workqueue: khelper __call_usermodehelper
     ffff880036c0b6b0 0000000000000046 ffff88007c11e1a0 00000000000131c0
     ffff880036c0bfd8 00000000000131c0 ffff88007c2091c0 ffff880036c0b668
     ffffea0001ff6a40 000000000000001d ffff88007cffec00 ffffea0001dfb000
    Call Trace:
     [<ffffffff8106f766>] ? lock_timer_base.isra.26+0x26/0x50
     [<ffffffff81580b64>] schedule+0x24/0x70
     [<ffffffff8157ff26>] schedule_timeout+0x126/0x1c0
     [<ffffffff810be5d3>] ? ktime_get_ts+0x43/0xe0
     [<ffffffff8106f2c0>] ? add_timer_on+0xa0/0xa0
     [<ffffffff815810e6>] io_schedule_timeout+0x96/0xf0
     [<ffffffff8112123d>] congestion_wait+0x7d/0xd0
     [<ffffffff810a3da0>] ? __wake_up_sync+0x10/0x10
     [<ffffffff81116f0d>] shrink_inactive_list+0x37d/0x550
     [<ffffffff81117abb>] shrink_lruvec+0x52b/0x730
     [<ffffffffa01d42d7>] ? xfs_reclaim_inodes_count+0x37/0x50 [xfs]
     [<ffffffffa01d42d7>] ? xfs_reclaim_inodes_count+0x37/0x50 [xfs]
     [<ffffffff81117d54>] shrink_zone+0x94/0x1e0
     [<ffffffff811182c8>] do_try_to_free_pages+0x128/0x530
     [<ffffffff81118768>] try_to_free_pages+0x98/0xd0
     [<ffffffff8110e3f3>] __alloc_pages_nodemask+0x6e3/0xad0
     [<ffffffff8110e914>] alloc_kmem_pages_node+0x74/0x160
     [<ffffffff810617d5>] ? copy_process.part.32+0x125/0x1bb0
     [<ffffffff810617f6>] copy_process.part.32+0x146/0x1bb0
     [<ffffffff81094255>] ? sched_clock_cpu+0x85/0xc0
     [<ffffffff8109b5ac>] ? put_prev_entity+0x2c/0x2c0
     [<ffffffff8100c5c4>] ? __switch_to+0xf4/0x5a0
     [<ffffffff81079b80>] ? ____call_usermodehelper+0x1b0/0x1b0
     [<ffffffff815805db>] ? __schedule+0x29b/0x800
     [<ffffffff81063427>] do_fork+0xd7/0x340
     [<ffffffffa0079a2b>] ? mpt_fault_reset_work+0x9b/0x45c [mptbase]
     [<ffffffff810636b1>] kernel_thread+0x21/0x30
     [<ffffffff81079bf9>] __call_usermodehelper+0x29/0x90
     [<ffffffff8107d64f>] process_one_work+0x15f/0x3d0
     [<ffffffff8107de19>] worker_thread+0x119/0x620
     [<ffffffff8107dd00>] ? rescuer_thread+0x440/0x440
     [<ffffffff8108439c>] kthread+0xdc/0x100
     [<ffffffff810842c0>] ? kthread_create_on_node+0x1a0/0x1a0
     [<ffffffff8158483c>] ret_from_fork+0x7c/0xb0
     [<ffffffff810842c0>] ? kthread_create_on_node+0x1a0/0x1a0
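
The "MemAlloc:" lines in these traces can be read with a small userspace
helper like the one below. It is only a sketch under two assumptions that
this mail does not state: that the test kernel ticks at HZ=1000 (which
would make 624869 jiffies about 10 minutes and 1374202 jiffies about 22
minutes, in line with the durations reported here), and that the GFP bit
values match include/linux/gfp.h of kernels from that period. All function
names are hypothetical, and only the bits that occur in the traces are
listed.

```c
#include <stdio.h>
#include <string.h>

/* ASSUMPTION: tick rate of the test kernel; not stated in the report. */
#define ASSUMED_HZ 1000UL

/* ASSUMPTION: bit values as in include/linux/gfp.h of kernels around v3.16. */
static const struct { unsigned int bit; const char *name; } gfp_bits[] = {
	{ 0x10u,     "__GFP_WAIT" },
	{ 0x40u,     "__GFP_IO" },
	{ 0x80u,     "__GFP_FS" },
	{ 0x200u,    "__GFP_NOWARN" },
	{ 0x200000u, "__GFP_NOTRACK" },
};

/* Parse "MemAlloc: <n> jiffies on 0x<mask>". Returns 0 on success, -1 otherwise. */
static int parse_memalloc(const char *line, unsigned long *ticks,
			  unsigned int *mask)
{
	if (sscanf(line, " MemAlloc: %lu jiffies on 0x%x", ticks, mask) == 2)
		return 0;
	return -1;
}

/* Whole minutes spent in the allocator, under the ASSUMED_HZ tick rate. */
static unsigned long ticks_to_minutes(unsigned long ticks)
{
	return ticks / ASSUMED_HZ / 60;
}

/*
 * Write the names of the known bits set in mask into buf, separated by '|'.
 * Bits missing from the table above are silently ignored. len must be > 0.
 */
static void describe_gfp(unsigned int mask, char *buf, size_t len)
{
	size_t i;

	buf[0] = '\0';
	for (i = 0; i < sizeof(gfp_bits) / sizeof(gfp_bits[0]); i++) {
		if (!(mask & gfp_bits[i].bit))
			continue;
		if (buf[0] != '\0')
			strncat(buf, "|", len - strlen(buf) - 1);
		strncat(buf, gfp_bits[i].name, len - strlen(buf) - 1);
	}
}

/* Reclaim that takes filesystem locks is only safe when __GFP_FS is set. */
static int may_enter_fs(unsigned int mask)
{
	return (mask & 0x80u) != 0;
}
```

Under these assumptions 0x10 is __GFP_WAIT alone (GFP_NOIO), 0x250 is
GFP_NOFS plus __GFP_NOWARN, and 0x2000d0 would be GFP_KERNEL plus
__GFP_NOTRACK. A shrinker reached with the first two masks must not
recurse into the filesystem.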

It seems to me that nobody was able to wake up kswapd. Therefore,
I think loops like

	while (unlikely(too_many_isolated(zone, file, sc))) {
		congestion_wait(BLK_RW_ASYNC, HZ/10);
	
		/* We are about to die and free our memory. Return now. */
		if (fatal_signal_pending(current))
			return SWAP_CLUSTER_MAX;
	}

which assume that somebody else will wake up kswapd, and that kswapd will
then do the work that makes too_many_isolated() return 0, are not good.
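
For illustration, a wait loop with an upper bound could look like the
userspace sketch below. This is not the change proposed in this thread and
not kernel code: bounded_wait(), the result codes and the demo_* callbacks
are hypothetical names, and nap() merely stands in for a sleep such as
congestion_wait(). The point is that the caller gets control back when the
condition never clears, instead of depending on somebody else to make
progress.

```c
#include <stdbool.h>

/* Hypothetical outcome codes for this sketch. */
enum wait_result { WAIT_OK, WAIT_FATAL_SIGNAL, WAIT_TIMED_OUT };

/*
 * Poll still_blocked() and sleep via nap() between polls, but nap at most
 * max_rounds times. Unlike the loop quoted above, this one gives up when
 * the condition never clears, so the caller can try another strategy.
 */
static enum wait_result bounded_wait(bool (*still_blocked)(void *),
				     bool (*fatal_pending)(void *),
				     void (*nap)(void *),
				     void *ctx, unsigned int max_rounds)
{
	unsigned int round;

	for (round = 0; still_blocked(ctx); round++) {
		if (round >= max_rounds)
			return WAIT_TIMED_OUT;
		nap(ctx);		/* stands in for congestion_wait() */
		if (fatal_pending(ctx))
			return WAIT_FATAL_SIGNAL;
	}
	return WAIT_OK;
}

/* Demo context: the condition clears once naps reaches clears_after. */
struct demo_ctx {
	unsigned int clears_after;
	unsigned int naps;
};

static bool demo_blocked(void *p)
{
	struct demo_ctx *c = p;

	return c->naps < c->clears_after;
}

static bool demo_fatal(void *p)
{
	(void)p;
	return false;			/* the demo never receives a fatal signal */
}

static void demo_nap(void *p)
{
	struct demo_ctx *c = p;

	c->naps++;			/* count the sleep instead of sleeping */
}
```

With a context whose condition clears after 3 naps, bounded_wait() returns
WAIT_OK. With a condition that never clears and max_rounds set to 10, it
returns WAIT_TIMED_OUT after 10 naps.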