linux-kernel - Re: [patch 0/7] improve memcg oom killer robustness v2

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20130918180455.GD856@cmpxchg.org>
Date:	Wed, 18 Sep 2013 14:04:55 -0400
From:	Johannes Weiner <hannes@...xchg.org>
To:	azurIt <azurit@...ox.sk>
Cc:	Michal Hocko <mhocko@...e.cz>,
	Andrew Morton <akpm@...ux-foundation.org>,
	David Rientjes <rientjes@...gle.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	linux-mm@...ck.org, cgroups@...r.kernel.org, x86@...nel.org,
	linux-arch@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [patch 0/7] improve memcg oom killer robustness v2

On Wed, Sep 18, 2013 at 04:03:04PM +0200, azurIt wrote:
> > CC: "Johannes Weiner" <hannes@...xchg.org>, "Andrew Morton" <akpm@...ux-foundation.org>, "David Rientjes" <rientjes@...gle.com>, "KAMEZAWA Hiroyuki" <kamezawa.hiroyu@...fujitsu.com>, "KOSAKI Motohiro" <kosaki.motohiro@...fujitsu.com>, linux-mm@...ck.org, cgroups@...r.kernel.org, x86@...nel.org, linux-arch@...r.kernel.org, linux-kernel@...r.kernel.org
> >On Tue 17-09-13 13:15:35, azurIt wrote:
> >[...]
> >> Is something unusual on this stack?
> >> 
> >> 
> >> [<ffffffff810d1a5e>] dump_header+0x7e/0x1e0
> >> [<ffffffff810d195f>] ? find_lock_task_mm+0x2f/0x70
> >> [<ffffffff810d1f25>] oom_kill_process+0x85/0x2a0
> >> [<ffffffff810d24a8>] mem_cgroup_out_of_memory+0xa8/0xf0
> >> [<ffffffff8110fb76>] mem_cgroup_oom_synchronize+0x2e6/0x310
> >> [<ffffffff8110efc0>] ? mem_cgroup_uncharge_page+0x40/0x40
> >> [<ffffffff810d2703>] pagefault_out_of_memory+0x13/0x130
> >> [<ffffffff81026f6e>] mm_fault_error+0x9e/0x150
> >> [<ffffffff81027424>] do_page_fault+0x404/0x490
> >> [<ffffffff810f952c>] ? do_mmap_pgoff+0x3dc/0x430
> >> [<ffffffff815cb87f>] page_fault+0x1f/0x30
> >
> >This is a regular memcg OOM killer. Which dumps messages about what is
> >going to do. So no, nothing unusual, except if it was like that for ever
> >which would mean that oom_kill_process is in the endless loop. But a
> >single stack doesn't tell us much.
> >
> >Just a note. When you see something hogging a cpu and you are not sure
> >whether it might be in an endless loop inside the kernel it makes sense
> >to take several snaphosts of the stack trace and see if it changes. If
> >not and the process is not sleeping (there is no schedule on the trace)
> >then it might be looping somewhere waiting for Godot. If it is sleeping
> >then it is slightly harder because you would have to identify what it is
> >waiting for which requires to know a deeper context.
> >-- 
> >Michal Hocko
> >SUSE Labs
> 
> 
> 
> I was finally able to get stack of problematic process :) I saved it two times from the same process, as Michal suggested (i wasn't able to take more). Here it is:
> 
> First (doesn't look very helpfull):
> [<ffffffffffffffff>] 0xffffffffffffffff
> 
> 
> Second:
> [<ffffffff810e17d1>] shrink_zone+0x481/0x650
> [<ffffffff810e2ade>] do_try_to_free_pages+0xde/0x550
> [<ffffffff810e310b>] try_to_free_pages+0x9b/0x120
> [<ffffffff81148ccd>] free_more_memory+0x5d/0x60
> [<ffffffff8114931d>] __getblk+0x14d/0x2c0
> [<ffffffff8114c973>] __bread+0x13/0xc0
> [<ffffffff811968a8>] ext3_get_branch+0x98/0x140
> [<ffffffff81197497>] ext3_get_blocks_handle+0xd7/0xdc0
> [<ffffffff81198244>] ext3_get_block+0xc4/0x120
> [<ffffffff81155b8a>] do_mpage_readpage+0x38a/0x690
> [<ffffffff81155ffb>] mpage_readpages+0xfb/0x160
> [<ffffffff811972bd>] ext3_readpages+0x1d/0x20
> [<ffffffff810d9345>] __do_page_cache_readahead+0x1c5/0x270
> [<ffffffff810d9411>] ra_submit+0x21/0x30
> [<ffffffff810cfb90>] filemap_fault+0x380/0x4f0
> [<ffffffff810ef908>] __do_fault+0x78/0x5a0
> [<ffffffff810f2b24>] handle_pte_fault+0x84/0x940
> [<ffffffff810f354a>] handle_mm_fault+0x16a/0x320
> [<ffffffff8102715b>] do_page_fault+0x13b/0x490
> [<ffffffff815cb87f>] page_fault+0x1f/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff

Ah, crap.  I'm sorry.  You even showed us this exact trace before in
another context, but I did not fully realize what __getblk() is doing.

My subsequent patches made a charge attempt return -ENOMEM without
reclaim if the memcg is under OOM.  And so the reason you have these
reclaim livelocks is because __getblk never fails on -ENOMEM.  When
the allocation returns -ENOMEM, it invokes GLOBAL DIRECT RECLAIM and
tries again in an endless loop.  The memcg code would previously just
loop inside the charge, reclaiming and killing, until the allocation
succeeded.  But the new code relies on the fault stack being unwound
to complete the OOM kill.  And since the stack is not unwound with
__getblk() looping around the allocation there is no more memcg
reclaim AND no memcg OOM kill, thus no chance of exiting.

That code is weird but really old, so it may take a while to evaluate
all the callers as to whether this can be changed.

In the meantime, I would just allow __getblk to bypass the memcg limit
when it still can't charge after reclaim.  Does the below get your
machine back on track?

---

diff --git a/fs/buffer.c b/fs/buffer.c
index 19d8eb7..83c8716 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1085,6 +1085,8 @@ grow_buffers(struct block_device *bdev, sector_t block, int size)
 static struct buffer_head *
 __getblk_slow(struct block_device *bdev, sector_t block, int size)
 {
+	struct buffer_head *bh = NULL;
+
 	/* Size must be multiple of hard sectorsize */
 	if (unlikely(size & (bdev_logical_block_size(bdev)-1) ||
 			(size < 512 || size > PAGE_SIZE))) {
@@ -1097,20 +1099,23 @@ __getblk_slow(struct block_device *bdev, sector_t block, int size)
 		return NULL;
 	}
 
+	mem_cgroup_oom_enable();
 	for (;;) {
-		struct buffer_head * bh;
 		int ret;
 
 		bh = __find_get_block(bdev, block, size);
 		if (bh)
-			return bh;
+			break;
 
 		ret = grow_buffers(bdev, block, size);
 		if (ret < 0)
-			return NULL;
+			break;
 		if (ret == 0)
 			free_more_memory();
 	}
+	mem_cgroup_oom_disable();
+	mem_cgroup_oom_synchronize(false);
+	return bh;
 }
 
 /*
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 325da07..e441647 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,16 +120,15 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
 extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 					struct task_struct *p);
 
-static inline void mem_cgroup_enable_oom(void)
+static inline void mem_cgroup_oom_enable(void)
 {
-	WARN_ON(current->memcg_oom.may_oom);
-	current->memcg_oom.may_oom = 1;
+	current->memcg_oom.may_oom++;
 }
 
-static inline void mem_cgroup_disable_oom(void)
+static inline void mem_cgroup_oom_disable(void)
 {
 	WARN_ON(!current->memcg_oom.may_oom);
-	current->memcg_oom.may_oom = 0;
+	current->memcg_oom.may_oom--;
 }
 
 static inline bool task_in_memcg_oom(struct task_struct *p)
@@ -352,11 +351,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
 {
 }
 
-static inline void mem_cgroup_enable_oom(void)
+static inline void mem_cgroup_oom_enable(void)
 {
 }
 
-static inline void mem_cgroup_disable_oom(void)
+static inline void mem_cgroup_oom_disable(void)
 {
 }
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fb1f145..dc71a17 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1571,7 +1571,7 @@ struct task_struct {
 	struct memcg_oom_info {
 		struct mem_cgroup *memcg;
 		gfp_t gfp_mask;
-		unsigned int may_oom:1;
+		unsigned int may_oom;
 	} memcg_oom;
 #endif
 #ifdef CONFIG_HAVE_HW_BREAKPOINT
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f565857..1441fc5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1878,7 +1878,6 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
 	 */
 	css_get(&memcg->css);
 	current->memcg_oom.memcg = memcg;
-	mem_cgroup_mark_under_oom(memcg);
 	current->memcg_oom.gfp_mask = mask;
 }
 
@@ -1930,6 +1929,7 @@ bool mem_cgroup_oom_synchronize(bool handle)
 	 * under OOM is always welcomed, use TASK_KILLABLE here.
 	 */
 	prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+	mem_cgroup_mark_under_oom(memcg);
 
 	locked = mem_cgroup_oom_trylock(memcg);
 
@@ -1937,10 +1937,12 @@ bool mem_cgroup_oom_synchronize(bool handle)
 		mem_cgroup_oom_notify(memcg);
 
 	if (locked && !memcg->oom_kill_disable) {
+		mem_cgroup_unmark_under_oom(memcg);
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 		mem_cgroup_out_of_memory(memcg, current->memcg_oom.gfp_mask);
 	} else {
 		schedule();
+		mem_cgroup_unmark_under_oom(memcg);
 		finish_wait(&memcg_oom_waitq, &owait.wait);
 	}
 
@@ -1954,7 +1956,6 @@ bool mem_cgroup_oom_synchronize(bool handle)
 		memcg_oom_recover(memcg);
 	}
 cleanup:
-	mem_cgroup_unmark_under_oom(memcg);
 	current->memcg_oom.memcg = NULL;
 	css_put(&memcg->css);
 	return true;
@@ -2340,10 +2341,11 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
 		goto bypass;
 
 	/*
-	 * Task already OOMed, just get out of here.
+	 * Task already OOMed, just allow it to finish the fault as
+	 * quickly as possible to start the OOM handling.
 	 */
 	if (unlikely(current->memcg_oom.memcg))
-		goto nomem;
+		goto bypass;
 
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
@@ -2417,9 +2419,6 @@ again:
 		if (oom && !nr_reclaim_retries)
 			enter_oom = true;
 
-		if (atomic_read(&memcg->under_oom))
-			enter_oom = true;
-
 		ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, enter_oom);
 		switch (ret) {
 		case CHARGE_OK:
diff --git a/mm/memory.c b/mm/memory.c
index 20c43a0..3d82ef9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3513,12 +3513,12 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 	 * space.  Kernel faults are handled more gracefully.
 	 */
 	if (flags & FAULT_FLAG_USER)
-		mem_cgroup_enable_oom();
+		mem_cgroup_oom_enable();
 
 	ret = __handle_mm_fault(mm, vma, address, flags);
 
 	if (flags & FAULT_FLAG_USER) {
-		mem_cgroup_disable_oom();
+		mem_cgroup_oom_disable();
 		/*
 		 * The task may have entered a memcg OOM situation but
 		 * if the allocation error was handled gracefully (no
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/