Date: Wed, 9 Oct 2013 20:14:22 -0400
From: Johannes Weiner <hannes@...xchg.org>
To: azurIt <azurit@...ox.sk>
Cc: Michal Hocko <mhocko@...e.cz>,
Andrew Morton <akpm@...ux-foundation.org>,
David Rientjes <rientjes@...gle.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
linux-mm@...ck.org, cgroups@...r.kernel.org, x86@...nel.org,
linux-arch@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [patch 0/7] improve memcg oom killer robustness v2
Hi azur,
On Wed, Oct 09, 2013 at 08:44:50PM +0200, azurIt wrote:
> Johannes,
>
> I'm very sorry to say it, but today something strange happened.. :) I was right at the computer, so I noticed it almost immediately, but I don't have much info. The server stopped responding from the net, but I was already logged in over ssh, which was working quite well (only a little slow). I was able to run commands in the shell, but I didn't do much because I was afraid it would go down for good soon. I noticed a few things:
> - htop looked strange because all CPUs were doing nothing (totally nothing)
> - there was enough free memory
> - server load was about 90 and rising slowly
> - I didn't see ANY process in the 'run' state
> - I also didn't see any process behaving strangely (taking much CPU, memory or so), so it wasn't obvious what to do to fix it
> - I started killing Apache processes; every time I killed some, the CPUs did some work, but it didn't fix the problem
> - finally I ran 'skill -kill apache2' in the shell and everything started working again
> - server monitoring wasn't sending any data, so I have no graphs
> - nothing interesting in the logs
>
> I will send more info when I get some.
Somebody else reported a problem with the upstream patches as well. Any
chance you can capture the stacks of the active but not running tasks?
It sounds like they are stuck on a waitqueue; the question is which
one. I forgot to disable OOM for __GFP_NOFAIL allocations, so they
could invoke the OOM killer and leak an OOM context. Task structs are
not reinitialized between free and reuse, so a different task could
later try to OOM-trylock a memcg that has already been freed, fail,
and wait indefinitely on its OOM waitqueue. There might be a simpler
explanation, but I can't think of one right now.
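For capturing those stacks, something along these lines might work. This is only a sketch: it assumes a Linux /proc with per-task wchan files, and CONFIG_STACKTRACE plus root access for /proc/<pid>/stack; the PID placeholders are not from the report.

```shell
# List sleeping tasks with their wait channel; a task parked on the
# memcg OOM waitqueue should show a mem_cgroup-related symbol.
for pid in /proc/[0-9]*; do
    state=$(awk '/^State:/ { print $2 }' "$pid/status" 2>/dev/null)
    case "$state" in
        S|D) printf '%s\t%s\t%s\n' "${pid#/proc/}" "$state" \
                 "$(cat "$pid/wchan" 2>/dev/null)" ;;
    esac
done
# Full kernel stack of one suspect task (root required):
# cat /proc/<pid>/stack
# Or dump every task's stack into dmesg via sysrq:
# echo t > /proc/sysrq-trigger
```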
But the OOM context is definitely being leaked, so please apply the
following for your next reboot:
---
mm/memcontrol.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5aee2fa..83ad39b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2341,6 +2341,9 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
*/
if (!*ptr && !mm)
goto bypass;
+
+ if (gfp_mask & __GFP_NOFAIL)
+ oom = false;
again:
if (*ptr) { /* css should be a valid one */
memcg = *ptr;
--
1.8.4