Date:	Fri, 08 Feb 2013 16:58:05 +0100
From:	"azurIt" <azurit@...ox.sk>
To:	Michal Hocko <mhocko@...e.cz>
Cc:	<linux-kernel@...r.kernel.org>, <linux-mm@...ck.org>,
	cgroups mailinglist <cgroups@...r.kernel.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	Johannes Weiner <hannes@...xchg.org>
Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set

>Which means that the oom killer didn't try to kill any task more than
>once which is good because it tells us that the killed task manages to
>die before we trigger oom again. So this is definitely not a deadlock.
>You are just hitting OOM very often.
>$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
>      1 Task in /1091/uid killed as a result of limit of /1091
>      1 Task in /1223/uid killed as a result of limit of /1223
>      1 Task in /1229/uid killed as a result of limit of /1229
>      1 Task in /1255/uid killed as a result of limit of /1255
>      1 Task in /1424/uid killed as a result of limit of /1424
>      1 Task in /1470/uid killed as a result of limit of /1470
>      1 Task in /1567/uid killed as a result of limit of /1567
>      2 Task in /1080/uid killed as a result of limit of /1080
>      3 Task in /1381/uid killed as a result of limit of /1381
>      4 Task in /1185/uid killed as a result of limit of /1185
>      4 Task in /1289/uid killed as a result of limit of /1289
>      4 Task in /1709/uid killed as a result of limit of /1709
>      5 Task in /1279/uid killed as a result of limit of /1279
>      6 Task in /1020/uid killed as a result of limit of /1020
>      6 Task in /1527/uid killed as a result of limit of /1527
>      9 Task in /1388/uid killed as a result of limit of /1388
>     17 Task in /1281/uid killed as a result of limit of /1281
>     22 Task in /1599/uid killed as a result of limit of /1599
>     30 Task in /1155/uid killed as a result of limit of /1155
>     31 Task in /1258/uid killed as a result of limit of /1258
>     71 Task in /1293/uid killed as a result of limit of /1293
>
>So the group 1293 suffers the most. I would check how much memory the
>workload in the group really needs because this level of OOM cannot
>possibly be healthy.



I took the kernel log from yesterday from the same time frame:

$ grep "killed as a result of limit" kern2.log | sed 's@.*\] @@' | sort | uniq -c | sort -k1 -n
      1 Task in /1252/uid killed as a result of limit of /1252
      1 Task in /1709/uid killed as a result of limit of /1709
      2 Task in /1185/uid killed as a result of limit of /1185
      2 Task in /1388/uid killed as a result of limit of /1388
      2 Task in /1567/uid killed as a result of limit of /1567
      2 Task in /1650/uid killed as a result of limit of /1650
      3 Task in /1527/uid killed as a result of limit of /1527
      5 Task in /1552/uid killed as a result of limit of /1552
   1634 Task in /1258/uid killed as a result of limit of /1258
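To compare the two nights group by group, the pipeline above can be wrapped in a small function. This is only a minimal sketch: the function name `oom_kills_per_group` is hypothetical and not from the thread. It strips everything up to `limit of `, so each output line is just a kill count and a cgroup path.

```shell
# Hypothetical helper (the name is illustrative, not from the thread).
# Prints "<count> <cgroup>" for every memcg OOM kill found in the log
# file given as $1, sorted by ascending count.
oom_kills_per_group() {
    grep "killed as a result of limit of " "$1" \
        | sed 's@.*killed as a result of limit of @@' \
        | sort | uniq -c \
        | awk '{ print $1, $2 }' \
        | sort -k1,1n -k2,2
}
```

Running it once per log, for example `oom_kills_per_group kern2.log`, gives two short lists that can be compared side by side.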

As you can see, there were many more OOMs in '1258' and no problems like the ones this past night (well, there were never such problems before :) ). As I said, cgroup 1258 was freezing every few minutes with your latest patch, so there must be something wrong (it usually freezes about once per day). And it was really frozen (I checked that); the symptoms were:
 - cannot strace any of the cgroup's processes
 - no new processes were started, still the same processes were 'running'
 - kernel was unable to resolve this on its own
 - all processes together were taking 100% CPU
 - the whole memory limit was used
(see memcg-bug-4.tar.gz for more info)
Unfortunately I forgot to check whether killing only a few of the processes would resolve it (I always killed them all yesterday night). I don't know if it was in a deadlock or not, but the kernel was definitely unable to resolve the problem. And there is still the mystery of the two frozen processes which cannot be killed.
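
The "whole memory limit was used" symptom can be read directly from the cgroup v1 memory controller files of the affected group. The following is a minimal sketch, not something used in this thread: the function name `memcg_status` is hypothetical, and the directory passed to it depends on where the memory controller is mounted on the system.

```shell
# Hypothetical helper (name and argument are illustrative).
# $1 is the directory of one memory cgroup under the cgroup v1 memory
# controller mount point (the mount point is an assumption and differs
# between systems).
memcg_status() {
    # Current charge and configured limit, both in bytes.
    usage=$(cat "$1/memory.usage_in_bytes")
    limit=$(cat "$1/memory.limit_in_bytes")
    # Number of times the group has hit its limit.
    failcnt=$(cat "$1/memory.failcnt")
    # The v1 "tasks" file lists one PID per line.
    ntasks=$(awk 'END { print NR }' "$1/tasks")
    echo "usage=$usage limit=$limit failcnt=$failcnt tasks=$ntasks"
    # oom_kill_disable and under_oom flags as reported by the kernel.
    cat "$1/memory.oom_control"
}
```

A call such as `memcg_status /sys/fs/cgroup/memory/1258` (the path is an assumption) prints the usage next to the limit, so a frozen group can be told apart from one that is merely busy.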

By the way, I KNOW that so many OOMs are not healthy, but the client simply doesn't want to buy more memory. He knows about the problem of the insufficient memory limit.

Thank you.


azur