linux-kernel - Re: [patch 0/7] improve memcg oom killer robustness v2


Message-Id: <20131011005945.33D49C21@pobox.sk>
Date:	Fri, 11 Oct 2013 00:59:45 +0200
From:	"azurIt" <azurit@...ox.sk>
To:	Johannes Weiner <hannes@...xchg.org>
Cc:	Michal Hocko <mhocko@...e.cz>,
	Andrew Morton <akpm@...ux-foundation.org>,
	David Rientjes <rientjes@...gle.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	<linux-mm@...ck.org>, <cgroups@...r.kernel.org>, <x86@...nel.org>,
	<linux-arch@...r.kernel.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [patch 0/7] improve memcg oom killer robustness v2

>On Wed, Oct 09, 2013 at 08:44:50PM +0200, azurIt wrote:
>> Johannes,
>> 
>> I'm very sorry to say it, but today something strange happened.. :) I was right at the computer so I noticed it almost immediately, but I don't have much info. The server stopped responding from the net, but I was already logged in over ssh, which was working quite fine (only a little slow). I was able to run commands in the shell, but I didn't do much because I was afraid that it would go down for good soon. I noticed a few things:
>>  - htop looked strange because all CPUs were doing nothing (totally nothing)
>>  - there was enough free memory
>>  - server load was about 90 and was rising slowly
>>  - I didn't see ANY process in the 'run' state
>>  - I also didn't see any process with strange behavior (taking much CPU, memory or so), so it wasn't obvious what to do to fix it
>>  - I started to kill Apache processes; every time I killed some, the CPUs did some work, but it didn't fix the problem
>>  - finally I did 'skill -kill apache2' in the shell and everything started to work
>>  - server monitoring wasn't sending any data, so I have no graphs
>>  - nothing interesting in logs
>> 
>> I will send more info when i get some.
>
>Somebody else reported a problem on the upstream patches as well.  Any
>chance you can confirm the stacks of the active but not running tasks?



Unfortunately I don't have any stacks, but I will try to capture some next time.
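
For next time, a minimal sketch of one way to grab those stacks. It assumes the stuck tasks sit in uninterruptible sleep (state 'D'), which the high load with idle CPUs suggests but does not prove, and that the kernel exposes /proc/<pid>/stack (needs root and CONFIG_STACKTRACE). The function names are made up for illustration, and only thread group leaders are scanned, not every thread.

```python
import os


def parse_state(stat_line):
    """Return the one-letter task state from a /proc/<pid>/stat line."""
    # The comm field is wrapped in parentheses and may itself contain
    # spaces or ')', so take the first field after the LAST ')'.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    return rest[0]


def blocked_task_stacks(proc="/proc"):
    """Yield (pid, comm, stack) for tasks in uninterruptible sleep ('D')."""
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue  # not a PID directory
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                stat = f.read()
            if parse_state(stat) != "D":
                continue
            comm = stat[stat.index("(") + 1:stat.rindex(")")]
            # Assumption: readable only as root, with CONFIG_STACKTRACE.
            with open(os.path.join(proc, entry, "stack")) as f:
                stack = f.read()
        except (OSError, ValueError):
            continue  # task exited in the meantime, or no permission
        yield int(entry), comm, stack


if __name__ == "__main__":
    for pid, comm, stack in blocked_task_stacks():
        print(f"--- {pid} ({comm})")
        print(stack)
```

If the shell is too slow to run a script, 'echo w > /proc/sysrq-trigger' (with sysrq enabled) dumps the blocked tasks to the kernel log instead.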



>It sounds like they are stuck on a waitqueue, the question is which
>one.  I forgot to disable OOM for __GFP_NOFAIL allocations, so they
>could succeed and leak an OOM context.  task structs are not
>reinitialized between alloc & free so a different task could later try
>to oom trylock a memcg that has been freed, fail, and wait
>indefinitely on the OOM waitqueue.  There might be a simpler
>explanation but I can't think of anything right now.
>
>But the OOM context is definitely being leaked, so please apply the
>following for your next reboot:


It's installed, thank you!

azur