linux-kernel - Re: Memory overcommit


Message-ID: <4AE846E8.1070303@gmail.com>
Date:	Wed, 28 Oct 2009 14:28:08 +0100
From:	Vedran Furač <vedran.furac@...il.com>
To:	David Rientjes <rientjes@...gle.com>
CC:	Hugh Dickins <hugh.dickins@...cali.co.uk>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	linux-mm@...ck.org, linux-kernel@...r.kernel.org,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	minchan.kim@...il.com, Andrew Morton <akpm@...ux-foundation.org>,
	Andrea Arcangeli <aarcange@...hat.com>
Subject: Re: Memory overcommit

David Rientjes wrote:

> On Wed, 28 Oct 2009, Vedran Furac wrote:
> 
>>> This is wrong; it doesn't "emulate oom" since oom_kill_process() always 
>>> kills a child of the selected process instead if they do not share the 
>>> same memory.  The chosen task in that case is untouched.
>> OK, I stand corrected then. Thanks! But, while testing this I lost X
>> once again and "test" survived for some time (check the timestamps):
>>
>> http://pastebin.com/d5c9d026e
>>
>> - It started by killing gkrellm(!!!)
>> - Then I lost X (kdeinit4 I guess)
>> - Then 103 seconds after the killing started, it killed "test" - the
>> real culprit.
>>
>> I mean... how?!
>>
> 
> Here are the five oom kills that occurred in your log, and notice that the 
> first four times it kills a child and not the actual task as I explained:

Yes, but all four times it killed the wrong process.

> Those are practically happening simultaneously with very little memory 
> being available between each oom kill.  Only later is "test" killed:
> 
> [97240.203228] Out of memory: kill process 5005 (test) score 256912 or a child
> [97240.206832] Killed process 5005 (test)
> 
> Notice how the badness score is less than 1/4th of the others.  So while 
> you may find it to be hogging a lot of memory, there were others that 
> consumed much more.
^^^^^^^^^^^^^^^^^^^^^

This is just wrong. I have 3.5GB of RAM, and free says that 2GB are free
(ignoring cache). The culprit then allocates all of the free memory (2GB).
That means it is using *more* than all other processes *together*. There
cannot be any others "that consumed much more".

> You can get a more detailed understanding of this by doing
> 
> 	echo 1 > /proc/sys/vm/oom_dump_tasks
> 
> before trying your testcase; it will show various information like the 
> total_vm

Looking at total_vm (VIRT in top, vsize in ps?) is completely wrong. If I
summed those numbers for every running process, I would get:

% ps -eo pid,vsize,command | awk '{ SUM += $2 } END { print SUM/1024/1024 }'
14.7935

That is more than 14GB, and I only have 3.5GB of RAM. I usually use exmap
to get realistic numbers:

http://www.berthels.co.uk/exmap/doc.html
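
For a quick comparison without exmap, the ps one-liner above can be run
once for virtual size and once for resident size. This is only a rough
sketch: sum_gib is a made-up helper name, vsz is used as the short
spelling of vsize, and rss still counts shared pages once per process, so
the rss total overestimates real usage too (just far less than vsize).

```shell
# Sum one numeric ps column (reported in KiB) over all processes and
# print the total in GiB. sum_gib is a hypothetical helper name.
sum_gib() {
    # "-o COLUMN=" prints the column without a header line.
    ps -eo "$1=" | awk '{ sum += $1 } END { printf "%.2f\n", sum / 1024 / 1024 }'
}

echo "total vsize: $(sum_gib vsz) GiB"
echo "total rss:   $(sum_gib rss) GiB"
```

The gap between the two totals shows how much of the virtual size is
address space that was never actually backed by memory.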

> and oom_adj value for each task at the time of oom (and the 
> actual badness score is exported per-task via /proc/pid/oom_score in 
> real-time).  This will also include the rss and show what the end result 
> would be in using that value as part of the heuristic on this particular 
> workload compared to the current implementation.

Thanks, I'll try that... but I guess that using rss would yield better
results.
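
To see what the kernel would pick right now, the per-task score mentioned
above can be read from /proc/<pid>/oom_score. A minimal sketch, assuming
/proc/<pid>/comm is available for the process name (it is not present on
older kernels); top_oom_scores is a made-up helper name, and the scale of
the score may differ between kernel versions from the values quoted in
this thread.

```shell
# Print "score pid name" for each process, highest oom_score first.
# top_oom_scores is a hypothetical helper; $1 is the number of rows.
top_oom_scores() {
    for dir in /proc/[0-9]*; do
        # A process may exit while we iterate, so skip unreadable entries.
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null) || continue
        printf '%s %s %s\n' "$score" "${dir#/proc/}" "$name"
    done | sort -rn | head -n "${1:-10}"
}

top_oom_scores 10
```

Running this while the testcase is allocating shows whether "test" is
near the top of the list or not.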


Regards,

Vedran