Message-ID: <alpine.DEB.2.10.1505102017170.14971@chino.kir.corp.google.com>
Date:	Sun, 10 May 2015 20:25:47 -0700 (PDT)
From:	David Rientjes <rientjes@...gle.com>
To:	Yogesh Narayan Gaur <yn.gaur@...sung.com>
cc:	akpm@...ux-foundation.org, linux-kernel@...r.kernel.org,
	"ajeet.y@...sung.com" <ajeet.y@...sung.com>, amit.arora@...sung.com
Subject: Re: [EDT] oom_killer: find bulkiest task based on pss value

On Fri, 8 May 2015, Yogesh Narayan Gaur wrote:

> Presently, oom_kill.c calculates the badness score of the victim task from the task's current RSS counter value.
> The RSS counter for a task is essentially '[Private (Dirty/Clean)] + [Shared (Dirty/Clean)]'.
> We have encountered a situation where the Private portion is small but the Shared portion is large, making the total RSS counter large. In an OOM situation the task with the highest RSS value is then killed, but because its Private portion is small, the memory gained by killing it falls well short of expectations.
> 
> For example, take the following use case, in which 3 processes are running in the system.
> Each process mmaps a file in the current directory and then copies data from it into locally allocated buffers in a while(1) loop with some sleep. Two of the processes map the file with MAP_SHARED and one maps it with MAP_PRIVATE.
> All 3 processes run in the background, and I check their RSS/PSS values with a user-space utility (built on cat /proc/pid/smaps).
> Before OOM, the memory consumption of the 3 processes is as follows (all run with oom_score_adj = 0):
> ====================================================
> Comm : 1prg,  Pid : 213 (values in kB)
>                  Rss     Shared    Private        Pss
>   Process :   375764     194596     181168     278460
> ====================================================
> Comm : 3prg,  Pid : 217 (values in kB)
>                  Rss     Shared    Private        Pss
>   Process :   305760         32     305728     305738
> ====================================================
> Comm : 2prg,  Pid : 218 (values in kB)
>                  Rss     Shared    Private        Pss
>   Process :   389980     194596     195384     292676
> ====================================================
> 
> With the present design, process [2prg : 218] would be selected as the bulkiest process to kill, since its RSS value is the highest. But killing it frees only ~195MB, as compared to the expected ~389MB.
> Thus identifying the victim task by RSS value is not an accurate design, and killing the process identified that way does not release the expected amount of memory back to the system.
> 
> We need to select the victim task based on PSS instead of RSS, since PSS is calculated as
> PSS value = [Private (Dirty/Clean)] + [Shared (Dirty/Clean) / no. of tasks sharing]
> In the above scenario, process [3prg : 217] has the largest PSS value, and killing it gains the maximum memory (~305MB), as compared to killing the process identified by RSS value.
> 

The oom killer doesn't expect to necessarily be able to free all memory 
that is represented by the rss of a process.  In fact, after it selects a 
process it will happily kill a child process in favor of its parent if 
they don't share the same memory.

There are a few problems with using pss in the proposed patch that 
follows:

 - it's less predictable since it depends on the number of times the 
   memory is mapped, which may change during the process's lifetime,

 - it requires taking mm->mmap_sem, which cannot be done unconditionally
   because it may already be held; falling back to rss in situations
   where the trylock fails makes it even less predictable and reliable,
   and

 - all users who currently tune /proc/pid/oom_score_adj or
   /proc/pid/oom_adj are doing so based on the current heuristic, which
   is rss; if we switched to pss and all of a process's memory is
   shared, then their oom_score_adj or oom_adj would be severely broken
   (and, as a result of the first problem above, defining a meaningful
   oom_score_adj becomes nearly impossible).

We don't have the expectation of freeing the entire rss; the best we can 
do is use a heuristic that is reliable, consistent, and cheap to check.  
We can then ask users who want a process to have a different oom kill 
priority to use oom_score_adj, and they can do so in a reliable way, 
without the fallback behavior that your trylock introduces.
