linux-kernel - Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps

Message-id: <bcbdc005-5114-9a6d-d304-81ac037012f2@samsung.com>
Date:   Wed, 24 Aug 2016 12:14:06 +0200
From:   Marcin Jabrzyk <m.jabrzyk@...sung.com>
To:     Sonny Rao <sonnyrao@...omium.org>, Michal Hocko <mhocko@...nel.org>
Cc:     Jann Horn <jann@...jh.net>,
        Robert Foss <robert.foss@...labora.com>,
        Jonathan Corbet <corbet@....net>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Konstantin Khlebnikov <koct9i@...il.com>,
        Hugh Dickins <hughd@...gle.com>,
        Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
        Minchan Kim <minchan@...nel.org>,
        John Stultz <john.stultz@...aro.org>,
        ross.zwisler@...ux.intel.com, jmarchan@...hat.com,
        Johannes Weiner <hannes@...xchg.org>,
        Kees Cook <keescook@...omium.org>,
        Al Viro <viro@...iv.linux.org.uk>,
        Cyrill Gorcunov <gorcunov@...nvz.org>,
        Robin Humble <plaguedbypenguins@...il.com>,
        David Rientjes <rientjes@...gle.com>,
        eric.engestrom@...tec.com, Janis Danisevskis <jdanis@...gle.com>,
        calvinowens@...com, Alexey Dobriyan <adobriyan@...il.com>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        ldufour@...ux.vnet.ibm.com, linux-doc@...r.kernel.org,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Ben Zhang <benzh@...omium.org>,
        Bryan Freed <bfreed@...omium.org>,
        Filipe Brandenburger <filbranden@...omium.org>,
        Mateusz Guzik <mguzik@...hat.com>
Subject: Re: [PACTH v2 0/3] Implement /proc/<pid>/totmaps



On 23/08/16 00:44, Sonny Rao wrote:
> On Mon, Aug 22, 2016 at 12:54 AM, Michal Hocko <mhocko@...nel.org> wrote:
>> On Fri 19-08-16 10:57:48, Sonny Rao wrote:
>>> On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko <mhocko@...nel.org> wrote:
>>>> On Thu 18-08-16 23:43:39, Sonny Rao wrote:
>>>>> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko <mhocko@...nel.org> wrote:
>>>>>> On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>>>>>>> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@...nel.org> wrote:
>>>>>>>> On Wed 17-08-16 11:57:56, Sonny Rao wrote:
>>>>>> [...]
>>>>>>>>> 2) User space OOM handling -- we'd rather do a more graceful shutdown
>>>>>>>>> than let the kernel's OOM killer activate and need to gather this
>>>>>>>>> information and we'd like to be able to get this information to make
>>>>>>>>> the decision much faster than 400ms
>>>>>>>>
>>>>>>>> Global OOM handling in userspace is really dubious if you ask me. I
>>>>>>>> understand you want something better than SIGKILL and in fact this is
>>>>>>>> already possible with memory cgroup controller (btw. memcg will give
>>>>>>>> you a cheap access to rss, amount of shared, swapped out memory as
>>>>>>>> well). Anyway if you are getting close to the OOM your system will most
>>>>>>>> probably be really busy and chances are that also reading your new file
>>>>>>>> will take much more time. I am also not quite sure how is pss useful for
>>>>>>>> oom decisions.
>>>>>>>
>>>>>>> I mentioned it before, but based on experience RSS just isn't good
>>>>>>> enough -- there's too much sharing going on in our use case to make
>>>>>>> the correct decision based on RSS.  If RSS were good enough, simply
>>>>>>> put, this patch wouldn't exist.
>>>>>>
>>>>>> But that doesn't answer my question, I am afraid. So how exactly do you
>>>>>> use pss for oom decisions?
>>>>>
>>>>> We use PSS to calculate the memory used by a process among all the
>>>>> processes in the system, in the case of Chrome this tells us how much
>>>>> each renderer process (which is roughly tied to a particular "tab" in
>>>>> Chrome) is using and how much it has swapped out, so we know what the
>>>>> worst offenders are -- I'm not sure what's unclear about that?
>>>>
>>>> So let me ask more specifically. How can you make any decision based on
>>>> the pss when you do not know _what_ is the shared resource. In other
>>>> words if you select a task to terminate based on the pss then you have to
>>>> kill others who share the same resource otherwise you do not release
>>>> that shared resource. Not to mention that such a shared resource might
>>>> be on tmpfs/shmem and it won't get released even after all processes
>>>> which map it are gone.
>>>
>>> Ok I see why you're confused now, sorry.
>>>
>>> In our case we do know what is being shared in general because
>>> the sharing is mostly between those processes that we're looking at
>>> and not other random processes or tmpfs, so PSS gives us useful data
>>> in the context of these processes which are sharing the data
>>> especially for monitoring between the set of these renderer processes.
>>
>> OK, I see and agree that pss might be useful when you _know_ what is
>> shared. But this sounds quite specific to a particular workload. How
>> many users are in a similar situation? In other words, if we present
>> a single number without the context, how useful will it be in
>> general? Is it possible that presenting such a number could be even
>> misleading for somebody who doesn't have an idea which resources are
>> shared? These are all questions which should be answered before we
>> actually add this number (be it a new/existing proc file or a syscall).
>> I still believe that the number without wider context is just not all
>> that useful.
>
>
> I see the specific point about  PSS -- because you need to know what
> is being shared or otherwise use it in a whole system context, but I
> still think the whole system context is a valid and generally useful
> thing.  But what about the private_clean and private_dirty?  Surely
> those are more generally useful for calculating a lower bound on
> process memory usage without additional knowledge?
>
> At the end of the day all of these metrics are approximations, and it
> comes down to how far off the various approximations are and what
> trade offs we are willing to make.
> RSS is the cheapest but the most coarse.
>
> PSS (with the correct context) and Private data plus swap are much
> better but also more expensive due to the PT walk.
> As far as I know, to get anything but RSS we have to go through smaps
> or use memcg.  Swap seems to be available in /proc/<pid>/status.
>
> I looked at the "shared" value in /proc/<pid>/statm but it doesn't
> seem to correlate well with the shared value in smaps -- not sure why?
>
> It might be useful to show the magnitude of difference of using RSS vs
> PSS/Private in the case of the Chrome renderer processes.  On the
> system I was looking at there were about 40 of these processes, but I
> picked a few to give an idea:
>
> localhost ~ # cat /proc/21550/totmaps
> Rss:               98972 kB
> Pss:               54717 kB
> Shared_Clean:      19020 kB
> Shared_Dirty:      26352 kB
> Private_Clean:         0 kB
> Private_Dirty:     53600 kB
> Referenced:        92184 kB
> Anonymous:         46524 kB
> AnonHugePages:     24576 kB
> Swap:              13148 kB
>
>
> RSS is 80% higher than PSS and 84% higher than private data
>
> localhost ~ # cat /proc/21470/totmaps
> Rss:              118420 kB
> Pss:               70938 kB
> Shared_Clean:      22212 kB
> Shared_Dirty:      26520 kB
> Private_Clean:         0 kB
> Private_Dirty:     69688 kB
> Referenced:       111500 kB
> Anonymous:         79928 kB
> AnonHugePages:     24576 kB
> Swap:              12964 kB
>
> RSS is 66% higher than PSS and 69% higher than private data
>
> localhost ~ # cat /proc/21435/totmaps
> Rss:               97156 kB
> Pss:               50044 kB
> Shared_Clean:      21920 kB
> Shared_Dirty:      26400 kB
> Private_Clean:         0 kB
> Private_Dirty:     48836 kB
> Referenced:        90012 kB
> Anonymous:         75228 kB
> AnonHugePages:     24576 kB
> Swap:              13064 kB
>
> RSS is 94% higher than PSS and 98% higher than private data.
>
> It looks like there's a set of about 40MB of shared pages which cause
> the difference in this case.
> Swap was roughly even on these but I don't think it's always going to be true.
>
>

Sorry to hijack the thread, but I found it only recently and I think
it's the best place to present our point of view.
We are working on a custom Linux-based OS and we have also suffered a
lot from the /proc/<pid>/smaps file. As with Chrome, we tried to improve
our internal application memory management policies (Low Memory Killer)
using data provided by smaps, but we failed because reading and properly
parsing the file takes far too long.

We've also observed that RSS is often much higher than PSS, which seems
to reflect the real memory usage of a process more closely. Using smaps
we would be able to calculate USS and so know the exact minimum amount of
memory that would be freed after terminating a given process. Those are
very important sources of information, as they give us the possibility to
provide the best possible app life-cycle.
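
A minimal sketch of the USS/PSS calculation described above, assuming
smaps-style "Key: N kB" lines. The function names and the sample numbers
are hypothetical and only illustrate the arithmetic; this is not taken
from any existing tool:

```python
import re

# Matches smaps-style value lines such as "Private_Dirty:     53600 kB".
# Mapping header lines and "VmFlags:" lines do not match and are skipped.
FIELD_RE = re.compile(r"^(\w+):\s+(\d+) kB$")

def sum_smaps_fields(smaps_text):
    """Sum every 'Key: N kB' field over all mappings in smaps-style text."""
    totals = {}
    for line in smaps_text.splitlines():
        m = FIELD_RE.match(line.strip())
        if m:
            key, kb = m.group(1), int(m.group(2))
            totals[key] = totals.get(key, 0) + kb
    return totals

def uss_kb(totals):
    """USS: pages mapped only by this process (private clean + dirty)."""
    return totals.get("Private_Clean", 0) + totals.get("Private_Dirty", 0)

# Made-up two-mapping sample, only to exercise the functions.
SAMPLE = """\
00400000-0048a000 r-xp 00000000 fd:03 960637 /bin/example
Rss:                 100 kB
Pss:                  80 kB
Shared_Clean:         40 kB
Shared_Dirty:          0 kB
Private_Clean:        60 kB
Private_Dirty:         0 kB
VmFlags: rd ex mr mw me
7f0000000000-7f0000021000 rw-p 00000000 00:00 0
Rss:                  32 kB
Pss:                  32 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:        32 kB
VmFlags: rd wr mr mw me
"""

totals = sum_smaps_fields(SAMPLE)
print("Rss:", totals["Rss"], "kB")
print("Pss:", totals["Pss"], "kB")
print("Uss:", uss_kb(totals), "kB")
```

On a real system the input would be the full contents of
/proc/<pid>/smaps, so every mapping has to be read and parsed in user
space just to obtain these few sums.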

We have also tried to use smaps in an application for OS developers
as a source of detailed information about the system's memory usage.
To check possible improvements, we tried totmaps from an earlier version
of the patch. In a sample case for our app, the CPU usage reported by
'top' decreased from ~60% to ~4.5% just by changing the data source from
smaps to totmaps.
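
For comparison, a sketch of consuming totmaps-style output, assuming the
"Key: N kB" format shown in the quoted message above. The function names
are hypothetical; the sample values are the ones quoted for pid 21550:

```python
def parse_totmaps(text):
    """Parse 'Key: N kB' lines into a dict of integer kB values."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def percent_higher(a, b):
    """How many percent a is above b, truncated to a whole number."""
    return (a - b) * 100 // b

# Values copied from the totmaps output quoted earlier in this thread.
TOTMAPS_21550 = """\
Rss:               98972 kB
Pss:               54717 kB
Shared_Clean:      19020 kB
Shared_Dirty:      26352 kB
Private_Clean:         0 kB
Private_Dirty:     53600 kB
Referenced:        92184 kB
Anonymous:         46524 kB
AnonHugePages:     24576 kB
Swap:              13148 kB
"""

f = parse_totmaps(TOTMAPS_21550)
private = f["Private_Clean"] + f["Private_Dirty"]
print("RSS over PSS:    ", percent_higher(f["Rss"], f["Pss"]), "%")
print("RSS over private:", percent_higher(f["Rss"], private), "%")
```

The values are already summed over all mappings, so user space parses
ten short lines per process instead of one block per mapping.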

So we are also very interested in an interface such as totmaps, as it
gives user space detailed and complete memory usage information, whereas
in our case much of the information provided by smaps is of no use to us
at all.

We are also using, or have tried using, other interfaces such as status,
statm, cgroups.memory etc., but totmaps/smaps is still the best interface
for getting all of the per-process information in a single place.

>>
>>> We also use the private clean and private dirty and swap fields to
>>> make a few metrics for the processes and charge each process for its
>>> private, shared, and swap data. Private clean and dirty are used for
>>> estimating a lower bound on how much memory would be freed.
>>
>> I can imagine that this kind of information might be useful and
>> presented in /proc/<pid>/statm. The question is whether some of the
>> existing consumers would see the performance impact due to the page table
>> walk. Anyway even these counters might get quite tricky because even
>> shareable resources are considered private if the process is the only
>> one to map them (so again this might be a file on tmpfs...).
>>
>>> Swap and
>>> PSS also give us some indication of additional memory which might get
>>> freed up.
>> --
>> Michal Hocko
>> SUSE Labs
>
>

-- 
Marcin Jabrzyk
Samsung R&D Institute Poland
Samsung Electronics