Date:   Fri, 19 Aug 2016 10:57:48 -0700
From:   Sonny Rao <sonnyrao@...omium.org>
To:     Michal Hocko <mhocko@...nel.org>
Cc:     Jann Horn <jann@...jh.net>,
        Robert Foss <robert.foss@...labora.com>,
        Jonathan Corbet <corbet@....net>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Vlastimil Babka <vbabka@...e.cz>,
        Konstantin Khlebnikov <koct9i@...il.com>,
        Hugh Dickins <hughd@...gle.com>,
        Naoya Horiguchi <n-horiguchi@...jp.nec.com>,
        Minchan Kim <minchan@...nel.org>,
        John Stultz <john.stultz@...aro.org>,
        ross.zwisler@...ux.intel.com, jmarchan@...hat.com,
        Johannes Weiner <hannes@...xchg.org>,
        Kees Cook <keescook@...omium.org>,
        Al Viro <viro@...iv.linux.org.uk>,
        Cyrill Gorcunov <gorcunov@...nvz.org>,
        Robin Humble <plaguedbypenguins@...il.com>,
        David Rientjes <rientjes@...gle.com>,
        eric.engestrom@...tec.com, Janis Danisevskis <jdanis@...gle.com>,
        calvinowens@...com, Alexey Dobriyan <adobriyan@...il.com>,
        "Kirill A. Shutemov" <kirill.shutemov@...ux.intel.com>,
        ldufour@...ux.vnet.ibm.com, linux-doc@...r.kernel.org,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Ben Zhang <benzh@...omium.org>,
        Bryan Freed <bfreed@...omium.org>,
        Filipe Brandenburger <filbranden@...omium.org>,
        Mateusz Guzik <mguzik@...hat.com>
Subject: Re: [PATCH v2 0/3] Implement /proc/<pid>/totmaps

On Fri, Aug 19, 2016 at 12:59 AM, Michal Hocko <mhocko@...nel.org> wrote:
> On Thu 18-08-16 23:43:39, Sonny Rao wrote:
>> On Thu, Aug 18, 2016 at 11:01 AM, Michal Hocko <mhocko@...nel.org> wrote:
>> > On Thu 18-08-16 10:47:57, Sonny Rao wrote:
>> >> On Thu, Aug 18, 2016 at 12:44 AM, Michal Hocko <mhocko@...nel.org> wrote:
>> >> > On Wed 17-08-16 11:57:56, Sonny Rao wrote:
>> > [...]
>> >> >> 2) User space OOM handling -- we'd rather do a more graceful
>> >> >> shutdown than let the kernel's OOM killer activate. We need to
>> >> >> gather this information to make that decision, and we'd like to
>> >> >> be able to get it much faster than 400ms.
>> >> >
>> >> > Global OOM handling in userspace is really dubious if you ask me. I
>> >> > understand you want something better than SIGKILL, and in fact this
>> >> > is already possible with the memory cgroup controller (btw. memcg
>> >> > will give you cheap access to rss and to the amount of shared and
>> >> > swapped-out memory as well). Anyway, if you are getting close to
>> >> > OOM, your system will most probably be really busy, and chances are
>> >> > that reading your new file will also take much more time. I am also
>> >> > not quite sure how pss is useful for oom decisions.
>> >>
>> >> I mentioned it before, but based on experience RSS just isn't good
>> >> enough -- there's too much sharing going on in our use case to make
>> >> the correct decision based on RSS.  If RSS were good enough, simply
>> >> put, this patch wouldn't exist.
>> >
>> > But that doesn't answer my question, I am afraid. So how exactly do you
>> > use pss for oom decisions?
>>
>> We use PSS to calculate the memory used by a process among all the
>> processes in the system, in the case of Chrome this tells us how much
>> each renderer process (which is roughly tied to a particular "tab" in
>> Chrome) is using and how much it has swapped out, so we know what the
>> worst offenders are -- I'm not sure what's unclear about that?
>
> So let me ask more specifically. How can you make any decision based on
> the pss when you do not know _what_ the shared resource is? In other
> words, if you select a task to terminate based on the pss, then you have
> to kill the others that share the same resource; otherwise you do not
> release that shared resource. Not to mention that such a shared resource
> might be on tmpfs/shmem, where it won't get released even after all the
> processes which map it are gone.

OK, I see why you're confused now, sorry.

In our case we do know, in general, what is being shared, because the
sharing is mostly between the processes we're looking at rather than
with other random processes or tmpfs. So PSS gives us useful data in
the context of these processes, especially for monitoring across the
set of renderer processes.

We also use the private clean, private dirty, and swap fields to build
a few metrics for the processes, charging each process for its private,
shared, and swap data. Private_Clean and Private_Dirty give us an
estimated lower bound on how much memory would be freed by killing the
process; Swap and PSS give us some indication of additional memory that
might get freed up.
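
To make that concrete, here's a minimal sketch of the kind of walk we
do over /proc/<pid>/smaps (illustrative only, not our actual code; the
struct and function names are made up):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct mem_totals {
    unsigned long pss_kb;
    unsigned long private_clean_kb;
    unsigned long private_dirty_kb;
    unsigned long swap_kb;
};

static int read_smaps_totals(int pid, struct mem_totals *t)
{
    char path[64], line[256];
    unsigned long val;
    FILE *f;

    memset(t, 0, sizeof(*t));
    snprintf(path, sizeof(path), "/proc/%d/smaps", pid);
    f = fopen(path, "r");
    if (!f)
        return -1;

    /* Each mapping repeats these "Key:   N kB" lines; sum across all
       mappings to get per-process totals (values are in kB). */
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "Pss: %lu kB", &val) == 1)
            t->pss_kb += val;
        else if (sscanf(line, "Private_Clean: %lu kB", &val) == 1)
            t->private_clean_kb += val;
        else if (sscanf(line, "Private_Dirty: %lu kB", &val) == 1)
            t->private_dirty_kb += val;
        else if (sscanf(line, "Swap: %lu kB", &val) == 1)
            t->swap_kb += val;
    }
    fclose(f);
    return 0;
}

int main(int argc, char **argv)
{
    struct mem_totals t;

    if (argc < 2 || read_smaps_totals(atoi(argv[1]), &t) < 0)
        return 1;
    printf("Pss %lu kB, Private_Clean %lu kB, Private_Dirty %lu kB, Swap %lu kB\n",
           t.pss_kb, t.private_clean_kb, t.private_dirty_kb, t.swap_kb);
    return 0;
}

This per-mapping scan is exactly the part that gets expensive on a
process with many mappings, which is what totmaps is meant to collapse
into a single read.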

>
> I am sorry for being dense, but it is still not clear to me how the
> single pss number can be used for oom or, in general, for any serious
> decisions. The counter might of course be useful for debugging purposes
> or for a general overview, but then arguing about 40 vs 20ms sounds a
> bit strange to me.

Yeah, so it's more than just the single PSS number: PSS, Private_Clean,
Private_Dirty, and Swap are all interesting numbers for making these
decisions.
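
As a rough illustration of how those numbers might combine (the
weighting below is purely hypothetical, not something we ship; it
reuses struct mem_totals from the earlier snippet):

/* Private_Clean + Private_Dirty is a lower bound on what killing the
   process frees immediately; Pss + Swap approximates its overall
   share of memory. */
static unsigned long victim_score_kb(const struct mem_totals *t)
{
    unsigned long lower_bound = t->private_clean_kb + t->private_dirty_kb;
    unsigned long footprint = t->pss_kb + t->swap_kb;

    /* Weight the guaranteed reclaim above the proportional share. */
    return 2 * lower_bound + footprint;
}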

>
>> Chrome tends to use a lot of shared memory, so we found PSS to be
>> better than RSS, and I can give you examples of RSS and PSS on real
>> systems to illustrate the magnitude of the difference between those
>> two numbers, if that would be useful.
>>
>> >
>> >> So even with memcg I think we'd have the same problem?
>> >
>> > memcg will give you instant anon and shared counters for all the
>> > processes in the memcg.
>> >
>>
>> We want to be able to get per-process granularity quickly.  I'm not
>> sure if memcg provides that exactly?
>
> It will give you that information if you do process-per-memcg, but that
> doesn't sound ideal. I thought those 20-something processes you were
> talking about were treated together, but it seems I misunderstood.
> --
> Michal Hocko
> SUSE Labs
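
(For reference, reading the group-level counters mentioned above is just
a file read; a sketch, assuming a cgroup v1 memory controller and an
illustrative group path:)

#include <stdio.h>

/* Dump the memcg-wide counters for one group.  memory.stat contains
   lines like "rss <bytes>", "cache <bytes>", "swap <bytes>" (swap only
   with swap accounting enabled); the group path is just an example. */
static void dump_memcg_stat(const char *group)
{
    char path[256], line[128];
    FILE *f;

    snprintf(path, sizeof(path), "%s/memory.stat", group);
    f = fopen(path, "r");
    if (!f)
        return;
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
}

/* e.g. dump_memcg_stat("/sys/fs/cgroup/memory/chrome"); */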
