linux-kernel - Re: [v8 0/4] cgroup-aware OOM killer

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <CALvZod7iaOEeGmDJA0cZvJWpuzc-hMRn3PG2cfzcMniJtAjKqA@mail.gmail.com>
Date:   Sun, 1 Oct 2017 16:29:48 -0700
From:   Shakeel Butt <shakeelb@...gle.com>
To:     Tim Hockin <thockin@...kin.org>
Cc:     Roman Gushchin <guro@...com>, Michal Hocko <mhocko@...nel.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Tejun Heo <tj@...nel.org>, kernel-team@...com,
        David Rientjes <rientjes@...gle.com>,
        Linux MM <linux-mm@...ck.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Tetsuo Handa <penguin-kernel@...ove.sakura.ne.jp>,
        Andrew Morton <akpm@...ux-foundation.org>,
        Cgroups <cgroups@...r.kernel.org>, linux-doc@...r.kernel.org,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>
Subject: Re: [v8 0/4] cgroup-aware OOM killer

>
> Going back to Michal's example, say the user configured the following:
>
>        root
>       /    \
>      A      D
>     / \
>    B   C
>
> A global OOM event happens and we find this:
> - A > D
> - B, C, D are oomgroups
>
> What the user is telling us is that B, C, and D are compound memory
> consumers. They cannot be divided into their task parts from a memory
> point of view.
>
> However, the user doesn't say the same for A: the A subtree summarizes
> and controls aggregate consumption of B and C, but without groupoom
> set on A, the user says that A is in fact divisible into independent
> memory consumers B and C.
>
> If we don't have to kill all of A, but we'd have to kill all of D,
> does it make sense to compare the two?
>

I think Tim has given very clear explanation why comparing A & D makes
perfect sense. However I think the above example, a single user system
where a user has designed and created the whole hierarchy and then
attaches different jobs/applications to different nodes in this
hierarchy, is also a valid scenario. One solution I can think of, to
cater both scenarios, is to introduce a notion of 'bypass oom' or not
include a memcg for oom comparision and instead include its children
in the comparison.

So, in the same above example:
        root
       /       \
      A(b)    D
     /  \
    B   C

A is marked as bypass and thus B and C are to be compared to D. So,
for the single user scenario, all the internal nodes are marked
'bypass oom comparison' and oom_priority of the leaves has to be set
to the same value.

Below is the pseudo code of select_victim_memcg() based on this idea
and David's previous pseudo code. The calculation of size of a memcg
is still not very well baked here yet. I am working on it and I plan
to have a patch based on Roman's v9 "mm, oom: cgroup-aware OOM killer"
patch.


        struct mem_cgroup *memcg = root_mem_cgroup;
        struct mem_cgroup *selected_memcg = root_mem_cgroup;
        struct mem_cgroup *low_memcg;
        unsigned long low_priority;
        unsigned long prev_badness = memcg_oom_badness(memcg); // Roman's code
        LIST_HEAD(queue);

next_level:
        low_memcg = NULL;
        low_priority = ULONG_MAX;

next:
        for_each_child_of_memcg(it, memcg) {
                unsigned long prio = it->oom_priority;
                unsigned long badness = 0;

                if (it->bypass_oom && !it->oom_group &&
memcg_has_children(it)) {
                        list_add(&it->oom_queue, &queue);
                        continue;
                }

                if (prio > low_priority)
                        continue;

                if (prio == low_priority) {
                        badness = mem_cgroup_usage(it); // for
simplicity, need more thinking
                        if (badness < prev_badness)
                                continue;
                }

                low_memcg = it;
                low_priority = prio;
                prev_badness = badness ?: mem_cgroup_usage(it);  //
for simplicity
        }
        if (!list_empty(&queue)) {
                memcg = list_last_entry(&queue, struct mem_cgroup, oom_queue);
                list_del(&memcg->oom_queue);
                goto next;
        }
        if (low_memcg) {
                selected_memcg = memcg = low_memcg;
                prev_badness = 0;
                if (!low_memcg->oom_group)
                        goto next_level;
        }
        if (selected_memcg->oom_group)
                oom_kill_memcg(selected_memcg);
        else
                oom_kill_process_from_memcg(selected_memcg);