linux-kernel - Re: [PATCH v3 0/2] memcontrol: support cgroup level OOM protection

Message-ID: <66F9BB37-3BE1-4B0F-8DE1-97085AF4BED2@didiglobal.com>
Date:   Mon, 8 May 2023 09:08:25 +0000
From:   程垲涛 Chengkaitao Cheng 
        <chengkaitao@...iglobal.com>
To:     Michal Hocko <mhocko@...e.com>
CC:     "tj@...nel.org" <tj@...nel.org>,
        "lizefan.x@...edance.com" <lizefan.x@...edance.com>,
        "hannes@...xchg.org" <hannes@...xchg.org>,
        "corbet@....net" <corbet@....net>,
        "roman.gushchin@...ux.dev" <roman.gushchin@...ux.dev>,
        "shakeelb@...gle.com" <shakeelb@...gle.com>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "brauner@...nel.org" <brauner@...nel.org>,
        "muchun.song@...ux.dev" <muchun.song@...ux.dev>,
        "viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
        "zhengqi.arch@...edance.com" <zhengqi.arch@...edance.com>,
        "ebiederm@...ssion.com" <ebiederm@...ssion.com>,
        "Liam.Howlett@...cle.com" <Liam.Howlett@...cle.com>,
        "chengzhihao1@...wei.com" <chengzhihao1@...wei.com>,
        "pilgrimtao@...il.com" <pilgrimtao@...il.com>,
        "haolee.swjtu@...il.com" <haolee.swjtu@...il.com>,
        "yuzhao@...gle.com" <yuzhao@...gle.com>,
        "willy@...radead.org" <willy@...radead.org>,
        "vasily.averin@...ux.dev" <vasily.averin@...ux.dev>,
        "vbabka@...e.cz" <vbabka@...e.cz>,
        "surenb@...gle.com" <surenb@...gle.com>,
        "sfr@...b.auug.org.au" <sfr@...b.auug.org.au>,
        "mcgrof@...nel.org" <mcgrof@...nel.org>,
        "sujiaxun@...ontech.com" <sujiaxun@...ontech.com>,
        "feng.tang@...el.com" <feng.tang@...el.com>,
        "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [PATCH v3 0/2] memcontrol: support cgroup level OOM protection

At 2023-05-07 18:11:58, "Michal Hocko" <mhocko@...e.com> wrote:
>On Sat 06-05-23 19:49:46, chengkaitao wrote:
>> Establish a new OOM score algorithm that supports a cgroup-level OOM
>> protection mechanism. When a global/memcg OOM event occurs, we treat
>> all processes in the cgroup as a whole, and the OOM killer needs to select
>> the process to kill based on the protection quota of the cgroup.
>
>Although your patch 1 briefly touches on some advantages of this
>interface there is a lack of actual usecase. Arguing that oom_score_adj
>is hard because it needs a parent process is rather weak to be honest.
>It is just trivial to create a thin wrapper, use systemd to launch
>important services or simply update the value after the fact. Now
>oom_score_adj has its own downsides of course (most notably a
>granularity and a lack of group protection).
>
>That being said, make sure you describe your usecase more thoroughly.
>Please also make sure you describe the intended heuristic of the knob.
>It is not really clear from the description how this fits hierarchical
>behavior of cgroups. I would be especially interested in the semantics
>of non-leaf memcgs protection as they do not have any actual processes
>to protect.
>
>Also there have been concerns mentioned in v2 discussion and it would be
>really appreciated to summarize how you have dealt with them.
>
>Please also note that many people are going to be slow in responding
>this week because of LSFMM conference
>(https://events.linuxfoundation.org/lsfmm/)

Here is a more detailed comparison of the old oom_score_adj mechanism and
the new oom_protect mechanism:
1. oom_protect can be tuned at a much finer granularity than oom_score_adj.
On a physical machine with 512G of memory, one step of oom_score_adj
corresponds to about 512M (one thousandth of total memory), whereas the
minimum step of oom_protect is one page (4K).
2. It may be simple to create a lightweight parent process and uniformly set
the oom_score_adj of a few important processes, but it is not simple to
maintain multi-level settings for tens of thousands of processes on a
physical machine through such lightweight parent processes. We may need a
huge table to record the oom_score_adj values maintained by all of the
lightweight parent processes, and a user process constrained by its parent
cannot meaningfully change its own oom_score_adj, because it does not know
the contents of that table. The new patch uses the cgroup mechanism instead.
It does not need any parent process to manage oom_score_adj. The settings of
each memcg are independent of each other, which makes it easier to plan the
OOM order of all processes. Because of the unique nature of memory
resources, cloud service vendors currently do not oversell memory in their
capacity planning. I would like to use the new patch to make overselling
memory resources possible.
3. I ran a test in which I deployed an excessive number of containers on a
physical machine. When dockerinit sets the oom_score_adj of all processes
in a container to a positive number, even processes that occupy very little
memory in the container are easily killed, which results in a large number
of ineffective kills. If dockerinit itself happens to be killed, it
triggers container self-healing and the container is rebuilt, which
results in even more severe memory oscillations. The new patch abandons
the approach of adding an equal amount of oom_score_adj to each process in
the container and adopts an oom_protect quota shared by all processes in
the container. If a process in the container is killed, the remaining
processes receive more of the oom_protect quota, which makes them harder
to kill. In my test case, the new patch reduced the number of ineffective
kills by 70%.
4. oom_score_adj is a global setting, so it cannot express a kill order that
affects only the OOM killer of a particular memcg. The oom_protect mechanism,
however, is inherited downwards: a user can only change the kill order for
OOMs of their own memcg, while the kill order of the parent memcg's OOM
killer and of the global OOM killer is not affected.
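
As a rough check of the granularity figures in point 1, the arithmetic can
be sketched as follows. This is a minimal sketch, assuming 4K pages, no
swap, and that one unit of oom_score_adj shifts the badness score by
totalpages / 1000 pages; the variable names are illustrative only and do
not come from the patch.

```python
# Rough arithmetic behind the granularity comparison in point 1.
# Assumptions (not from the patch): 4 KiB pages, no swap, and one unit of
# oom_score_adj being worth totalpages / 1000 pages.
PAGE_SIZE = 4096                      # bytes per page
TOTAL_BYTES = 512 * 2**30             # 512G of physical memory

total_pages = TOTAL_BYTES // PAGE_SIZE

# One step of oom_score_adj, in pages and in bytes.
adj_step_pages = total_pages // 1000
adj_step_bytes = adj_step_pages * PAGE_SIZE

# One step of oom_protect is a single page.
protect_step_bytes = PAGE_SIZE

print(f"oom_score_adj step: {adj_step_bytes / 2**20:.0f} MiB")
print(f"oom_protect step:   {protect_step_bytes // 1024} KiB")
print(f"ratio:              {adj_step_bytes // protect_step_bytes}x")
```

Under these assumptions one step of oom_score_adj is worth roughly half a
gigabyte, which is five orders of magnitude coarser than a single page.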

At the end of the v2 discussion, it was pointed out that although the
adjustment range of oom_score_adj is [-1000,1000], it essentially only
supports two use cases reliably (OOM_SCORE_ADJ_MIN and OOM_SCORE_ADJ_MAX).
Everything in between is clumsy at best. To address this in the new patch,
I introduced a new indicator, oom_kill_inherit, which counts the number of
times the local cgroup and its child cgroups have been selected by the OOM
killer of an ancestor cgroup. By observing a cgroup's share of the
oom_kill_inherit count in its parent cgroup, I can effectively tune the
value of oom_protect for the best result.

About the semantics of non-leaf memcg protection:
If the oom_protect quota of a non-leaf memcg is set, each of its leaf memcgs
calculates its new effective oom_protect quota in proportion, based on the
non-leaf memcg's quota.
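
One way to picture the proportional calculation is the sketch below. It is
only an illustration under stated assumptions, not the algorithm of the
patch: it assumes that a parent's effective quota is split among its
children in proportion to the oom_protect values the children request, and
that a child never gets more than it asked for. The function name
effective_protect and the example numbers are hypothetical.

```python
def effective_protect(parent_effective, child_requests):
    """Split a parent's effective oom_protect quota among its children.

    Hypothetical model: each child gets a share proportional to the
    oom_protect value it requested, capped at that requested value.
    All quantities are in pages.
    """
    total_requested = sum(child_requests.values())
    if total_requested == 0:
        return {name: 0 for name in child_requests}

    result = {}
    for name, requested in child_requests.items():
        share = parent_effective * requested // total_requested
        result[name] = min(requested, share)
    return result


# A non-leaf memcg with an effective quota of 1000 pages and two leaf
# memcgs that together request more than the parent can offer.
leaves = effective_protect(1000, {"leaf_a": 1500, "leaf_b": 500})
print(leaves)
```

In this model the leaves are scaled down together when they request more
than the non-leaf memcg's quota, and keep their own values when they
request less.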

-- 
Thanks for your comment!
chengkaitao