linux-kernel - Re: [PATCH v2] mm: memcontrol: protect the memory in cgroup from being oom killed

Message-ID: <395B1998-38A9-4A68-96F8-6EDF44686231@didiglobal.com>
Date:   Sat, 10 Dec 2022 09:18:39 +0000
From:   程垲涛 Chengkaitao Cheng 
        <chengkaitao@...iglobal.com>
To:     Michal Hocko <mhocko@...e.com>
CC:     chengkaitao <pilgrimtao@...il.com>,
        "tj@...nel.org" <tj@...nel.org>,
        "lizefan.x@...edance.com" <lizefan.x@...edance.com>,
        "hannes@...xchg.org" <hannes@...xchg.org>,
        "corbet@....net" <corbet@....net>,
        "roman.gushchin@...ux.dev" <roman.gushchin@...ux.dev>,
        "shakeelb@...gle.com" <shakeelb@...gle.com>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "songmuchun@...edance.com" <songmuchun@...edance.com>,
        "viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
        "zhengqi.arch@...edance.com" <zhengqi.arch@...edance.com>,
        "ebiederm@...ssion.com" <ebiederm@...ssion.com>,
        "Liam.Howlett@...cle.com" <Liam.Howlett@...cle.com>,
        "chengzhihao1@...wei.com" <chengzhihao1@...wei.com>,
        "haolee.swjtu@...il.com" <haolee.swjtu@...il.com>,
        "yuzhao@...gle.com" <yuzhao@...gle.com>,
        "willy@...radead.org" <willy@...radead.org>,
        "vasily.averin@...ux.dev" <vasily.averin@...ux.dev>,
        "vbabka@...e.cz" <vbabka@...e.cz>,
        "surenb@...gle.com" <surenb@...gle.com>,
        "sfr@...b.auug.org.au" <sfr@...b.auug.org.au>,
        "mcgrof@...nel.org" <mcgrof@...nel.org>,
        "sujiaxun@...ontech.com" <sujiaxun@...ontech.com>,
        "feng.tang@...el.com" <feng.tang@...el.com>,
        "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [PATCH v2] mm: memcontrol: protect the memory in cgroup from
 being oom killed

At 2022-12-09 16:25:37, "Michal Hocko" <mhocko@...e.com> wrote:
>On Fri 09-12-22 05:07:15, 程垲涛 Chengkaitao Cheng wrote:
>> At 2022-12-08 22:23:56, "Michal Hocko" <mhocko@...e.com> wrote:
>[...]
>> >oom killer is a memory reclaim of the last resort. So yes, there is some
>> >difference but fundamentally it is about releasing some memory. And long
>> >term we have learned that the more clever it tries to be the more likely
>> >corner cases can happen. It is simply impossible to know the best
>> >candidate so this is a just a best effort. We try to aim for
>> >predictability at least.
>> 
>> Is the current oom_score strategy predictable? I don't think so. The score_adj 
>> has broken the predictability of oom_score (it is no longer simply killing the 
>> process that uses the most mems).
>
>oom_score as reported to the userspace already considers oom_score_adj
>which means that you can compare processes and get a reasonable guess
>what would be the current oom_victim. There is a certain fuzz level
>because this is not atomic and also there is no clear candidate when
>multiple processes have equal score. 

If multiple processes have the same score, it is reasonable to kill any one 
of them. Why must we determine in advance which one it will be?
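
As a rough illustration of the comparison described above, the sketch below
reads /proc/<pid>/oom_score for every process and reports the PIDs that share
the highest score. The helper names are made up for this example, and the
result is only a racy snapshot, not a statement of what the kernel will pick.

```python
import os


def read_oom_scores(proc_root="/proc"):
    """Return {pid: oom_score} for every process that can be read."""
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return {}
    scores = {}
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "oom_score")) as f:
                scores[int(entry)] = int(f.read().strip())
        except (OSError, ValueError):
            # The process exited or is not readable; skip it.
            continue
    return scores


def likely_victims(scores):
    """Return (top_score, sorted pids sharing it), or (None, []) if empty."""
    if not scores:
        return None, []
    top = max(scores.values())
    return top, sorted(pid for pid, score in scores.items() if score == top)


if __name__ == "__main__":
    top, pids = likely_victims(read_oom_scores())
    print(f"highest oom_score={top}, candidates={pids}")
```

When several PIDs come back, they are tied, and userspace cannot tell from the
scores alone which of them would be chosen.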

> So yes, it is not 100% predictable.
>memory.reclaim as you propose doesn't change that though.
>
This scheme gives the user the power to decide the candidate. The user's 
behavior is random, and I think it is impossible to predict a random event 
with 100% certainty.

Is it really necessary to make everything 100% predictable? Just as we can't 
accurately predict which cgroup will access the page cache frequently, 
we can't accurately predict whether the memory is hot or cold. These 
strategies are fuzzy, but we can't deny their rationality.

>Is oom_score_adj a good interface? No, not really. If I could go back in
>time I would nack it but here we are. We have an interface that
>promises quite much but essentially it only allows two usecases
>(OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between
>is clumsy at best because a real user space oom policy would require to
>re-evaluate the whole oom domain (be it global or memcg oom) as the
>memory consumption evolves over time. I am really worried that your
>memory.oom.protection directs a very similar trajectory because
>protection really needs to consider other memcgs to balance properly.
>
score_adj is indeed an interface that promises a lot. I think the reason 
why only two usecases (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) 
are reliable is that a user cannot evaluate the priority of all processes on 
the physical machine. If there were an agent process on the physical machine 
that could accurately divide all of its user processes into different 
priority levels, the other usecases of score_adj would work well, 
but that is almost impossible to achieve in real life.

Here is an example from practice. Kubelet sets the score_adj of the 
dockerinit process of every burstable container according to the 
following formula:

score_adj = 1000 - request * 1000 / totalpages
(request = "Fixed coefficient" * "memory.max")
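
To make the arithmetic concrete, here is a minimal sketch of that formula.
The function names, the integer rounding, and the sample numbers (a
coefficient of 0.5, a 4 GiB memory.max, a 16 GiB machine, 4 KiB pages) are
assumptions for illustration only; they are not taken from kubelet.

```python
def request_from_memory_max(memory_max_pages, coefficient):
    """request = "Fixed coefficient" * "memory.max", expressed in pages."""
    if not 0.0 <= coefficient <= 1.0:
        raise ValueError("coefficient is assumed to lie in [0, 1]")
    return int(coefficient * memory_max_pages)


def burstable_score_adj(request_pages, total_pages):
    """score_adj = 1000 - request * 1000 / totalpages (integer division)."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    if not 0 <= request_pages <= total_pages:
        raise ValueError("request_pages must lie in [0, total_pages]")
    return 1000 - (request_pages * 1000) // total_pages


if __name__ == "__main__":
    page = 4096
    memory_max_pages = (4 << 30) // page   # hypothetical 4 GiB memory.max
    total_pages = (16 << 30) // page       # hypothetical 16 GiB machine
    request = request_from_memory_max(memory_max_pages, 0.5)
    print(burstable_score_adj(request, total_pages))  # prints 875
```

A container that requests more of the machine gets a lower score_adj, and a
container that requests nothing ends up at 1000.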

Because kubelet has a clear understanding of the memory behavior of all the 
containers on the physical machine, it can use more of the score_adj range. 
The advantage of oom.protect is that users do not need a clear understanding 
of all the processes on the physical machine; they only need to understand 
the processes in their local cgroup. I think this requirement is very easy 
to meet.

>[...]
>
>> > But I am really open
>> >to be convinced otherwise and this is in fact what I have been asking
>> >for since the beginning. I would love to see some examples on the
>> >reasonable configuration for a practical usecase.
>> 
>> Here is a simple example. In a docker container, users can divide all processes 
>> into two categories (important and normal), and put them in different cgroups. 
>> One cgroup's oom.protect is set to "max", the other is set to "0". In this way, 
>> important processes in the container can be protected.
>
>That is effectivelly oom_score_adj = OOM_SCORE_ADJ_MIN - 1 to all
>processes in the important group. I would argue you can achieve a very
>similar result by the process launcher to set the oom_score_adj and
>inherit it to all processes in that important container. You do not need
>any memcg tunable for that. 
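
For reference, the launcher-based approach suggested above could look roughly
like the sketch below. The function names and the path parameter are
hypothetical; the sketch relies only on /proc/self/oom_score_adj being
writable and on the value being inherited by child processes. Lowering the
value generally requires CAP_SYS_RESOURCE.

```python
import subprocess

OOM_SCORE_ADJ_MIN = -1000
OOM_SCORE_ADJ_MAX = 1000


def write_oom_score_adj(value, path="/proc/self/oom_score_adj"):
    """Write value to the calling process's oom_score_adj file."""
    if not OOM_SCORE_ADJ_MIN <= value <= OOM_SCORE_ADJ_MAX:
        raise ValueError("oom_score_adj must be within [-1000, 1000]")
    with open(path, "w") as f:
        f.write(str(value))


def launch_with_oom_score_adj(argv, value):
    """Start argv with oom_score_adj set in the child before exec.

    Every process the child forks later inherits the same value.
    """
    return subprocess.Popen(
        argv, preexec_fn=lambda: write_oom_score_adj(value)
    )
```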

Your method is not feasible. Please refer to my previous email:
https://lore.kernel.org/linux-mm/E5A5BCC3-460E-4E81-8DD3-88B4A2868285@didiglobal.com/
* usecase 1: users say that they want to protect an important process 
* with high memory consumption from being killed by the oom killer in 
* case of docker container failure, so as to retain more critical on-site 
* information or a self recovery mechanism. At this time, they suggest 
* setting the score_adj of this process to -1000, but I don't agree with 
* it, because this docker container is not more important than the other 
* docker containers on the same physical machine. If the score_adj of the 
* process is set to -1000, the probability of oom kills among the 
* processes of other containers will increase.

>I am really much more interested in examples
>when the protection is to be fine tuned.
-- 
Thanks for your comment!
chengkaitao