linux-kernel - Re: [PATCH v3 0/2] memcontrol: support cgroup level OOM protection

Message-ID: <96BFCF52-A5F6-4B73-ACAE-ACF11798E374@didiglobal.com>
Date:   Thu, 25 May 2023 07:35:41 +0000
From:   程垲涛 Chengkaitao Cheng 
        <chengkaitao@...iglobal.com>
To:     Michal Hocko <mhocko@...e.com>
CC:     "tj@...nel.org" <tj@...nel.org>,
        "lizefan.x@...edance.com" <lizefan.x@...edance.com>,
        "hannes@...xchg.org" <hannes@...xchg.org>,
        "corbet@....net" <corbet@....net>,
        "roman.gushchin@...ux.dev" <roman.gushchin@...ux.dev>,
        "shakeelb@...gle.com" <shakeelb@...gle.com>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "brauner@...nel.org" <brauner@...nel.org>,
        "muchun.song@...ux.dev" <muchun.song@...ux.dev>,
        "viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
        "zhengqi.arch@...edance.com" <zhengqi.arch@...edance.com>,
        "ebiederm@...ssion.com" <ebiederm@...ssion.com>,
        "Liam.Howlett@...cle.com" <Liam.Howlett@...cle.com>,
        "chengzhihao1@...wei.com" <chengzhihao1@...wei.com>,
        "pilgrimtao@...il.com" <pilgrimtao@...il.com>,
        "haolee.swjtu@...il.com" <haolee.swjtu@...il.com>,
        "yuzhao@...gle.com" <yuzhao@...gle.com>,
        "willy@...radead.org" <willy@...radead.org>,
        "vasily.averin@...ux.dev" <vasily.averin@...ux.dev>,
        "vbabka@...e.cz" <vbabka@...e.cz>,
        "surenb@...gle.com" <surenb@...gle.com>,
        "sfr@...b.auug.org.au" <sfr@...b.auug.org.au>,
        "mcgrof@...nel.org" <mcgrof@...nel.org>,
        "feng.tang@...el.com" <feng.tang@...el.com>,
        "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [PATCH v3 0/2] memcontrol: support cgroup level OOM protection

At 2023-05-22 21:03:50, "Michal Hocko" <mhocko@...e.com> wrote:
>[Sorry for a late reply but I was mostly offline last 2 weeks]
>
>On Tue 09-05-23 06:50:59, 程垲涛 Chengkaitao Cheng wrote:
>> At 2023-05-08 22:18:18, "Michal Hocko" <mhocko@...e.com> wrote:
>[...]
>> >Your cover letter mentions that then "all processes in the cgroup as a
>> >whole". That to me reads as oom.group oom killer policy. But a brief
>> >look into the patch suggests you are still looking at specific tasks and
>> >this has been a concern in the previous version of the patch because
>> >memcg accounting and per-process accounting are detached.
>> 
>> I think the memcg accounting may be more reasonable, as its memory 
>> statistics are more comprehensive, similar to active page cache, which 
>> also increases the probability of OOM-kill. In the new patch, all the 
>> shared memory will also consume the oom_protect quota of the memcg, 
>> and the process's oom_protect quota of the memcg will decrease.
>
>I am sorry but I do not follow. Could you elaborate please? Are you
>arguing for per memcg or per process metrics?

You mentioned earlier that 'memcg accounting and per process accounting
are detached', and I may have misunderstood your question. What I meant
above is that memcg accounting is more comprehensive than per-process
accounting, so using memcg accounting in the OOM-killer mechanism would
be more reasonable.

>[...]
>
>> >> In the final discussion of patch v2, we discussed that although the adjustment range 
>> >> of oom_score_adj is [-1000,1000], but essentially it only allows two usecases
>> >> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is 
>> >> clumsy at best. In order to solve this problem in the new patch, I introduced a new 
>> >> indicator oom_kill_inherit, which counts the number of times the local and child 
>> >> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing 
>> >> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the 
>> >> value of oom_protect to achieve the best.
>> >
>> >What does the best mean in this context?
>> 
>> I have created a new indicator oom_kill_inherit that maintains a negative correlation 
>> with memory.oom.protect, so we have a ruler to measure the optimal value of 
>> memory.oom.protect.
>
>An example might help here.

In my test case, adjusting memory.oom.protect significantly reduced the
oom_kill_inherit of the corresponding cgroup. On a physical machine with
heavily overcommitted memory, I divided all cgroups into three categories
and controlled their probability of being selected by the oom-killer to
0%, 20%, and 80%, respectively.
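
As a rough illustration of how the oom_kill_inherit proportions can be read,
here is a minimal sketch. The function name, the category names, and the
counter values are hypothetical and are not measurements from my test; the
sketch only assumes that each child cgroup's oom_kill_inherit count has
already been collected.

```python
def inherit_shares(counts):
    """Return each child's share of the parent's OOM selections.

    `counts` maps a child cgroup name to its oom_kill_inherit counter.
    """
    total = sum(counts.values())
    if total == 0:
        # No OOM selection has happened yet under this parent.
        return {name: 0.0 for name in counts}
    return {name: n / total for name, n in counts.items()}


# Hypothetical counters for three categories of cgroups.
shares = inherit_shares({"critical": 0, "normal": 2, "best_effort": 8})
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

If a category's share is higher than intended, its memory.oom.protect would
be raised and the counters observed again.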

>> >> about the semantics of non-leaf memcgs protection,
>> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally 
>> >> calculate the new effective oom_protect quota based on non-leaf memcg's quota.
>> >
>> >So the non-leaf memcg is never used as a target? What if the workload is
>> >distributed over several sub-groups? Our current oom.group
>> >implementation traverses the tree to find a common ancestor in the oom
>> >domain with the oom.group.
>> 
>> If the oom_protect quota of the parent non-leaf memcg is less than the sum of 
>> sub-groups oom_protect quota, the oom_protect quota of each sub-group will 
>> be proportionally reduced
>> If the oom_protect quota of the parent non-leaf memcg is greater than the sum 
>> of sub-groups oom_protect quota, the oom_protect quota of each sub-group 
>> will be proportionally increased
>> The purpose of doing so is that users can set oom_protect quota according to 
>> their own needs, and the system management process can set appropriate 
>> oom_protect quota on the parent non-leaf memcg as the final cover, so that 
>> the system management process can indirectly manage all user processes.
>
>I guess that you are trying to say that the oom protection has a
>standard hierarchical behavior. And that is fine, well, in fact it is
>mandatory for any control knob to have a sane hierarchical properties.
>But that doesn't address my above question. Let me try again. When is a
>non-leaf memcg potentially selected as the oom victim? It doesn't have
>any tasks directly but it might be a suitable target to kill a multi
>memcg based workload (e.g. a full container).

If a non-leaf memcg has higher memory usage and a smaller
memory.oom.protect, it has a higher probability of being selected by
the OOM killer. If a non-leaf memcg is selected as the oom victim, the
OOM killer continues to select the appropriate child memcg downwards
until a leaf memcg is selected.
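
To make the two steps concrete (proportional distribution of the parent's
quota, then top-down selection), here is a simplified model. It is not the
code from the patch: the Memcg class, the field names, and in particular the
"usage minus effective protection" score are assumptions chosen only to match
the description above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Memcg:
    name: str
    usage: int = 0      # memory charged directly to this memcg (MiB)
    protect: int = 0    # configured memory.oom.protect (MiB)
    children: List["Memcg"] = field(default_factory=list)
    eprotect: int = 0   # effective protection after scaling

    def total_usage(self) -> int:
        return self.usage + sum(c.total_usage() for c in self.children)


def propagate(parent: Memcg) -> None:
    """Scale each child's quota so the children share the parent's quota."""
    total = sum(c.protect for c in parent.children)
    for c in parent.children:
        # Proportionally reduced if total > parent quota, increased if less.
        c.eprotect = parent.eprotect * c.protect // total if total else 0
        propagate(c)


def select_victim(memcg: Memcg) -> Memcg:
    """Descend from the OOM domain until a leaf memcg is reached."""
    while memcg.children:
        # Assumed score: more usage and less protection means more likely.
        memcg = max(memcg.children,
                    key=lambda c: c.total_usage() - c.eprotect)
    return memcg


a = Memcg("a", usage=250, protect=100)
b = Memcg("b", usage=300, protect=300)
root = Memcg("container", protect=300, children=[a, b])
root.eprotect = root.protect
propagate(root)
print(a.eprotect, b.eprotect, select_victim(root).name)
```

In this example the children ask for 400 in total but the parent only has
300, so both quotas are scaled by 3/4. "b" uses more memory than "a", but it
is also better protected, so the walk ends at "a".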

>> >All that being said and with the usecase described more specifically. I
>> >can see that memcg based oom victim selection makes some sense. That
>> >menas that it is always a memcg selected and all tasks withing killed.
>> >Memcg based protection can be used to evaluate which memcg to choose and
>> >the overall scheme should be still manageable. It would indeed resemble
>> >memory protection for the regular reclaim.
>> >
>> >One thing that is still not really clear to me is to how group vs.
>> >non-group ooms could be handled gracefully. Right now we can handle that
>> >because the oom selection is still process based but with the protection
>> >this will become more problematic as explained previously. Essentially
>> >we would need to enforce the oom selection to be memcg based for all
>> >memcgs. Maybe a mount knob? What do you think?
>> 
>> There is a function in the patch to determine whether the oom_protect 
>> mechanism is enabled. All memory.oom.protect nodes default to 0, so the function 
>> <is_root_oom_protect> returns 0 by default.
>
>How can an admin determine what is the current oom detection logic?

The memory.oom.protect values are set by the administrators themselves,
so they must know what the current OOM policy is. Reading the
memory.oom.protect of the first-level cgroup directories and checking
whether it is 0 also shows whether the oom.protect policy is enabled.
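
A minimal sketch of that check from userspace follows. It assumes the
proposed memory.oom.protect file exists in each first-level cgroup directory
under a cgroup v2 mount at /sys/fs/cgroup and holds either a byte count or
"max"; the file format and the helper name are assumptions, since this
interface only exists with the patch applied.

```python
from pathlib import Path


def oom_protect_enabled(cgroup_root="/sys/fs/cgroup"):
    """Return True if any first-level cgroup has a non-zero oom.protect."""
    for child in Path(cgroup_root).iterdir():
        if not child.is_dir():
            continue
        knob = child / "memory.oom.protect"
        if not knob.is_file():
            # Kernel without the patch, or memory controller not enabled.
            continue
        value = knob.read_text().strip()
        # Assumed format: a plain byte count, or "max" for unlimited.
        if value == "max" or (value.isdigit() and int(value) > 0):
            return True
    return False
```

If every first-level value is 0, the function returns False, which
corresponds to the default state where the protection policy is not in use.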

For a given process, the physical machine administrator, k8s
administrator, agent administrator, and container administrator each see
a different effective memory.oom.protect, so each of them only needs to
pay attention to the memory.oom.protect of their local cgroup directory.
I don't think there is a business requirement for one administrator to
know the OOM detection logic of all the other administrators.

-- 
Thanks for your comment!
Chengkaitao