linux-kernel - Re: [PATCH v3 0/2] memcontrol: support cgroup level OOM protection

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <ZGtoNu7zIRRy7qK0@dhcp22.suse.cz>
Date:   Mon, 22 May 2023 15:03:50 +0200
From:   Michal Hocko <mhocko@...e.com>
To:     程垲涛 Chengkaitao Cheng 
        <chengkaitao@...iglobal.com>
Cc:     "tj@...nel.org" <tj@...nel.org>,
        "lizefan.x@...edance.com" <lizefan.x@...edance.com>,
        "hannes@...xchg.org" <hannes@...xchg.org>,
        "corbet@....net" <corbet@....net>,
        "roman.gushchin@...ux.dev" <roman.gushchin@...ux.dev>,
        "shakeelb@...gle.com" <shakeelb@...gle.com>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "brauner@...nel.org" <brauner@...nel.org>,
        "muchun.song@...ux.dev" <muchun.song@...ux.dev>,
        "viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
        "zhengqi.arch@...edance.com" <zhengqi.arch@...edance.com>,
        "ebiederm@...ssion.com" <ebiederm@...ssion.com>,
        "Liam.Howlett@...cle.com" <Liam.Howlett@...cle.com>,
        "chengzhihao1@...wei.com" <chengzhihao1@...wei.com>,
        "pilgrimtao@...il.com" <pilgrimtao@...il.com>,
        "haolee.swjtu@...il.com" <haolee.swjtu@...il.com>,
        "yuzhao@...gle.com" <yuzhao@...gle.com>,
        "willy@...radead.org" <willy@...radead.org>,
        "vasily.averin@...ux.dev" <vasily.averin@...ux.dev>,
        "vbabka@...e.cz" <vbabka@...e.cz>,
        "surenb@...gle.com" <surenb@...gle.com>,
        "sfr@...b.auug.org.au" <sfr@...b.auug.org.au>,
        "mcgrof@...nel.org" <mcgrof@...nel.org>,
        "feng.tang@...el.com" <feng.tang@...el.com>,
        "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [PATCH v3 0/2] memcontrol: support cgroup level OOM protection

[Sorry for a late reply but I was mostly offline last 2 weeks]

On Tue 09-05-23 06:50:59, 程垲涛 Chengkaitao Cheng wrote:
> At 2023-05-08 22:18:18, "Michal Hocko" <mhocko@...e.com> wrote:
[...]
> >Your cover letter mentions that then "all processes in the cgroup as a
> >whole". That to me reads as oom.group oom killer policy. But a brief
> >look into the patch suggests you are still looking at specific tasks and
> >this has been a concern in the previous version of the patch because
> >memcg accounting and per-process accounting are detached.
> 
> I think the memcg accounting may be more reasonable, as its memory 
> statistics are more comprehensive, similar to active page cache, which 
> also increases the probability of OOM-kill. In the new patch, all the 
> shared memory will also consume the oom_protect quota of the memcg, 
> and the process's oom_protect quota of the memcg will decrease.

I am sorry but I do not follow. Could you elaborate please? Are you
arguing for per memcg or per process metrics?

[...]

> >> In the final discussion of patch v2, we discussed that although the adjustment range 
> >> of oom_score_adj is [-1000,1000], but essentially it only allows two usecases
> >> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is 
> >> clumsy at best. In order to solve this problem in the new patch, I introduced a new 
> >> indicator oom_kill_inherit, which counts the number of times the local and child 
> >> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing 
> >> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the 
> >> value of oom_protect to achieve the best.
> >
> >What does the best mean in this context?
> 
> I have created a new indicator oom_kill_inherit that maintains a negative correlation 
> with memory.oom.protect, so we have a ruler to measure the optimal value of 
> memory.oom.protect.

An example might help here.

> >> about the semantics of non-leaf memcgs protection,
> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally 
> >> calculate the new effective oom_protect quota based on non-leaf memcg's quota.
> >
> >So the non-leaf memcg is never used as a target? What if the workload is
> >distributed over several sub-groups? Our current oom.group
> >implementation traverses the tree to find a common ancestor in the oom
> >domain with the oom.group.
> 
> If the oom_protect quota of the parent non-leaf memcg is less than the sum of 
> sub-groups oom_protect quota, the oom_protect quota of each sub-group will 
> be proportionally reduced
> If the oom_protect quota of the parent non-leaf memcg is greater than the sum 
> of sub-groups oom_protect quota, the oom_protect quota of each sub-group 
> will be proportionally increased
> The purpose of doing so is that users can set oom_protect quota according to 
> their own needs, and the system management process can set appropriate 
> oom_protect quota on the parent non-leaf memcg as the final cover, so that 
> the system management process can indirectly manage all user processes.

I guess that you are trying to say that the oom protection has a
standard hierarchical behavior. And that is fine, well, in fact it is
mandatory for any control knob to have a sane hierarchical properties.
But that doesn't address my above question. Let me try again. When is a
non-leaf memcg potentially selected as the oom victim? It doesn't have
any tasks directly but it might be a suitable target to kill a multi
memcg based workload (e.g. a full container).

> >All that being said and with the usecase described more specifically. I
> >can see that memcg based oom victim selection makes some sense. That
> >menas that it is always a memcg selected and all tasks withing killed.
> >Memcg based protection can be used to evaluate which memcg to choose and
> >the overall scheme should be still manageable. It would indeed resemble
> >memory protection for the regular reclaim.
> >
> >One thing that is still not really clear to me is to how group vs.
> >non-group ooms could be handled gracefully. Right now we can handle that
> >because the oom selection is still process based but with the protection
> >this will become more problematic as explained previously. Essentially
> >we would need to enforce the oom selection to be memcg based for all
> >memcgs. Maybe a mount knob? What do you think?
> 
> There is a function in the patch to determine whether the oom_protect 
> mechanism is enabled. All memory.oom.protect nodes default to 0, so the function 
> <is_root_oom_protect> returns 0 by default.

How can an admin determine what is the current oom detection logic?

-- 
Michal Hocko
SUSE Labs