Message-ID: <ZGtoNu7zIRRy7qK0@dhcp22.suse.cz>
Date: Mon, 22 May 2023 15:03:50 +0200
From: Michal Hocko <mhocko@...e.com>
To: 程垲涛 Chengkaitao Cheng
<chengkaitao@...iglobal.com>
Cc: "tj@...nel.org" <tj@...nel.org>,
"lizefan.x@...edance.com" <lizefan.x@...edance.com>,
"hannes@...xchg.org" <hannes@...xchg.org>,
"corbet@....net" <corbet@....net>,
"roman.gushchin@...ux.dev" <roman.gushchin@...ux.dev>,
"shakeelb@...gle.com" <shakeelb@...gle.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"brauner@...nel.org" <brauner@...nel.org>,
"muchun.song@...ux.dev" <muchun.song@...ux.dev>,
"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
"zhengqi.arch@...edance.com" <zhengqi.arch@...edance.com>,
"ebiederm@...ssion.com" <ebiederm@...ssion.com>,
"Liam.Howlett@...cle.com" <Liam.Howlett@...cle.com>,
"chengzhihao1@...wei.com" <chengzhihao1@...wei.com>,
"pilgrimtao@...il.com" <pilgrimtao@...il.com>,
"haolee.swjtu@...il.com" <haolee.swjtu@...il.com>,
"yuzhao@...gle.com" <yuzhao@...gle.com>,
"willy@...radead.org" <willy@...radead.org>,
"vasily.averin@...ux.dev" <vasily.averin@...ux.dev>,
"vbabka@...e.cz" <vbabka@...e.cz>,
"surenb@...gle.com" <surenb@...gle.com>,
"sfr@...b.auug.org.au" <sfr@...b.auug.org.au>,
"mcgrof@...nel.org" <mcgrof@...nel.org>,
"feng.tang@...el.com" <feng.tang@...el.com>,
"cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
"linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: [PATCH v3 0/2] memcontrol: support cgroup level OOM protection
[Sorry for a late reply but I was mostly offline for the last 2 weeks]
On Tue 09-05-23 06:50:59, 程垲涛 Chengkaitao Cheng wrote:
> At 2023-05-08 22:18:18, "Michal Hocko" <mhocko@...e.com> wrote:
[...]
> >Your cover letter mentions that "all processes in the cgroup as a
> >whole" are then killed. That to me reads as the oom.group oom killer
> >policy. But a brief look into the patch suggests you are still looking
> >at specific tasks, and this has been a concern in the previous version
> >of the patch, because memcg accounting and per-process accounting are
> >detached.
>
> I think the memcg accounting may be more reasonable, as its memory
> statistics are more comprehensive; they include, for example, active
> page cache, which also increases the probability of an OOM kill. In the
> new patch, all the shared memory will also consume the oom_protect
> quota of the memcg, so the oom_protect quota left for the memcg's
> processes will decrease.
I am sorry but I do not follow. Could you elaborate please? Are you
arguing for per memcg or per process metrics?
[...]
> >> In the final discussion of patch v2, we discussed that although the
> >> adjustment range of oom_score_adj is [-1000,1000], it essentially
> >> only allows two usecases (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX)
> >> reliably. Everything in between is clumsy at best. To solve this
> >> problem, in the new patch I introduced a new indicator,
> >> oom_kill_inherit, which counts the number of times the local and
> >> child cgroups have been selected by the OOM killer of an ancestor
> >> cgroup. By observing the proportion of oom_kill_inherit in the parent
> >> cgroup, I can effectively adjust the value of oom_protect to achieve
> >> the best.
> >
> >What does the best mean in this context?
>
> I have created a new indicator oom_kill_inherit that maintains a negative correlation
> with memory.oom.protect, so we have a ruler to measure the optimal value of
> memory.oom.protect.
An example might help here.
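To be concrete about the kind of example I am after (numbers are made
up by me, purely for illustration): say a parent has two children, A
with oom.protect 800M and B with 200M, and after a while you observe
oom_kill_inherit of 2 for A and 10 for B. Which adjustment do you
derive from that proportion, and how do you decide that the resulting
oom_protect values are "the best" rather than merely shifting kills
from one child to the other?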
> >> About the semantics of non-leaf memcg protection:
> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcgs will
> >> proportionally calculate their new effective oom_protect quota based
> >> on the non-leaf memcg's quota.
> >
> >So the non-leaf memcg is never used as a target? What if the workload is
> >distributed over several sub-groups? Our current oom.group
> >implementation traverses the tree to find a common ancestor in the oom
> >domain with oom.group set.
>
> If the oom_protect quota of the parent non-leaf memcg is less than the
> sum of its sub-groups' oom_protect quotas, the oom_protect quota of
> each sub-group will be proportionally reduced.
> If the oom_protect quota of the parent non-leaf memcg is greater than
> the sum of its sub-groups' oom_protect quotas, the oom_protect quota of
> each sub-group will be proportionally increased.
> The purpose of doing so is that users can set oom_protect quotas
> according to their own needs, while the system management process can
> set an appropriate oom_protect quota on the parent non-leaf memcg as
> the final cover, so that it can indirectly manage all user processes.
I guess that you are trying to say that the oom protection has a
standard hierarchical behavior. And that is fine; in fact it is
mandatory for any control knob to have sane hierarchical properties.
But that doesn't address my above question. Let me try again. When is a
non-leaf memcg potentially selected as the oom victim? It doesn't have
any tasks directly, but it might be a suitable target to kill a
multi-memcg based workload (e.g. a full container).
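For the record, my understanding of the proportional calculation above
is something along these lines (my own sketch in plain C to check my
reading, not code from the patch):

	#include <stdint.h>

	/*
	 * Effective protection of one child: scaled down when the
	 * children over-commit the parent's quota, scaled up when they
	 * under-use it (overflow ignored for brevity).
	 */
	static uint64_t effective_oom_protect(uint64_t child_protect,
					      uint64_t parent_eprotect,
					      uint64_t children_protect_sum)
	{
		if (children_protect_sum == 0)
			return 0;
		return child_protect * parent_eprotect / children_protect_sum;
	}

If that matches the patch, then it indeed resembles the effective
protection calculation done for memory.low.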
> >All that being said, and with the usecase described more specifically,
> >I can see that memcg based oom victim selection makes some sense. That
> >means that it is always a memcg that is selected and all tasks within
> >it killed. Memcg based protection can be used to evaluate which memcg
> >to choose, and the overall scheme should still be manageable. It would
> >indeed resemble memory protection for the regular reclaim.
> >
> >One thing that is still not really clear to me is how group vs.
> >non-group ooms could be handled gracefully. Right now we can handle
> >that because the oom selection is still process based, but with the
> >protection this will become more problematic, as explained previously.
> >Essentially we would need to enforce the oom selection to be memcg
> >based for all memcgs. Maybe a mount knob? What do you think?
>
> There is a function in the patch to determine whether the oom_protect
> mechanism is enabled. All memory.oom.protect nodes default to 0, so the function
> <is_root_oom_protect> returns 0 by default.
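So if I follow, the switch between the two selection modes is implicit,
roughly along these lines (my own sketch with placeholder names, not
the patch's actual code):

	/*
	 * Memcg based victim selection kicks in as soon as any memcg in
	 * the hierarchy sets a non-zero memory.oom.protect; otherwise
	 * the traditional per-process selection is used.
	 */
	if (is_root_oom_protect())
		select_victim_memcg();	/* memcg based selection */
	else
		select_bad_process();	/* per-process selection */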
How can an admin determine what is the current oom detection logic?
--
Michal Hocko
SUSE Labs