linux-kernel - Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being oom killed

Message-ID: <C2CC36C1-29AE-4B65-A18A-19A745652182@didiglobal.com>
Date:   Thu, 1 Dec 2022 10:52:35 +0000
From:   程垲涛 Chengkaitao Cheng 
        <chengkaitao@...iglobal.com>
To:     Michal Hocko <mhocko@...e.com>
CC:     Tao pilgrim <pilgrimtao@...il.com>,
        "tj@...nel.org" <tj@...nel.org>,
        "lizefan.x@...edance.com" <lizefan.x@...edance.com>,
        "hannes@...xchg.org" <hannes@...xchg.org>,
        "corbet@....net" <corbet@....net>,
        "roman.gushchin@...ux.dev" <roman.gushchin@...ux.dev>,
        "shakeelb@...gle.com" <shakeelb@...gle.com>,
        "akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
        "songmuchun@...edance.com" <songmuchun@...edance.com>,
        "cgel.zte@...il.com" <cgel.zte@...il.com>,
        "ran.xiaokai@....com.cn" <ran.xiaokai@....com.cn>,
        "viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
        "zhengqi.arch@...edance.com" <zhengqi.arch@...edance.com>,
        "ebiederm@...ssion.com" <ebiederm@...ssion.com>,
        "Liam.Howlett@...cle.com" <Liam.Howlett@...cle.com>,
        "chengzhihao1@...wei.com" <chengzhihao1@...wei.com>,
        "haolee.swjtu@...il.com" <haolee.swjtu@...il.com>,
        "yuzhao@...gle.com" <yuzhao@...gle.com>,
        "willy@...radead.org" <willy@...radead.org>,
        "vasily.averin@...ux.dev" <vasily.averin@...ux.dev>,
        "vbabka@...e.cz" <vbabka@...e.cz>,
        "surenb@...gle.com" <surenb@...gle.com>,
        "sfr@...b.auug.org.au" <sfr@...b.auug.org.au>,
        "mcgrof@...nel.org" <mcgrof@...nel.org>,
        "sujiaxun@...ontech.com" <sujiaxun@...ontech.com>,
        "feng.tang@...el.com" <feng.tang@...el.com>,
        "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
        "linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
        "Bagas Sanjaya" <bagasdotme@...il.com>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being
 oom killed

At 2022-12-01 16:49:27, "Michal Hocko" <mhocko@...e.com> wrote:
>On Thu 01-12-22 04:52:27, 程垲涛 Chengkaitao Cheng wrote:
>> At 2022-12-01 00:27:54, "Michal Hocko" <mhocko@...e.com> wrote:
>> >On Wed 30-11-22 15:46:19, 程垲涛 Chengkaitao Cheng wrote:
>> >> On 2022-11-30 21:15:06, "Michal Hocko" <mhocko@...e.com> wrote:
>> >> > On Wed 30-11-22 15:01:58, chengkaitao wrote:
>> >> > > From: chengkaitao <pilgrimtao@...il.com>
>> >> > >
>> >> > > We created a new interface <memory.oom.protect> for memory. If the
>> >> > > OOM killer is invoked under a parent memory cgroup, and the memory usage of a
>> >> > > child cgroup is within its effective oom.protect boundary, the cgroup's
>> >> > > tasks won't be OOM killed unless there are no unprotected tasks in other
>> >> > > children cgroups. It draws on the logic of <memory.min/low> in the
>> >> > > inheritance relationship.
>> >> >
>> >> > Could you be more specific about usecases?
>> >
>> >This is a very important question to answer.
>> 
>> Use case 1: users want to protect an important process with high 
>> memory consumption from being killed by the OOM killer when its 
>> docker container fails, so as to retain more critical on-site 
>> information or a self-recovery mechanism. They suggest setting the 
>> score_adj of this process to -1000, but I don't agree with that, 
>> because the docker container is not important to the other docker 
>> containers on the same physical machine. If the score_adj of the 
>> process is set to -1000, the probability of oom kills in the other 
>> containers' processes will increase.
>> 
>> Use case 2: many business processes and agent processes are mixed 
>> together on a physical machine, and they need to be classified and 
>> protected. However, some agents are the parents of business 
>> processes, and some business processes are the parents of agent 
>> processes, so it is troublesome to set different score_adj values 
>> for them. Business processes and agents cannot determine which level 
>> their score_adj should be at. If we create another agent to set the 
>> score_adj of all processes, we have to cycle through all the 
>> processes on the physical machine regularly, which looks stupid.
>
>I do agree that oom_score_adj is far from an ideal tool for these usecases.
>But I also agree with Roman that these could be addressed by an oom
>killer implementation in the userspace which can have much better
>tailored policies. OOM protection limits would require tuning and also
>regular revisions (e.g. memory consumption by any workload might change
>with different kernel versions) to provide what you are looking for.

There is a misunderstanding: oom.protect does not replace the user's 
tailored policies. Its purpose is to make it easier and more efficient for 
users to customize policies, and to spare users from abandoning the oom 
score completely in order to formulate new policies.

>> >> > How do you tune oom.protect
>> >> > wrt other tunables? How does this interact with the oom_score_adj
>> >> > tuning (e.g. a first hand oom victim with the score_adj 1000 sitting
>> >> > in an oom protected memcg)?
>> >> 
>> >> We prefer that users use score_adj and oom.protect independently. score_adj is 
>> >> a parameter that applies to the host, and oom.protect is a parameter that applies 
>> >> to a cgroup. When the physical machine's memory size is particularly large, the 
>> >> score_adj granularity is also very coarse, whereas oom.protect allows more 
>> >> fine-grained adjustment.
>> >
>> >Let me clarify a bit. I am not trying to defend oom_score_adj. It has
>> >its well-known limitations and it is essentially unusable for many
>> >situations other than - hide or auto-select potential oom victim.
>> >
>> >> When the score_adj of the processes are the same, I list the following cases 
>> >> for explanation,
>> >> 
>> >>           root
>> >>            |
>> >>         cgroup A
>> >>        /        \
>> >>  cgroup B      cgroup C
>> >> (task m,n)     (task x,y)
>> >> 
>> >> score_adj(all task) = 0;
>> >> oom.protect(cgroup A) = 0;
>> >> oom.protect(cgroup B) = 0;
>> >> oom.protect(cgroup C) = 3G;
>> >
>> >How can you enforce protection at C level without any protection at A
>> >level? 
>> 
>> The basic idea of this scheme is that all processes in the same cgroup are 
>> equally important. If some processes need extra protection, a new cgroup 
>> needs to be created so that they can be configured together. I don't think 
>> it is necessary to implement protection inside cgroup C, because task x and 
>> task y are equally important. Only among the four processes (task m, n, x 
>> and y) in cgroup A is there a difference in importance.
>> 
>> > This would easily allow arbitrary cgroup to hide from the oom
>> > killer and spill over to other cgroups.
>> 
>> I don't think this will happen, because eoom.protect only works on the parent 
>> cgroup. If "oom.protect(parent cgroup) = 0", then from the perspective of the 
>> grandparent cgroup, task x and y will not be specially protected.
>
>Just to confirm I am on the same page. This means that there won't be
>any protection in case of the global oom in the above example. So
>effectively the same semantic as the low/min protection.
>
>> >> usage(task m) = 1G
>> >> usage(task n) = 2G
>> >> usage(task x) = 1G
>> >> usage(task y) = 2G
>> >> 
>> >> oom killer order of cgroup A: n > m > y > x
>> >> oom killer order of host:     y = n > x = m
>> >> 
>> >> If cgroup A is a directory maintained by users, users can use oom.protect 
>> >> to protect relatively important tasks x and y.
>> >> 
>> >> However, when score_adj and oom.protect are used at the same time, we 
>> >> also consider the impact of both, as expressed in the following formula, 
>> >> but I have to admit that the result is unstable.
>> >> score = task_usage + score_adj * totalpage - eoom.protect * task_usage / local_memcg_usage
>> >
>> >I hope I am not misreading but this has some rather unexpected
>> >properties. First off, bigger memory consumers in a protected memcg are
>> >protected more. 
>> 
>> Since the cgroup needs to distribute its protection quota reasonably among 
>> all of its processes, I think that processes consuming more memory 
>> should get more quota. This is also fair to processes consuming less 
>> memory: even if a process consuming more memory gets more quota, its 
>> oom_score is still higher than that of a process consuming less memory. 
>> When the oom killer is invoked in the local cgroup, the kill order 
>> remains unchanged.
>
>Why cannot you simply discount the protection from all processes
>equally? I do not follow why the task_usage has to play any role in
>that.

If the protection is discounted from all processes equally, the oom 
protection of the cgroup becomes meaningless. For example, a cgroup with 
more processes would protect more memory in total, which is unfair to 
cgroups with fewer processes. So the total amount of memory protected 
across all processes in the cgroup has to stay consistent with the value 
of eoom.protect.
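
To illustrate the proportional split, here is a minimal userspace sketch in C. It is not the patch code. It assumes the formula quoted elsewhere in this thread (score = task_usage + score_adj * totalpage - eoom.protect * task_usage / local_memcg_usage) and the statement that the calculation takes the minimum of memcg_usage and oom.protect. The function names, the MiB units and the pre-scaled adj_points argument are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* All sizes are in MiB here; the unit is an assumption of this sketch. */

/* The effective protection is capped by what the memcg actually uses. */
static int64_t capped_protection(int64_t eoom_protect, int64_t memcg_usage)
{
	return eoom_protect < memcg_usage ? eoom_protect : memcg_usage;
}

/* Share of the memcg's protection credited to one task, proportional to
 * the task's share of the memcg's usage. */
static int64_t task_protection(int64_t eoom_protect, int64_t task_usage,
			       int64_t memcg_usage)
{
	assert(task_usage >= 0 && task_usage <= memcg_usage);
	if (memcg_usage == 0)
		return 0;
	return capped_protection(eoom_protect, memcg_usage) * task_usage /
	       memcg_usage;
}

/* adj_points stands in for the "score_adj * totalpage" term of the
 * formula, already scaled by the caller (hypothetical parameter). */
static int64_t protected_score(int64_t task_usage, int64_t adj_points,
			       int64_t eoom_protect, int64_t memcg_usage)
{
	return task_usage + adj_points -
	       task_protection(eoom_protect, task_usage, memcg_usage);
}
```

With illustrative numbers (a cgroup using 3072 MiB, eoom.protect of 1536 MiB, tasks using 1024 MiB and 2048 MiB), the tasks are credited 512 MiB and 1024 MiB. The shares add up to eoom.protect regardless of how many tasks the cgroup contains, and the larger task still has the higher score, so the kill order inside the cgroup does not change.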
>> 
>> >Also I would expect the protection discount would
>> >be capped by the actual usage otherwise excessive protection
>> >configuration could skew the results considerably.
>> 
>> In the calculation, we will select the minimum value of memcg_usage and 
>> oom.protect.
>> 
>> >> > I haven't really read through the whole patch but this struck me odd.
>> >> 
>> >> > > @@ -552,8 +552,19 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
>> >> > > 	unsigned long totalpages = totalram_pages() + total_swap_pages;
>> >> > > 	unsigned long points = 0;
>> >> > > 	long badness;
>> >> > > +#ifdef CONFIG_MEMCG
>> >> > > +	struct mem_cgroup *memcg;
>> >> > > 
>> >> > > -	badness = oom_badness(task, totalpages);
>> >> > > +	rcu_read_lock();
>> >> > > +	memcg = mem_cgroup_from_task(task);
>> >> > > +	if (memcg && !css_tryget(&memcg->css))
>> >> > > +		memcg = NULL;
>> >> > > +	rcu_read_unlock();
>> >> > > +
>> >> > > +	update_parent_oom_protection(root_mem_cgroup, memcg);
>> >> > > +	css_put(&memcg->css);
>> >> > > +#endif
>> >> > > +	badness = oom_badness(task, totalpages, MEMCG_OOM_PROTECT);
>> >> >
>> >> > the badness means different thing depending on which memcg hierarchy
>> >> > subtree you look at. Scaling based on the global oom could get really
>> >> > misleading.
>> >> 
>> >> I also took it into consideration. I planned to change "/proc/pid/oom_score" 
>> >> to a writable node. When writing to different cgroup paths, different values 
>> >> will be output. The default output is root cgroup. Do you think this idea is 
>> >> feasible?
>> >
>> >I do not follow. Care to elaborate?
>> 
>> Take two examples:
>> cmd: cat /proc/pid/oom_score
>> output: Scaling based on the global oom
>> 
>> cmd: echo "/cgroupA/cgroupB" > /proc/pid/oom_score
>> output: Scaling based on the cgroupB oom
>> (If the task is not in the cgroupB's hierarchy subtree, output: invalid parameter)
>
>This is a terrible interface. First of all it assumes a state for the
>file without any way to guarantee atomicity. How do you deal with two
>different callers accessing the file?

When the echo command is executed, the kernel directly returns the 
calculated oom_score. No additional cat command is needed, and all 
temporary data is discarded immediately after the echo operation. 
When the cat command is executed, the kernel uses the root cgroup as 
the default, so these two operations are atomic.
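
For comparison, the read path that exists today (cat /proc/pid/oom_score) is a stateless read of a single integer, which a caller could perform as in the sketch below. The helper name read_oom_score is hypothetical. The write-then-return behaviour described above is only a proposal in this thread, not an existing kernel interface, so it is not modeled here.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read one integer score from a procfs-style file such as
 * /proc/<pid>/oom_score. Returns 0 on success and -1 if the file
 * cannot be opened or does not start with an integer.
 */
static int read_oom_score(const char *path, long *score)
{
	FILE *f;
	int ok;

	assert(path != NULL && score != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%ld", score) == 1;
	fclose(f);
	return ok ? 0 : -1;
}
```

On a Linux system a caller would typically pass "/proc/self/oom_score" or the path for another pid. Each call opens, reads and closes the file, so no state is kept between callers.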
>
Thanks for your comment!
chengkaitao