Message-ID: <7EF16CB9-C34A-410B-BEBE-0303C1BB7BA0@didiglobal.com>
Date: Wed, 30 Nov 2022 15:46:19 +0000
From: 程垲涛 Chengkaitao Cheng
<chengkaitao@...iglobal.com>
To: Tao pilgrim <pilgrimtao@...il.com>,
"mhocko@...e.com" <mhocko@...e.com>
CC: "tj@...nel.org" <tj@...nel.org>,
"lizefan.x@...edance.com" <lizefan.x@...edance.com>,
"hannes@...xchg.org" <hannes@...xchg.org>,
"corbet@....net" <corbet@....net>,
"roman.gushchin@...ux.dev" <roman.gushchin@...ux.dev>,
"shakeelb@...gle.com" <shakeelb@...gle.com>,
"akpm@...ux-foundation.org" <akpm@...ux-foundation.org>,
"songmuchun@...edance.com" <songmuchun@...edance.com>,
"cgel.zte@...il.com" <cgel.zte@...il.com>,
"ran.xiaokai@....com.cn" <ran.xiaokai@....com.cn>,
"viro@...iv.linux.org.uk" <viro@...iv.linux.org.uk>,
"zhengqi.arch@...edance.com" <zhengqi.arch@...edance.com>,
"ebiederm@...ssion.com" <ebiederm@...ssion.com>,
"Liam.Howlett@...cle.com" <Liam.Howlett@...cle.com>,
"chengzhihao1@...wei.com" <chengzhihao1@...wei.com>,
"mhocko@...nel.org" <mhocko@...nel.org>,
"haolee.swjtu@...il.com" <haolee.swjtu@...il.com>,
"yuzhao@...gle.com" <yuzhao@...gle.com>,
"willy@...radead.org" <willy@...radead.org>,
"vasily.averin@...ux.dev" <vasily.averin@...ux.dev>,
"vbabka@...e.cz" <vbabka@...e.cz>,
"surenb@...gle.com" <surenb@...gle.com>,
"sfr@...b.auug.org.au" <sfr@...b.auug.org.au>,
"mcgrof@...nel.org" <mcgrof@...nel.org>,
"sujiaxun@...ontech.com" <sujiaxun@...ontech.com>,
"feng.tang@...el.com" <feng.tang@...el.com>,
"cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
"linux-doc@...r.kernel.org" <linux-doc@...r.kernel.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-fsdevel@...r.kernel.org" <linux-fsdevel@...r.kernel.org>,
Bagas Sanjaya <bagasdotme@...il.com>,
"linux-mm@...ck.org" <linux-mm@...ck.org>,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>
Subject: Re: [PATCH] mm: memcontrol: protect the memory in cgroup from being
oom killed
On 2022-11-30 21:15:06, "Michal Hocko" <mhocko@...e.com> wrote:
> On Wed 30-11-22 15:01:58, chengkaitao wrote:
> > From: chengkaitao <pilgrimtao@...il.com>
> >
> > We created a new interface <memory.oom.protect> for memory. If the OOM
> > killer is invoked under a parent memory cgroup, and the memory usage of a
> > child cgroup is within its effective oom.protect boundary, the cgroup's
> > tasks won't be OOM killed unless there are no unprotected tasks in other
> > children cgroups. It draws on the logic of <memory.min/low> in the
> > inheritance relationship.
>
> Could you be more specific about usecases? How do you tune oom.protect
> wrt other tunables? How does this interact with the oom_score_adj
> tuning (e.g. a first hand oom victim with the score_adj 1000 sitting
> in an oom protected memcg)?
We prefer users to use score_adj and oom.protect independently. score_adj is
a host-level parameter, while oom.protect applies to a cgroup. When the
physical machine has a particularly large amount of memory, the score_adj
granularity becomes very coarse, whereas oom.protect allows more
fine-grained adjustment.
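
As a rough illustration of the granularity gap (my own numbers, not taken
from the patch): with 4 KiB pages, oom_badness() scales oom_score_adj by
totalpages / 1000, so on a 1 TiB host

  totalpages         ~= 1 TiB / 4 KiB = 268435456 pages
  one score_adj step ~= 268435456 / 1000 ~= 268435 pages ~= 1 GiB

while oom.protect is specified in bytes.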
When all processes have the same score_adj, consider the following example:
                      root
                       |
                    cgroup A
                   /        \
            cgroup B      cgroup C
           (task m,n)     (task x,y)
score_adj(all task) = 0;
oom.protect(cgroup A) = 0;
oom.protect(cgroup B) = 0;
oom.protect(cgroup C) = 3G;
usage(task m) = 1G
usage(task n) = 2G
usage(task x) = 1G
usage(task y) = 2G
oom killer order of cgroup A: n > m > y > x
oom killer order of host: y = n > x = m
If cgroup A is a directory maintained by users, they can use oom.protect
to protect the relatively important tasks x and y.
However, when score_adj and oom.protect are used at the same time, we also
consider the impact of both, as expressed in the following formula, though
I have to admit the result is not yet stable:
score = task_usage + score_adj * totalpage - eoom.protect * task_usage / local_memcg_usage
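
To make the example above concrete, here is a small standalone sketch (my
own illustration, not code from the patch) that plugs the example numbers
into this formula. It assumes, like memory.min/low, that a child's effective
protection is capped by its ancestors, so cgroup C keeps its 3G when the OOM
root is cgroup A but loses it when the OOM root is the host; the y > x part
of the order is assumed to fall back to raw usage once both scores hit zero.

/* Illustrative only, not code from the patch. Units are GiB; score_adj
 * is 0 here, so the score_adj * totalpage term of the formula drops out. */
#include <stdio.h>

struct task {
	const char *name;
	long usage;		/* task_usage, in GiB */
	long memcg_usage;	/* local_memcg_usage, in GiB */
	long eprotect;		/* effective oom.protect, in GiB */
};

static long score(const struct task *t)
{
	return t->usage - t->eprotect * t->usage / t->memcg_usage;
}

int main(void)
{
	/* OOM root = cgroup A: cgroup C keeps its 3G protection. */
	const struct task under_A[] = {
		{ "m", 1, 3, 0 }, { "n", 2, 3, 0 },	/* cgroup B */
		{ "x", 1, 3, 3 }, { "y", 2, 3, 3 },	/* cgroup C */
	};
	/* OOM root = host: cgroup A's oom.protect = 0 caps cgroup C's
	 * effective protection at 0 (assumed min/low-style capping). */
	const struct task under_root[] = {
		{ "m", 1, 3, 0 }, { "n", 2, 3, 0 },
		{ "x", 1, 3, 0 }, { "y", 2, 3, 0 },
	};
	int i;

	for (i = 0; i < 4; i++)
		printf("cgroup A oom: task %s score %ld\n",
		       under_A[i].name, score(&under_A[i]));
	for (i = 0; i < 4; i++)
		printf("host oom:     task %s score %ld\n",
		       under_root[i].name, score(&under_root[i]));
	return 0;
}

Higher scores are selected first, so this gives n > m > (x, y) under
cgroup A and y = n > x = m at host level, matching the orders above apart
from the assumed tiebreak.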
> I haven't really read through the whole patch but this struck me odd.
> > @@ -552,8 +552,19 @@ static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
> > unsigned long totalpages = totalram_pages() + total_swap_pages;
> > unsigned long points = 0;
> > long badness;
> > +#ifdef CONFIG_MEMCG
> > + struct mem_cgroup *memcg;
> >
> > - badness = oom_badness(task, totalpages);
> > + rcu_read_lock();
> > + memcg = mem_cgroup_from_task(task);
> > + if (memcg && !css_tryget(&memcg->css))
> > + memcg = NULL;
> > + rcu_read_unlock();
> > +
> > + update_parent_oom_protection(root_mem_cgroup, memcg);
> > + css_put(&memcg->css);
> > +#endif
> > + badness = oom_badness(task, totalpages, MEMCG_OOM_PROTECT);
>
> the badness means a different thing depending on which memcg hierarchy
> subtree you look at. Scaling based on the global oom could get really
> misleading.
I also took that into consideration. I planned to make "/proc/pid/oom_score"
a writable node: writing a cgroup path to it would make subsequent reads
report the badness relative to that subtree, with the root cgroup as the
default. Do you think this idea is feasible?
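
To illustrate the proposal (purely hypothetical: /proc/<pid>/oom_score is
read-only today, and the pid and cgroup path below are made up), userspace
might use it like this:

#include <stdio.h>

int main(void)
{
	char buf[64];
	FILE *f = fopen("/proc/1234/oom_score", "r+");	/* 1234: example pid */

	if (!f)
		return 1;
	/* Select the memcg subtree the badness should be computed against. */
	fputs("/sys/fs/cgroup/A\n", f);
	fflush(f);
	rewind(f);
	if (fgets(buf, sizeof(buf), f))
		printf("oom_score relative to cgroup A: %s", buf);
	fclose(f);
	return 0;
}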
--
Chengkaitao
Best wishes