linux-kernel - Re: [PATCH 0/2] mm->owner to mm->memcg fixes

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <87wovjacrh.fsf@xmission.com>
Date:   Thu, 31 May 2018 13:41:38 -0500
From:   ebiederm@...ssion.com (Eric W. Biederman)
To:     Michal Hocko <mhocko@...nel.org>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Kirill Tkhai <ktkhai@...tuozzo.com>, peterz@...radead.org,
        viro@...iv.linux.org.uk, mingo@...nel.org,
        paulmck@...ux.vnet.ibm.com, keescook@...omium.org, riel@...hat.com,
        tglx@...utronix.de, kirill.shutemov@...ux.intel.com,
        marcos.souza.org@...il.com, hoeun.ryu@...il.com,
        pasha.tatashin@...cle.com, gs051095@...il.com, dhowells@...hat.com,
        rppt@...ux.vnet.ibm.com, linux-kernel@...r.kernel.org,
        Balbir Singh <balbir@...ux.vnet.ibm.com>,
        Tejun Heo <tj@...nel.org>, Oleg Nesterov <oleg@...hat.com>
Subject: Re: [PATCH 0/2] mm->owner to mm->memcg fixes

Michal Hocko <mhocko@...nel.org> writes:

> On Thu 24-05-18 14:16:35, Andrew Morton wrote:
>> On Thu, 24 May 2018 13:10:02 +0200 Michal Hocko <mhocko@...nel.org> wrote:
>> 
>> > I would really prefer and appreciate a repost with all the fixes folded
>> > in.
>> 
>> [1/2]
>
> Thanks Andrew and sorry it took so long! This seems to be missing the
> fix for the issue I've mentioned in http://lkml.kernel.org/r/20180510121838.GE5325@dhcp22.suse.cz
> and Eric wanted to handle by http://lkml.kernel.org/r/87wovu889o.fsf@xmission.com.
> I do not think that this fix is really correct one though. I will
> comment in a reply to that email.

Agreed.  That was not a correct fix to the possible infinite loop in
get_mem_cgroup_from_mm.   Which in net leaves all of this broken and
not ready to be merged.

> In any case I really think this should better be reposted in one patch
> series and restart the discussion. I strongly suggest revisiting my
> previous attempt http://lkml.kernel.org/r/1436358472-29137-8-git-send-email-mhocko@kernel.org
> including the follow up discussion regarding the unfortunate CLONE_VM
> outside of thread group case and potential solution for that.

With my fix for get_mem_cgroup_from_mm falling flat that limits our
possible future directions:
- Make memory cgroups honest and require all tasks of an mm to be in the
  same memory cgroup.  [BEST]
- Don't allow memory cgroups to be removed even if they are only populated by
  an mm.
- If tryget_online fails in get_mem_cgroup_from_mm find a parent of the
  memcg where tryget_online succeeds and use that one.

  That should not be possible to abose as even if a cgroup subtree is
  delegated it should not be possible for the processes to escape the
  subtree.  So in the worst case the charge would walk back up to the
  delegation point.

>> -static void mm_init_owner(struct mm_struct *mm, struct task_struct *p)
>> +static void mm_init_memcg(struct mm_struct *mm)
>>  {
>>  #ifdef CONFIG_MEMCG
>> -	mm->owner = p;
>> +	struct cgroup_subsys_state *css;
>> +
>> +	/* Ensure mm->memcg is initialized */
>> +	mm->memcg = NULL;
>> +
>> +	rcu_read_lock();
>> +	css = task_css(current, memory_cgrp_id);
>> +	if (css && css_tryget(css))
>> +		mm_update_memcg(mm, mem_cgroup_from_css(css));
>> +	rcu_read_unlock();
>
> Is it possible that css_tryget ever fails here? I remember I used to do
> plain css_get in my patch.

Short of races with task migration that can't fail.  As the current
task will keep the memory memory cgroup alive.

I will recheck when I refresh my patch.

>> @@ -4850,15 +4836,16 @@ static int mem_cgroup_can_attach(struct
>>  	if (!move_flags)
>>  		return 0;
>>  
>> -	from = mem_cgroup_from_task(p);
>> +	from = mem_cgroup_from_css(task_css(p, memory_cgrp_id));
>>  
>>  	VM_BUG_ON(from == memcg);
>>  
>>  	mm = get_task_mm(p);
>>  	if (!mm)
>>  		return 0;
>> -	/* We move charges only when we move a owner of the mm */
>> -	if (mm->owner == p) {
c>> +
>> +	/* We move charges except for creative uses of CLONE_VM */
>> +	if (mm->memcg == from) {
>
> I must be missing something here but how do you prevent those cases?
> AFAICS processes sharing the mm will simply allow to migrate.

The mm will only migrate once per set of tasks.  A task that does not
migrate with the mm will not have mm->memcg == from.  Which is all I was
referring to.  Perhaps I did not say that well.

>>  		VM_BUG_ON(mc.from);
>>  		VM_BUG_ON(mc.to);
>>  		VM_BUG_ON(mc.precharge);
>
> Other than that the patch makes sense to me.

I will take a bit and respin this.  Because of algorithmic bug-fix
nature of where this started I want to avoid depending on fixing the
semantics.  With my third option for fixing get_mem_cgroup_from_mm I see
how to do that now.  Then I will include a separate patch to fix the
semantics.  

Eric