linux-kernel - Re: [PATCH for-4.6-fixes] memcg: remove lru_add_drain_all() invocation from mem_cgroup_move

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Date:	Sun, 17 Apr 2016 08:07:48 -0400
From:	Michal Hocko <mhocko@...nel.org>
To:	Tejun Heo <tj@...nel.org>
Cc:	Johannes Weiner <hannes@...xchg.org>,
	Petr Mladek <pmladek@...e.com>, cgroups@...r.kernel.org,
	Cyril Hrubis <chrubis@...e.cz>, linux-kernel@...r.kernel.org
Subject: Re: [PATCH for-4.6-fixes] memcg: remove lru_add_drain_all()
 invocation from mem_cgroup_move_charge()

On Fri 15-04-16 15:17:19, Tejun Heo wrote:
> mem_cgroup_move_charge() invokes lru_add_drain_all() so that the pvec
> pages can be moved too.  lru_add_drain_all() schedules and flushes
> work items on system_wq which depends on being able to create new
> kworkers to make forward progress.  Since 1ed1328792ff ("sched,
> cgroup: replace signal_struct->group_rwsem with a global
> percpu_rwsem"), a new task can't be created while in the cgroup
> migration path and the described lru_add_drain_all() invocation can
> easily lead to a deadlock.
> 
> Charge moving is best-effort and whether the pvec pages are migrated
> or not doesn't really matter.  Don't call it during charge moving.
> Eventually, we want to move the actual charge moving outside the
> migration path.
> 
> Signed-off-by: Tejun Heo <tj@...nel.org>
> Reported-by: Johannes Weiner <hannes@...xchg.org>

I guess
Debugged-by: Petr Mladek <pmladek@...e.com>
Reported-by: Cyril Hrubis <chrubis@...e.cz>

would be due

> Suggested-by: Michal Hocko <mhocko@...nel.org>
> Fixes: 1ed1328792ff ("sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem")
> Cc: stable@...r.kernel.org
> ---
> Hello,
> 
> So, this deadlock seems pretty easy to trigger.  We'll make the charge
> moving asynchronous eventually but let's not hold off fixing an
> immediate problem.

Although this looks rather straightforward and it fixes the immediate
problem I am little bit nervous about it. As already pointed out in
other email mem_cgroup_move_charge still depends on mmap_sem for
read and we might hit an even more subtle lockup if the current holder
of the mmap_sem for write depends on the task creation (e.g. some of the
direct reclaim path uses WQ which is really hard to rule out and I even
think that some shrinkers do this).

I liked your proposal when mem_cgroup_move_charge would be called from a
context which doesn't hold the problematic rwsem much more. Would that
be too intrusive for the stable backport?

> Thanks.
> 
>  mm/memcontrol.c |    1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 36db05f..56060c7 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4859,7 +4859,6 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
>  		.mm = mm,
>  	};
>  
> -	lru_add_drain_all();
>  	/*
>  	 * Signal lock_page_memcg() to take the memcg's move_lock
>  	 * while we're moving its pages to another memcg. Then wait

-- 
Michal Hocko
SUSE Labs