linux-kernel - Re: [PATCH 1/2] memcg: reparent charges of children before processing parent

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20140213152745.GE11986@dhcp22.suse.cz>
Date:	Thu, 13 Feb 2014 16:27:45 +0100
From:	Michal Hocko <mhocko@...e.cz>
To:	Hugh Dickins <hughd@...gle.com>
Cc:	Tejun Heo <tj@...nel.org>, Johannes Weiner <hannes@...xchg.org>,
	Filipe Brandenburger <filbranden@...gle.com>,
	Li Zefan <lizefan@...wei.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	Greg Thelen <gthelen@...gle.com>,
	Michel Lespinasse <walken@...gle.com>,
	Markus Blank-Burian <burian@...nster.de>,
	Shawn Bohrer <shawn.bohrer@...il.com>, cgroups@...r.kernel.org,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH 1/2] memcg: reparent charges of children before
 processing parent

On Wed 12-02-14 15:03:31, Hugh Dickins wrote:
> From: Filipe Brandenburger <filbranden@...gle.com>
> 
> Sometimes the cleanup after memcg hierarchy testing gets stuck in
> mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
> 
> There may turn out to be several causes, but a major cause is this: the
> workitem to offline parent can get run before workitem to offline child;
> parent's mem_cgroup_reparent_charges() circles around waiting for the
> child's pages to be reparented to its lrus, but it's holding cgroup_mutex
> which prevents the child from reaching its mem_cgroup_reparent_charges().
> 
> Further testing showed that an ordered workqueue for cgroup_destroy_wq
> is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
> stage on the way can mess up the order before reaching the workqueue.

This whole code path is so complicated by different types of delayed
work that I am not wondering that we have missed that :/

> Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
> all its children (and grandchildren, in the correct order) to have their
> charges reparented first.

That is basically what I was suggesting
http://marc.info/?l=linux-mm&m=139178386407184&w=2 as #1 option. I
cannot say I would like it and I think that reparenting LRUs in
css_offline and then reparent the remaining charges from css_free is a
better solution but let's keep this for later.

> Fixes: e5fca243abae ("cgroup: use a dedicated workqueue for cgroup destruction")
> Signed-off-by: Filipe Brandenburger <filbranden@...gle.com>
> Signed-off-by: Hugh Dickins <hughd@...gle.com>
> Cc: stable@...r.kernel.org # v3.10+ (but will need extra care)

OK, I guess we should go with this one because it describes both the
problem in memcg offlining and requirements from the cgroup core so the
later approach can build on top of it.

Acked-by: Michal Hocko <mhocko@...e.cz>

> ---
> Or, you may prefer my alternative cgroup.c approach in 2/2:
> there's no need for both.  Please note that neither of these patches
> attempts to handle the unlikely case of racy charges made to child
> after its offline, but parent's offline coming before child's free:
> mem_cgroup_css_free()'s backstop call to mem_cgroup_reparent_charges()
> cannot help in that case, with or without these patches.  Fixing that
> would have to be a separate effort - Michal's?

I plan to implement the LRU care for css_offline and charge drain for
css_free later on but I have quite some work on my plate currently so I
cannot promis it will be done right now. I hope to have it soon though.

>  mm/memcontrol.c |   10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> --- 3.14-rc2/mm/memcontrol.c	2014-02-02 18:49:07.897302115 -0800
> +++ linux/mm/memcontrol.c	2014-02-11 17:48:07.604582963 -0800
> @@ -6595,6 +6595,7 @@ static void mem_cgroup_css_offline(struc
>  {
>  	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
>  	struct mem_cgroup_event *event, *tmp;
> +	struct cgroup_subsys_state *iter;
>  
>  	/*
>  	 * Unregister events and notify userspace.
> @@ -6611,7 +6612,14 @@ static void mem_cgroup_css_offline(struc
>  	kmem_cgroup_css_offline(memcg);
>  
>  	mem_cgroup_invalidate_reclaim_iterators(memcg);
> -	mem_cgroup_reparent_charges(memcg);
> +
> +	/*
> +	 * This requires that offlining is serialized.  Right now that is
> +	 * guaranteed because css_killed_work_fn() holds the cgroup_mutex.
> +	 */
> +	css_for_each_descendant_post(iter, css)
> +		mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));
> +
>  	mem_cgroup_destroy_all_caches(memcg);
>  	vmpressure_cleanup(&memcg->vmpressure);
>  }

-- 
Michal Hocko
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/