lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20150110214316.GF25319@htj.dyndns.org>
Date:	Sat, 10 Jan 2015 16:43:16 -0500
From:	Tejun Heo <tj@...nel.org>
To:	Vladimir Davydov <vdavydov@...allels.com>
Cc:	"Suzuki K. Poulose" <Suzuki.Poulose@....com>,
	Johannes Weiner <hannes@...xchg.org>, linux-mm@...ck.org,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	Will Deacon <Will.Deacon@....com>
Subject: [PATCH cgroup/for-3.19-fixes] cgroup: implement
 cgroup_subsys->unbind() callback

Currently, if a hierarchy doesn't have any live children when it's
unmounted, the hierarchy starts dying by killing its refcnt.  The
expectation is that even if there are lingering dead children which
are lingering due to remaining references, they'll be put in a finite
amount of time.  When the children are finally released, the hierarchy
is destroyed and all controllers bound to it also are released.

However, for memcg, the premise that the lingering refs will be put in
a finite amount time is not true.  In the absense of memory pressure,
dead memcg's may hang around indefinitely pinned by its pages.  This
unfortunately may lead to indefinite hang on the next mount attempt
involving memcg as the mount logic waits for it to get released.

While we can change hierarchy destruction logic such that a hierarchy
is only destroyed when it's not mounted anywhere and all its children,
live or dead, are gone, this makes whether the hierarchy gets
destroyed or not to be determined by factors opaque to userland.
Userland may or may not get a new hierarchy on the next mount attempt.
Worse, if it explicitly wants to create a new hierarchy with different
options or controller compositions involving memcg, it will fail in an
essentially arbitrary manner.

We want to guarantee that a hierarchy is destroyed once the
conditions, unmounted and no visible children, are met.  To aid it,
this patch introduces a new callback cgroup_subsys->unbind() which is
invoked right before the hierarchy a subsystem is bound to starts
dying.  memcg can implement this callback and initiate draining of
remaining refs so that the hierarchy can eventually be released in a
finite amount of time.

Signed-off-by: Tejun Heo <tj@...nel.org>
Cc: Li Zefan <lizefan@...wei.com>
Cc: Johannes Weiner <hannes@...xchg.org>
Cc: Michal Hocko <mhocko@...e.cz>
Cc: Vladimir Davydov <vdavydov@...allels.com>
---
Hello,

> May be, we should kill the ref counter to the memory controller root in
> cgroup_kill_sb only if there is no children at all, neither online nor
> offline.

Ah, thanks for the analysis, but I really wanna avoid making hierarchy
destruction conditions opaque to userland.  This is userland visible
behavior.  It shouldn't be determined by kernel internals invisible
outside.  This patch adds ss->unbind() which memcg can hook into to
kick off draining of residual refs.  If this would work, I'll add this
patch to cgroup/for-3.19-fixes, possibly with stable cc'd.

Thanks.

 Documentation/cgroups/cgroups.txt |   12 +++++++++++-
 include/linux/cgroup.h            |    1 +
 kernel/cgroup.c                   |   14 ++++++++++++--
 3 files changed, 24 insertions(+), 3 deletions(-)

--- a/Documentation/cgroups/cgroups.txt
+++ b/Documentation/cgroups/cgroups.txt
@@ -637,7 +637,7 @@ void exit(struct task_struct *task)
 
 Called during task exit.
 
-void bind(struct cgroup *root)
+void bind(struct cgroup_subsys_state *root_css)
 (cgroup_mutex held by caller)
 
 Called when a cgroup subsystem is rebound to a different hierarchy
@@ -645,6 +645,16 @@ and root cgroup. Currently this will onl
 the default hierarchy (which never has sub-cgroups) and a hierarchy
 that is being created/destroyed (and hence has no sub-cgroups).
 
+void unbind(struct cgroup_subsys_state *root_css)
+(cgroup_mutex held by caller)
+
+Called when a cgroup subsys is unbound from its current hierarchy. The
+subsystem must guarantee that all offline cgroups are going to be
+released in a finite amount of time after this function is called.
+Such draining can be asynchronous. The following binding of the
+subsystem to a hierarchy will be delayed till the draining is
+complete.
+
 4. Extended attribute usage
 ===========================
 
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -654,6 +654,7 @@ struct cgroup_subsys {
 		     struct cgroup_subsys_state *old_css,
 		     struct task_struct *task);
 	void (*bind)(struct cgroup_subsys_state *root_css);
+	void (*unbind)(struct cgroup_subsys_state *root_css);
 
 	int disabled;
 	int early_init;
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1910,10 +1910,20 @@ static void cgroup_kill_sb(struct super_
 	 * And don't kill the default root.
 	 */
 	if (css_has_online_children(&root->cgrp.self) ||
-	    root == &cgrp_dfl_root)
+	    root == &cgrp_dfl_root) {
 		cgroup_put(&root->cgrp);
-	else
+	} else {
+		struct cgroup_subsys *ss;
+		int i;
+
+		mutex_lock(&cgroup_mutex);
+		for_each_subsys(ss, i)
+			if ((root->subsys_mask & (1UL << i)) && ss->unbind)
+				ss->unbind(cgroup_css(&root->cgrp, ss));
+		mutex_unlock(&cgroup_mutex);
+
 		percpu_ref_kill(&root->cgrp.self.refcnt);
+	}
 
 	kernfs_kill_sb(sb);
 }
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ