Message-ID: <20140410030855.GA29658@mail.hallyn.com>
Date: Thu, 10 Apr 2014 05:08:55 +0200
From: "Serge E. Hallyn" <serge@...lyn.com>
To: Tejun Heo <tj@...nel.org>
Cc: gregkh@...uxfoundation.org, rlove@...ve.org,
containers@...ts.linux-foundation.org, serge.hallyn@...ntu.com,
kay@...y.org, linux-kernel@...r.kernel.org, lennart@...ttering.net,
cgroups@...r.kernel.org, eparis@...isplace.org,
john@...nmccutchan.com
Subject: Re: [PATCH 3/3] cgroup: implement cgroup.subtree_populated for the
default hierarchy
Quoting Tejun Heo (tj@...nel.org):
> cgroup users often need a way to determine when a cgroup's
> subhierarchy becomes empty so that it can be cleaned up. cgroup
> currently provides release_agent for it; unfortunately, this mechanism
> is riddled with issues.
Thanks, Tejun.
> * It delivers events by forking and execing a userland binary
> specified as the release_agent. This is a long deprecated method of
> notification delivery. It's extremely heavy, slow and cumbersome to
> integrate with larger infrastructure.
(Not seriously worried about this, but it's a point worth considering)
It does have one advantage though: if the userspace agent goes bad,
cgroups can still be removed on empty.
Do you plan on keeping release-on-empty around? I assume only for a
while?
Do you think there is any value in having a simpler "remove-when-empty"
file? Doesn't call out to userspace, just drops the cgroup when there
are no more tasks or sub-cgroups?
> * There is a single monitoring point at the root. There's no way to
> delegate management of a subtree.
>
> * The event isn't recursive. It triggers when a cgroup doesn't have
> any tasks or child cgroups. Events for internal nodes trigger only
> after all children are removed. This again makes it impossible to
> delegate management of subtree.
>
> * Events are filtered from the kernel side. "notify_on_release" file
> is used to subscribe to or suppress release events, and events are not
> generated if a cgroup becomes empty by moving the last task out of
> it; however, an event is generated if it becomes empty because the last
> child cgroup is removed. This is inconsistent, awkward and
Hm, maybe I'm misreading but this doesn't seem right. If I move
a task into x1 and kill the task, x1 goes away. Likewise if I
create x1/y1, and rmdir y1, x1 goes away. I suspect I'm misunderstanding
the case in which you say it doesn't happen?
> unnecessarily complicated and probably done this way because event
> delivery itself was expensive.
>
> This patch implements interface file "cgroup.subtree_populated" which
> can be used to monitor whether the cgroup's subhierarchy has tasks in
> it or not. Its value is 1 if there is no task in the cgroup and its
I think you meant this backward? It's 1 if there is *any* task in
the cgroup and its descendants, else 0?
> descendants; otherwise, 0, and kernfs_notify() notification is
> triggered when the value changes, which can be monitored through poll
> and [di]notify.
>
> This is a lot lighter and simpler and trivially allows delegating
> management of subhierarchy - subhierarchy monitoring can block further
> propagation simply by putting itself or another process in the root of
> the subhierarchy and monitoring events that it's interested in from there
> without interfering with monitoring higher in the tree.
>
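For illustration, a userspace consumer of the file might look roughly
like the sketch below. The helper names read_populated() and
wait_populated_change() are made up for this example and are not part of
the patch. It also assumes that a kernfs_notify() wakeup reaches poll()
as POLLPRI/POLLERR and that the file must be re-read from offset 0
afterwards, the way sysfs attributes behave.

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Read the current value from an already-open descriptor.
 * Returns 1 (populated), 0 (empty) or -1 on error.
 * Rewinds first, because every re-read has to start at offset 0.
 */
int read_populated(int fd)
{
	char buf[8];
	ssize_t n;

	if (lseek(fd, 0, SEEK_SET) < 0)
		return -1;
	n = read(fd, buf, sizeof(buf) - 1);
	if (n <= 0)
		return -1;
	if (buf[0] == '0')
		return 0;
	if (buf[0] == '1')
		return 1;
	return -1;
}

/*
 * Block until a change is signalled or timeout_ms elapses (-1 waits
 * forever). Returns the new value, or -1 on timeout or error.
 * ASSUMPTION: the change notification shows up as POLLPRI/POLLERR,
 * as it does for sysfs attributes.
 */
int wait_populated_change(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLPRI };
	int ret;

	/* Consume the current contents so poll() waits for the next change. */
	if (read_populated(fd) < 0)
		return -1;
	ret = poll(&pfd, 1, timeout_ms);
	if (ret <= 0)
		return -1;
	return read_populated(fd);
}
```

A manager for a delegated subtree would open cgroup.subtree_populated in
the cgroup it owns with O_RDONLY and call wait_populated_change() in a
loop, removing the subtree once it returns 0.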
> Signed-off-by: Tejun Heo <tj@...nel.org>
> Cc: Serge Hallyn <serge.hallyn@...ntu.com>
Acked-by: Serge Hallyn <serge.hallyn@...ntu.com>
> Cc: Lennart Poettering <lennart@...ttering.net>
> ---
> include/linux/cgroup.h | 15 ++++++++++++
> kernel/cgroup.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++----
> 2 files changed, 76 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index dee6f3c..e45d87f 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -154,6 +154,14 @@ struct cgroup {
> /* the number of attached css's */
> int nr_css;
>
> + /*
> + * If this cgroup contains any tasks, it contributes one to
> + * populated_cnt. All children with non-zero populated_cnt of
> + * their own contribute one. The count is zero iff there's no task
> + * in this cgroup or its subtree.
> + */
> + int populated_cnt;
> +
> atomic_t refcnt;
>
> /*
> @@ -166,6 +174,7 @@ struct cgroup {
> struct cgroup *parent; /* my parent */
> struct kernfs_node *kn; /* cgroup kernfs entry */
> struct kernfs_node *control_kn; /* kn for "cgroup.subtree_control" */
> + struct kernfs_node *populated_kn; /* kn for "cgroup.subtree_populated" */
>
> /*
> * Monotonically increasing unique serial number which defines a
> @@ -264,6 +273,12 @@ enum {
> *
> * - "cgroup.clone_children" is removed.
> *
> + * - "cgroup.subtree_populated" is available. Its value is 0 if
> + * the cgroup and its descendants contain no task; otherwise, 1.
> + * The file also generates kernfs notification which can be
> + * monitored through poll and [di]notify when the value of the
> + * file changes.
> + *
> * - If mount is requested with sane_behavior but without any
> * subsystem, the default unified hierarchy is mounted.
> *
> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
> index 4e958c7..17f0a09 100644
> --- a/kernel/cgroup.c
> +++ b/kernel/cgroup.c
> @@ -411,6 +411,43 @@ static struct css_set init_css_set = {
>
> static int css_set_count = 1; /* 1 for init_css_set */
>
> +/**
> + * cgroup_update_populated - update populated count of a cgroup
> + * @cgrp: the target cgroup
> + * @populated: inc or dec populated count
> + *
> + * @cgrp is either getting the first task (css_set) or losing the last.
> + * Update @cgrp->populated_cnt accordingly. The count is propagated
> + * towards root so that a given cgroup's populated_cnt is zero iff the
> + * cgroup and all its descendants are empty.
> + *
> + * @cgrp's interface file "cgroup.subtree_populated" is zero if
> + * @cgrp->populated_cnt is zero and 1 otherwise. When @cgrp->populated_cnt
> + * changes from or to zero, userland is notified that the content of the
> + * interface file has changed. This can be used to detect when @cgrp and
> + * its descendants become populated or empty.
> + */
> +static void cgroup_update_populated(struct cgroup *cgrp, bool populated)
> +{
> + lockdep_assert_held(&css_set_rwsem);
> +
> + do {
> + bool trigger;
> +
> + if (populated)
> + trigger = !cgrp->populated_cnt++;
> + else
> + trigger = !--cgrp->populated_cnt;
> +
> + if (!trigger)
> + break;
> +
> + if (cgrp->populated_kn)
> + kernfs_notify(cgrp->populated_kn);
> + cgrp = cgrp->parent;
> + } while (cgrp);
> +}
> +
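To make the propagation rule above easier to follow, here is a small
userspace model of the same walk. It is only a sketch: struct node and
update_populated() are hypothetical stand-ins for struct cgroup and
cgroup_update_populated(), and a plain counter replaces kernfs_notify().
Unlike the patch, the model counts a notification on every node,
including the root, which has no interface file in the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cgroup; only what the walk needs. */
struct node {
	struct node *parent;
	int populated_cnt;	/* own tasks (0 or 1) + populated children */
	int notifications;	/* stands in for kernfs_notify() calls */
};

/*
 * Same loop shape as the patch: walk towards the root and stop as soon
 * as a count does not cross zero, since ancestors are then unaffected.
 */
void update_populated(struct node *n, bool populated)
{
	do {
		bool trigger;

		if (populated) {
			/* 0 -> 1 means this node just became populated */
			trigger = !n->populated_cnt++;
		} else {
			assert(n->populated_cnt > 0);
			/* 1 -> 0 means this node just became empty */
			trigger = !--n->populated_cnt;
		}
		if (!trigger)
			break;

		n->notifications++;
		n = n->parent;
	} while (n);
}
```

With a parent that has two children, populating the first child
notifies the child and every ancestor. Populating the second child only
notifies that child, because the parent's count goes from 1 to 2 and
the walk stops there.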
> /*
> * hash table for cgroup groups. This improves the performance to find
> * an existing css_set. This hash doesn't (currently) take into
> @@ -456,10 +493,13 @@ static void put_css_set_locked(struct css_set *cset, bool taskexit)
> list_del(&link->cgrp_link);
>
> /* @cgrp can't go away while we're holding css_set_rwsem */
> - if (list_empty(&cgrp->cset_links) && notify_on_release(cgrp)) {
> - if (taskexit)
> - set_bit(CGRP_RELEASABLE, &cgrp->flags);
> - check_for_release(cgrp);
> + if (list_empty(&cgrp->cset_links)) {
> + cgroup_update_populated(cgrp, false);
> + if (notify_on_release(cgrp)) {
> + if (taskexit)
> + set_bit(CGRP_RELEASABLE, &cgrp->flags);
> + check_for_release(cgrp);
> + }
> }
>
> kfree(link);
> @@ -668,7 +708,11 @@ static void link_css_set(struct list_head *tmp_links, struct css_set *cset,
> link = list_first_entry(tmp_links, struct cgrp_cset_link, cset_link);
> link->cset = cset;
> link->cgrp = cgrp;
> +
> + if (list_empty(&cgrp->cset_links))
> + cgroup_update_populated(cgrp, true);
> list_move(&link->cset_link, &cgrp->cset_links);
> +
> /*
> * Always add links to the tail of the list so that the list
> * is sorted by order of hierarchy creation
> @@ -2633,6 +2677,12 @@ err_undo_css:
> goto out_unlock;
> }
>
> +static int cgroup_subtree_populated_show(struct seq_file *seq, void *v)
> +{
> + seq_printf(seq, "%d\n", (bool)seq_css(seq)->cgroup->populated_cnt);
> + return 0;
> +}
> +
> static ssize_t cgroup_file_write(struct kernfs_open_file *of, char *buf,
> size_t nbytes, loff_t off)
> {
> @@ -2775,6 +2825,8 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cftype *cft)
> NULL, false, key);
> if (cft->seq_show == cgroup_subtree_control_show)
> cgrp->control_kn = kn;
> + else if (cft->seq_show == cgroup_subtree_populated_show)
> + cgrp->populated_kn = kn;
> return PTR_ERR_OR_ZERO(kn);
> }
>
> @@ -3883,6 +3935,11 @@ static struct cftype cgroup_base_files[] = {
> .seq_show = cgroup_subtree_control_show,
> .write_string = cgroup_subtree_control_write,
> },
> + {
> + .name = "cgroup.subtree_populated",
> + .flags = CFTYPE_ONLY_ON_DFL | CFTYPE_NOT_ON_ROOT,
> + .seq_show = cgroup_subtree_populated_show,
> + },
>
> /*
> * Historical crazy stuff. These don't have "cgroup." prefix and
> --
> 1.9.0
>