Message-ID: <e9682f27-a23c-802c-daed-e49fc9d4613c@redhat.com>
Date: Fri, 11 Aug 2017 16:19:07 -0400
From: Waiman Long <longman@...hat.com>
To: Tejun Heo <tj@...nel.org>, lizefan@...wei.com, hannes@...xchg.org,
peterz@...radead.org, mingo@...hat.com
Cc: cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
kernel-team@...com, pjt@...gle.com, luto@...capital.net,
efault@....de, torvalds@...ux-foundation.org, guro@...com
Subject: Re: [PATCH 3/3] cgroup: Implement cgroup2 basic CPU usage accounting
On 08/11/2017 12:37 PM, Tejun Heo wrote:
> In cgroup1, while cpuacct isn't actually controlling any resources, it
> is a separate controller due to a combination of two factors -
> 1. enabling the cpu controller has significant side effects, and 2. we
> have to pick one of the hierarchies to account CPU usages on. cpuacct
> controller is effectively used to designate a hierarchy to track CPU
> usages on.
>
> cgroup2's unified hierarchy removes the second reason and we can
> account basic CPU usages by default. While we can use cpuacct for
> this purpose, both its interface and implementation leave a lot to be
> desired - it collects and exposes two sources of truth which don't
> agree with each other and some of the exposed statistics don't make
> much sense. Also, it propagates all the way up the hierarchy on each
> accounting event which is unnecessary.
>
> This patch adds basic resource accounting mechanism to cgroup2's
> unified hierarchy and accounts CPU usages using it.
>
> * All accounting is done per-cpu and doesn't propagate immediately.
> It just bumps the per-cgroup per-cpu counters and links the cgroup
> to the parent's updated list if not already on it.
>
> * On a read, the per-cpu counters are collected into the global ones
> and then propagated upwards. Only the per-cpu counters which have
> changed since the last read are propagated.
>
> * CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
> prefix. Total usage is collected from scheduling events. User/sys
> breakdown is sourced from tick sampling and adjusted to the usage
> using cputime_adjuts().
Typo: cputime_adjust()
> This keeps the accounting side hot path O(1) and per-cpu and the read
> side O(nr_updated_since_last_read).
>
> Signed-off-by: Tejun Heo <tj@...nel.org>
> Cc: Ingo Molnar <mingo@...hat.com>
> Cc: Peter Zijlstra <peterz@...radead.org>
> Cc: Li Zefan <lizefan@...wei.com>
> Cc: Johannes Weiner <hannes@...xchg.org>
> ---
> Documentation/cgroup-v2.txt | 9 ++
> include/linux/cgroup-defs.h | 47 ++++++
> include/linux/cgroup.h | 22 +++
> kernel/cgroup/Makefile | 2 +-
> kernel/cgroup/cgroup-internal.h | 8 +
> kernel/cgroup/cgroup.c | 24 ++-
> kernel/cgroup/stat.c | 333 ++++++++++++++++++++++++++++++++++++++++
> 7 files changed, 442 insertions(+), 3 deletions(-)
> create mode 100644 kernel/cgroup/stat.c
>
> diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
> index dc44785..3f82169 100644
> --- a/Documentation/cgroup-v2.txt
> +++ b/Documentation/cgroup-v2.txt
> @@ -886,6 +886,15 @@ All cgroup core files are prefixed with "cgroup."
> A dying cgroup can consume system resources not exceeding
> limits, which were active at the moment of cgroup deletion.
>
> + cpu.usage_usec
> + CPU time consumed in the subtree.
> +
> + cpu.user_usec
> + User CPU time consumed in the subtree.
> +
> + cpu.system_usec
> + System CPU time consumed in the subtree.
> +
>
> Controllers
> ===========
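As a side note for anyone following along: with the above, reading
"cgroup.stat" on a cgroup in the unified hierarchy should then show
something along these lines (values purely illustrative, other fields
omitted):

  # cat /sys/fs/cgroup/test/cgroup.stat
  nr_dying_descendants 0
  cpu.usage_usec 1234567
  cpu.user_usec 1000000
  cpu.system_usec 234567
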
> diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
> index 59e4ad9..17da9c8 100644
> --- a/include/linux/cgroup-defs.h
> +++ b/include/linux/cgroup-defs.h
> @@ -16,6 +16,7 @@
> #include <linux/refcount.h>
> #include <linux/percpu-refcount.h>
> #include <linux/percpu-rwsem.h>
> +#include <linux/u64_stats_sync.h>
> #include <linux/workqueue.h>
> #include <linux/bpf-cgroup.h>
>
> @@ -249,6 +250,47 @@ struct css_set {
> struct rcu_head rcu_head;
> };
>
> +/*
> + * cgroup basic resource usage statistics. Accounting is done per-cpu in
> + * cgroup_cpu_stat which is then lazily propagated up the hierarchy on
> + * reads. The propagation is selective - only the cgroup_cpu_stats which
> + * have been updated since the last propagation are propagated.
> + */
> +struct cgroup_cpu_stat {
> + /*
> + * ->sync protects all the current counters. These are the only
> + * fields which get updated in the hot path.
> + */
> + struct u64_stats_sync sync;
> + struct task_cputime cputime;
> +
> + /*
> + * Snapshots at the last reading. These are used to calculate the
> + * deltas to propagate to the global counters.
> + */
> + struct task_cputime last_cputime;
> +
> + /*
> + * Child cgroups with stat updates on this cpu since the last read
> + * are linked on the parent's ->updated_children through
> + * ->updated_next.
> + *
> + * In addition to being more compact, a singly-linked list pointing
> + * to the cgroup makes it unnecessary for each per-cpu struct to
> + * point back to the associated cgroup.
> + *
> + * Protected by per-cpu cgroup_cpu_stat_lock.
> + */
> + struct cgroup *updated_children; /* terminated by self cgroup */
> + struct cgroup *updated_next; /* NULL iff not on the list */
> +};
> +
> +struct cgroup_stat {
> + /* per-cpu statistics are collected into the following global counters */
> + struct task_cputime cputime;
> + struct prev_cputime prev_cputime;
> +};
> +
> struct cgroup {
> /* self css with NULL ->ss, points back to this cgroup */
> struct cgroup_subsys_state self;
> @@ -348,6 +390,11 @@ struct cgroup {
> */
> struct cgroup *dom_cgrp;
>
> + /* cgroup basic resource statistics */
> + struct cgroup_cpu_stat __percpu *cpu_stat;
> + struct cgroup_stat pending_stat; /* pending from children */
> + struct cgroup_stat stat;
> +
> /*
> * list of pidlists, up to two for each namespace (one for procs, one
> * for tasks); created on demand.
> diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
> index f395e02..4c82212 100644
> --- a/include/linux/cgroup.h
> +++ b/include/linux/cgroup.h
> @@ -689,17 +689,39 @@ static inline void cpuacct_account_field(struct task_struct *tsk, int index,
> u64 val) {}
> #endif
>
> +void cgroup_stat_show_cputime(struct seq_file *seq, const char *prefix);
> +
> +void __cgroup_account_cputime(struct cgroup *cgrp, u64 delta_exec);
> +void __cgroup_account_cputime_field(struct cgroup *cgrp,
> + enum cpu_usage_stat index, u64 delta_exec);
> +
> static inline void cgroup_account_cputime(struct task_struct *task,
> u64 delta_exec)
> {
> + struct cgroup *cgrp;
> +
> cpuacct_charge(task, delta_exec);
> +
> + rcu_read_lock();
> + cgrp = task_dfl_cgroup(task);
> + if (cgroup_parent(cgrp))
> + __cgroup_account_cputime(cgrp, delta_exec);
> + rcu_read_unlock();
> }
>
> static inline void cgroup_account_cputime_field(struct task_struct *task,
> enum cpu_usage_stat index,
> u64 delta_exec)
> {
> + struct cgroup *cgrp;
> +
> cpuacct_account_field(task, index, delta_exec);
> +
> + rcu_read_lock();
> + cgrp = task_dfl_cgroup(task);
> + if (cgroup_parent(cgrp))
> + __cgroup_account_cputime_field(cgrp, index, delta_exec);
> + rcu_read_unlock();
> }
>
> #else /* CONFIG_CGROUPS */
> diff --git a/kernel/cgroup/Makefile b/kernel/cgroup/Makefile
> index ce693cc..0acee61 100644
> --- a/kernel/cgroup/Makefile
> +++ b/kernel/cgroup/Makefile
> @@ -1,4 +1,4 @@
> -obj-y := cgroup.o namespace.o cgroup-v1.o
> +obj-y := cgroup.o stat.o namespace.o cgroup-v1.o
>
> obj-$(CONFIG_CGROUP_FREEZER) += freezer.o
> obj-$(CONFIG_CGROUP_PIDS) += pids.o
> diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
> index c167a40..f17da09 100644
> --- a/kernel/cgroup/cgroup-internal.h
> +++ b/kernel/cgroup/cgroup-internal.h
> @@ -197,6 +197,14 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node,
> int cgroup_task_count(const struct cgroup *cgrp);
>
> /*
> + * stat.c
> + */
> +void cgroup_stat_flush(struct cgroup *cgrp);
> +int cgroup_stat_init(struct cgroup *cgrp);
> +void cgroup_stat_exit(struct cgroup *cgrp);
> +void cgroup_stat_boot(void);
> +
> +/*
> * namespace.c
> */
> extern const struct proc_ns_operations cgroupns_operations;
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index c038ccf..a399a55 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -142,12 +142,14 @@ static struct static_key_true *cgroup_subsys_on_dfl_key[] = {
> };
> #undef SUBSYS
>
> +static DEFINE_PER_CPU(struct cgroup_cpu_stat, cgrp_dfl_root_cpu_stat);
> +
> /*
> * The default hierarchy, reserved for the subsystems that are otherwise
> * unattached - it never has more than a single cgroup, and all tasks are
> * part of that cgroup.
> */
> -struct cgroup_root cgrp_dfl_root;
> +struct cgroup_root cgrp_dfl_root = { .cgrp.cpu_stat = &cgrp_dfl_root_cpu_stat };
> EXPORT_SYMBOL_GPL(cgrp_dfl_root);
>
> /*
> @@ -3296,6 +3298,8 @@ static int cgroup_stat_show(struct seq_file *seq, void *v)
> seq_printf(seq, "nr_dying_descendants %d\n",
> cgroup->nr_dying_descendants);
>
> + cgroup_stat_show_cputime(seq, "cpu.");
> +
> return 0;
> }
>
> @@ -4466,6 +4470,8 @@ static void css_free_work_fn(struct work_struct *work)
> */
> cgroup_put(cgroup_parent(cgrp));
> kernfs_put(cgrp->kn);
> + if (cgroup_on_dfl(cgrp))
> + cgroup_stat_exit(cgrp);
> kfree(cgrp);
> } else {
> /*
> @@ -4510,6 +4516,9 @@ static void css_release_work_fn(struct work_struct *work)
> /* cgroup release path */
> trace_cgroup_release(cgrp);
>
> + if (cgroup_on_dfl(cgrp))
> + cgroup_stat_flush(cgrp);
> +
> for (tcgrp = cgroup_parent(cgrp); tcgrp;
> tcgrp = cgroup_parent(tcgrp))
> tcgrp->nr_dying_descendants--;
> @@ -4696,6 +4705,12 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
> if (ret)
> goto out_free_cgrp;
>
> + if (cgroup_on_dfl(parent)) {
> + ret = cgroup_stat_init(cgrp);
> + if (ret)
> + goto out_cancel_ref;
> + }
> +
> /*
> * Temporarily set the pointer to NULL, so idr_find() won't return
> * a half-baked cgroup.
> @@ -4703,7 +4718,7 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
> cgrp->id = cgroup_idr_alloc(&root->cgroup_idr, NULL, 2, 0, GFP_KERNEL);
> if (cgrp->id < 0) {
> ret = -ENOMEM;
> - goto out_cancel_ref;
> + goto out_stat_exit;
> }
>
> init_cgroup_housekeeping(cgrp);
> @@ -4752,6 +4767,9 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
>
> return cgrp;
>
> +out_stat_exit:
> + if (cgroup_on_dfl(parent))
> + cgroup_stat_exit(cgrp);
> out_cancel_ref:
> percpu_ref_exit(&cgrp->self.refcnt);
> out_free_cgrp:
> @@ -5146,6 +5164,8 @@ int __init cgroup_init(void)
> BUG_ON(cgroup_init_cftypes(NULL, cgroup_base_files));
> BUG_ON(cgroup_init_cftypes(NULL, cgroup1_base_files));
>
> + cgroup_stat_boot();
> +
> /*
> * The latency of the synchronize_sched() is too high for cgroups,
> * avoid it at the cost of forcing all readers into the slow path.
> diff --git a/kernel/cgroup/stat.c b/kernel/cgroup/stat.c
> new file mode 100644
> index 0000000..19a10b2
> --- /dev/null
> +++ b/kernel/cgroup/stat.c
> @@ -0,0 +1,333 @@
> +#include "cgroup-internal.h"
> +
> +#include <linux/sched/cputime.h>
> +
> +static DEFINE_MUTEX(cgroup_stat_mutex);
> +static DEFINE_PER_CPU(raw_spinlock_t, cgroup_cpu_stat_lock);
If the hierarchy is large enough and the stat data hasn't been read
recently, it may take a while to accumulate all the stat data even for
one cpu in cgroup_stat_flush_locked(). So I think it would make more
sense to use a regular spinlock_t instead of raw_spinlock_t.
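
Something like this minimal (untested) sketch of what I have in mind,
with the lock type being the only change:

	static DEFINE_PER_CPU(spinlock_t, cgroup_cpu_stat_lock);

	static void cgroup_cpu_stat_updated(struct cgroup *cgrp, int cpu)
	{
		spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_cpu_stat_lock, cpu);
		unsigned long flags;

		/* speculative already-on-list test, same as in the patch */
		if (cgroup_cpu_stat(cgrp, cpu)->updated_next)
			return;

		spin_lock_irqsave(cpu_lock, flags);
		/* link @cgrp and its ancestors exactly as in the patch */
		spin_unlock_irqrestore(cpu_lock, flags);
	}

and, assuming cgroup_stat_boot() is where the per-cpu locks get
initialized, it would switch to spin_lock_init() as well.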
> +
> +static struct cgroup_cpu_stat *cgroup_cpu_stat(struct cgroup *cgrp, int cpu)
> +{
> + return per_cpu_ptr(cgrp->cpu_stat, cpu);
> +}
> +
> +/**
> + * cgroup_cpu_stat_updated - keep track of updated cpu_stat
> + * @cgrp: target cgroup
> + * @cpu: cpu on which cpu_stat was updated
> + *
> + * @cgrp's cpu_stat on @cpu was updated. Put it on the parent's matching
> + * cpu_stat->updated_children list.
> + */
> +static void cgroup_cpu_stat_updated(struct cgroup *cgrp, int cpu)
> +{
> + raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_cpu_stat_lock, cpu);
> + struct cgroup *parent;
> + unsigned long flags;
> +
> + /*
> + * Speculative already-on-list test. This may race leading to
> + * temporary inaccuracies, which is fine.
> + *
> + * Because @parent's updated_children is terminated with @parent
> + * instead of NULL, we can tell whether @cgrp is on the list by
> + * testing the next pointer for NULL.
> + */
> + if (cgroup_cpu_stat(cgrp, cpu)->updated_next)
> + return;
> +
> + raw_spin_lock_irqsave(cpu_lock, flags);
> +
> + /* put @cgrp and all ancestors on the corresponding updated lists */
> + for (parent = cgroup_parent(cgrp); parent;
> + cgrp = parent, parent = cgroup_parent(cgrp)) {
> + struct cgroup_cpu_stat *cstat = cgroup_cpu_stat(cgrp, cpu);
> + struct cgroup_cpu_stat *pcstat = cgroup_cpu_stat(parent, cpu);
> +
> + /*
> + * Both additions and removals are bottom-up. If a cgroup
> + * is already in the tree, all ancestors are.
> + */
> + if (cstat->updated_next)
> + break;
> +
> + cstat->updated_next = pcstat->updated_children;
> + pcstat->updated_children = cgrp;
> + }
> +
> + raw_spin_unlock_irqrestore(cpu_lock, flags);
> +}
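It took me a moment to work out the linkage above, so to spell it out
for other readers (a sketch, using a hypothetical hierarchy): given
root -> A -> B, with every updated_children initially pointing at its
owning cgroup (the "terminated by self" convention), the first update
of B on some cpu leaves that cpu's state as

	A's updated_children    = B,	B's updated_next = A
	root's updated_children = A,	A's updated_next = root

i.e. each per-cpu list is terminated by the cgroup that owns it, which
is what makes the speculative NULL test at the top of this function
work.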
> +
> +/**
> + * cgroup_cpu_stat_next_updated - iterate cpu_stat updated tree
> + * @pos: current position
> + * @root: root of the tree to traverse
> + * @cpu: target cpu
> + *
> + * Walks the updated cpu_stat tree on @cpu from @root. %NULL @pos starts
> + * the traversal and %NULL return indicates the end. During traversal,
> + * each returned cgroup is unlinked from the tree. Must be called with the
> + * matching cgroup_cpu_stat_lock held.
> + *
> + * The only ordering guarantee is that, for a parent and a child pair
> + * covered by a given traversal, if a child is visited, its parent is
> + * guaranteed to be visited afterwards.
> + */
> +static struct cgroup *cgroup_cpu_stat_next_updated(struct cgroup *pos,
> + struct cgroup *root, int cpu)
This function unwinds one cgrp from the updated_children and
updated_next linkage; it is essentially the opposite of
cgroup_cpu_stat_updated(). I just feel that its name isn't intuitive
enough to convey what it is doing. Maybe use a name like
cgroup_cpu_stat_unlink_one() to match cgroup_cpu_stat_flush_one().
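
i.e. something like this (same code, just renamed):

	static struct cgroup *cgroup_cpu_stat_unlink_one(struct cgroup *pos,
							 struct cgroup *root,
							 int cpu)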
Cheers,
Longman