linux-kernel - Re: Help Resource Counters Scale Better (v3)

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20090810093229.10db7185.kamezawa.hiroyu@jp.fujitsu.com>
Date:	Mon, 10 Aug 2009 09:32:29 +0900
From:	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
To:	balbir@...ux.vnet.ibm.com
Cc:	Andrew Morton <akpm@...ux-foundation.org>, andi.kleen@...el.com,
	Prarit Bhargava <prarit@...hat.com>,
	KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
	"lizf@...fujitsu.com" <lizf@...fujitsu.com>,
	"menage@...gle.com" <menage@...gle.com>,
	Pavel Emelianov <xemul@...nvz.org>,
	"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
	"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: Help Resource Counters Scale Better (v3)

On Sun, 9 Aug 2009 17:45:30 +0530
Balbir Singh <balbir@...ux.vnet.ibm.com> wrote:

> Hi, 
> 
> Thanks for the detailed review, here is v3 of the patches against
> mmotm 6th August. I've documented the TODOs as well. If there are
> no major objections, I would like this to be included in mmotm
> for more testing. Any test reports on a large machine would be highly
> appreciated.
> 
> From: Balbir Singh <balbir@...ux.vnet.ibm.com>
> 
> Changelog v3->v2
> 
> 1. Added more documentation and comments
> 2. Made the check in mem_cgroup_set_limit strict
> 3. Increased tolerance per cpu to 64KB.
> 4. Still have the WARN_ON(), I've kept it for debugging
>    purposes, may be we should make it a conditional with
>    DEBUG_VM
> 
Because I'll be absent for a while, I don't give any Reviewed-by or Acked-by, now.

Before leaving, I'd like to write some concerns here.

1. you use res_counter_read_positive() in force_empty. It seems force_empty can
   go into infinite loop. plz check. (especially when some pages are freed or swapped-in
   in other cpu while force_empry runs.)

2. In near future, we'll see 256 or 1024 cpus on a system, anyway.
   Assume 1024cpu system, 64k*1024=64M is a tolerance.
   Can't we calculate max-tolerane as following ?
  
   tolerance = min(64k * num_online_cpus(), limit_in_bytes/100);
   tolerance /= num_online_cpus();
   per_cpu_tolerance = min(16k, tolelance);

   I think automatic runtine adjusting of tolerance will be finally necessary,
   but above will not be very bad because we can guarantee 1% tolerance.

Thx,
-Kame

> Changelog v2->v1
> 
> 1. Updated Documentation (cgroups.txt and resource_counters.txt)
> 2. Added the notion of tolerance to resource counter initialization
> 
> Enhancement: For scalability move the resource counter to a percpu counter
> 
> This patch changes the usage field of a resource counter to a percpu
> counter. The counter is incremented with local irq disabled. The other
> fields are still protected by the spin lock for write.
> 
> This patch adds a fuzziness factor to hard limit, since the value we read
> could be off the original value (by batch value), this can be fixed
> by adding a strict/non-strict functionality check. The intention is
> to turn of strict checking for root (since we can't set limits on
> it anyway).
> 
> I tested this patch on my x86_64 box with a regular test for hard
> limits and a page fault program.
> 
> Signed-off-by: Balbir Singh <balbir@...ux.vnet.ibm.com>
> ---
> 
>  Documentation/cgroups/memory.txt           |   22 ++++++++++++
>  Documentation/cgroups/resource_counter.txt |   20 +++++++++--
>  include/linux/res_counter.h                |   52 ++++++++++++++++++----------
>  kernel/res_counter.c                       |   50 +++++++++++++++++----------
>  mm/memcontrol.c                            |   32 +++++++++++++----
>  5 files changed, 128 insertions(+), 48 deletions(-)
> 
> 
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index b871f25..a24dab7 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -13,6 +13,9 @@ c. Provides *zero overhead* for non memory controller users
>  d. Provides a double LRU: global memory pressure causes reclaim from the
>     global LRU; a cgroup on hitting a limit, reclaims from the per
>     cgroup LRU
> +   NOTE: One can no longer rely on the exact limit. Since we've moved
> +   to using percpu_counters for resource counters, there is always going
> +   to be a fuzziness factor depending on the batch value.
>  
>  Benefits and Purpose of the memory controller
>  
> @@ -422,6 +425,25 @@ NOTE2: It is recommended to set the soft limit always below the hard limit,
>  4. Start reclamation in the background when the limit is
>     not yet hit but the usage is getting closer
>  
> +9. Scalability Tradeoff
> +
> +As documented in Documentation/cgroups/resource_counter.txt, we've
> +moved over to percpu counters for accounting usage. What does this
> +mean for the end user?
> +
> +1. It means better performance
> +2. It means that the usage reported does not necessarily reflect
> +   realty. Because percpu counters do a sync only so often (see
> +   batch value in the code), the value reported might be off the
> +   real value by an amount proportional to the specified tolerenace.
> +   The tolerance value is currently stored internally.
> +
> +TODOs
> +
> +1. Move tolerance to a config option
> +2. Add support for strict/non-strict accounting
> +
> +
>  Summary
>  
>  Overall, the memory controller has been a stable controller and has been
> diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
> index 95b24d7..a43ea60 100644
> --- a/Documentation/cgroups/resource_counter.txt
> +++ b/Documentation/cgroups/resource_counter.txt
> @@ -12,12 +12,15 @@ to work with it.
>  
>  1. Crucial parts of the res_counter structure
>  
> - a. unsigned long long usage
> + a. percpu_counter usage
>  
>   	The usage value shows the amount of a resource that is consumed
>  	by a group at a given time. The units of measurement should be
>  	determined by the controller that uses this counter. E.g. it can
>  	be bytes, items or any other unit the controller operates on.
> +	NOTE: being a percpu_counter, the way to read the correct value
> +	at all times makes it unscalable and reading it scalably makes
> +	the value a little unreliable :)
>  
>   b. unsigned long long max_usage
>  
> @@ -39,16 +42,24 @@ to work with it.
>   	The failcnt stands for "failures counter". This is the number of
>  	resource allocation attempts that failed.
>  
> - c. spinlock_t lock
> + e. spinlock_t lock
>  
>   	Protects changes of the above values.
>  
> + f. unsigned long tolerance
> +
> +	This value is used to keep track of the amount of error that might
> +	be tolerated by the resource counter. See the NOTE in (a) above.
> +	The tolerance value is per cpu, hence the total error at any time
> +	can be nr_cpu_ids * tolerance.
> +
>  
>  
>  2. Basic accounting routines
>  
>   a. void res_counter_init(struct res_counter *rc,
> -				struct res_counter *rc_parent)
> +				struct res_counter *rc_parent,
> +				unsigned long tolerance)
>  
>   	Initializes the resource counter. As usual, should be the first
>  	routine called for a new counter.
> @@ -57,6 +68,9 @@ to work with it.
>  	child -> parent relationship directly in the res_counter structure,
>  	NULL can be used to define no relationship.
>  
> +	The tolerance is used to control the batching behaviour of percpu
> +	counters. Please see details in Section 1, item f above.
> +
>   c. int res_counter_charge(struct res_counter *rc, unsigned long val,
>  				struct res_counter **limit_fail_at)
>  
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index 731af71..3728c0d 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -14,6 +14,7 @@
>   */
>  
>  #include <linux/cgroup.h>
> +#include <linux/percpu_counter.h>
>  
>  /*
>   * The core object. the cgroup that wishes to account for some
> @@ -23,10 +24,6 @@
>  
>  struct res_counter {
>  	/*
> -	 * the current resource consumption level
> -	 */
> -	unsigned long long usage;
> -	/*
>  	 * the maximal value of the usage from the counter creation
>  	 */
>  	unsigned long long max_usage;
> @@ -48,6 +45,14 @@ struct res_counter {
>  	 */
>  	spinlock_t lock;
>  	/*
> +	 * the current resource consumption level
> +	 */
> +	struct percpu_counter usage;
> +	/*
> +	 * Tolerance for the percpu_counter (usage) above
> +	 */
> +	unsigned long usage_tolerance;
> +	/*
>  	 * Parent counter, used for hierarchial resource accounting
>  	 */
>  	struct res_counter *parent;
> @@ -98,7 +103,8 @@ enum {
>   * helpers for accounting
>   */
>  
> -void res_counter_init(struct res_counter *counter, struct res_counter *parent);
> +void res_counter_init(struct res_counter *counter, struct res_counter *parent,
> +			unsigned long usage_tolerance);
>  
>  /*
>   * charge - try to consume more resource.
> @@ -133,7 +139,8 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val,
>  
>  static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
>  {
> -	if (cnt->usage < cnt->limit)
> +	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
> +	if (usage < cnt->limit)
>  		return true;
>  
>  	return false;
> @@ -141,7 +148,8 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
>  
>  static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
>  {
> -	if (cnt->usage < cnt->soft_limit)
> +	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
> +	if (usage < cnt->soft_limit)
>  		return true;
>  
>  	return false;
> @@ -157,15 +165,19 @@ static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
>  static inline unsigned long long
>  res_counter_soft_limit_excess(struct res_counter *cnt)
>  {
> -	unsigned long long excess;
> -	unsigned long flags;
> +	unsigned long long excess, usage;
>  
> -	spin_lock_irqsave(&cnt->lock, flags);
> -	if (cnt->usage <= cnt->soft_limit)
> +	usage = percpu_counter_read_positive(&cnt->usage);
> +	/*
> +	 * Not all callers call with irq's disabled, make
> +	 * sure we read out something sensible.
> +	 */
> +	preempt_disable();
> +	if (usage <= cnt->soft_limit)
>  		excess = 0;
>  	else
> -		excess = cnt->usage - cnt->soft_limit;
> -	spin_unlock_irqrestore(&cnt->lock, flags);
> +		excess = usage - cnt->soft_limit;
> +	preempt_enable();
>  	return excess;
>  }
>  
> @@ -178,9 +190,9 @@ static inline bool res_counter_check_under_limit(struct res_counter *cnt)
>  	bool ret;
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&cnt->lock, flags);
> +	local_irq_save(flags);
>  	ret = res_counter_limit_check_locked(cnt);
> -	spin_unlock_irqrestore(&cnt->lock, flags);
> +	local_irq_restore(flags);
>  	return ret;
>  }
>  
> @@ -189,18 +201,19 @@ static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
>  	bool ret;
>  	unsigned long flags;
>  
> -	spin_lock_irqsave(&cnt->lock, flags);
> +	local_irq_save(flags);
>  	ret = res_counter_soft_limit_check_locked(cnt);
> -	spin_unlock_irqrestore(&cnt->lock, flags);
> +	local_irq_restore(flags);
>  	return ret;
>  }
>  
>  static inline void res_counter_reset_max(struct res_counter *cnt)
>  {
>  	unsigned long flags;
> +	unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
>  
>  	spin_lock_irqsave(&cnt->lock, flags);
> -	cnt->max_usage = cnt->usage;
> +	cnt->max_usage = usage;
>  	spin_unlock_irqrestore(&cnt->lock, flags);
>  }
>  
> @@ -217,10 +230,11 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
>  		unsigned long long limit)
>  {
>  	unsigned long flags;
> +	unsigned long long usage = percpu_counter_sum_positive(&cnt->usage);
>  	int ret = -EBUSY;
>  
>  	spin_lock_irqsave(&cnt->lock, flags);
> -	if (cnt->usage <= limit) {
> +	if (usage <= limit) {
>  		cnt->limit = limit;
>  		ret = 0;
>  	}
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 88faec2..ae83168 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -15,24 +15,34 @@
>  #include <linux/uaccess.h>
>  #include <linux/mm.h>
>  
> -void res_counter_init(struct res_counter *counter, struct res_counter *parent)
> +void res_counter_init(struct res_counter *counter, struct res_counter *parent,
> +			unsigned long usage_tolerance)
>  {
>  	spin_lock_init(&counter->lock);
> +	percpu_counter_init(&counter->usage, 0);
>  	counter->limit = RESOURCE_MAX;
>  	counter->soft_limit = RESOURCE_MAX;
>  	counter->parent = parent;
> +	counter->usage_tolerance = usage_tolerance;
>  }
>  
>  int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
>  {
> -	if (counter->usage + val > counter->limit) {
> +	unsigned long long usage;
> +
> +	usage = percpu_counter_read_positive(&counter->usage);
> +	if (usage + val > counter->limit) {
>  		counter->failcnt++;
>  		return -ENOMEM;
>  	}
>  
> -	counter->usage += val;
> -	if (counter->usage > counter->max_usage)
> -		counter->max_usage = counter->usage;
> +	__percpu_counter_add(&counter->usage, val, nr_cpu_ids *
> +				counter->usage_tolerance);
> +	if (usage + val > counter->max_usage) {
> +		spin_lock(&counter->lock);
> +		counter->max_usage = (usage + val);
> +		spin_unlock(&counter->lock);
> +	}
>  	return 0;
>  }
>  
> @@ -49,7 +59,6 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
>  		*soft_limit_fail_at = NULL;
>  	local_irq_save(flags);
>  	for (c = counter; c != NULL; c = c->parent) {
> -		spin_lock(&c->lock);
>  		ret = res_counter_charge_locked(c, val);
>  		/*
>  		 * With soft limits, we return the highest ancestor
> @@ -58,7 +67,6 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
>  		if (soft_limit_fail_at &&
>  			!res_counter_soft_limit_check_locked(c))
>  			*soft_limit_fail_at = c;
> -		spin_unlock(&c->lock);
>  		if (ret < 0) {
>  			*limit_fail_at = c;
>  			goto undo;
> @@ -68,9 +76,7 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
>  	goto done;
>  undo:
>  	for (u = counter; u != c; u = u->parent) {
> -		spin_lock(&u->lock);
>  		res_counter_uncharge_locked(u, val);
> -		spin_unlock(&u->lock);
>  	}
>  done:
>  	local_irq_restore(flags);
> @@ -79,10 +85,13 @@ done:
>  
>  void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
>  {
> -	if (WARN_ON(counter->usage < val))
> -		val = counter->usage;
> +	unsigned long long usage;
> +
> +	usage = percpu_counter_read_positive(&counter->usage);
> +	if (WARN_ON((usage + counter->usage_tolerance * nr_cpu_ids) < val))
> +		val = usage;
>  
> -	counter->usage -= val;
> +	percpu_counter_sub(&counter->usage, val);
>  }
>  
>  void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> @@ -93,12 +102,10 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val,
>  
>  	local_irq_save(flags);
>  	for (c = counter; c != NULL; c = c->parent) {
> -		spin_lock(&c->lock);
>  		if (was_soft_limit_excess)
>  			*was_soft_limit_excess =
>  				!res_counter_soft_limit_check_locked(c);
>  		res_counter_uncharge_locked(c, val);
> -		spin_unlock(&c->lock);
>  	}
>  	local_irq_restore(flags);
>  }
> @@ -108,8 +115,6 @@ static inline unsigned long long *
>  res_counter_member(struct res_counter *counter, int member)
>  {
>  	switch (member) {
> -	case RES_USAGE:
> -		return &counter->usage;
>  	case RES_MAX_USAGE:
>  		return &counter->max_usage;
>  	case RES_LIMIT:
> @@ -128,11 +133,15 @@ ssize_t res_counter_read(struct res_counter *counter, int member,
>  		const char __user *userbuf, size_t nbytes, loff_t *pos,
>  		int (*read_strategy)(unsigned long long val, char *st_buf))
>  {
> -	unsigned long long *val;
> +	unsigned long long *val, usage_val;
>  	char buf[64], *s;
>  
>  	s = buf;
> -	val = res_counter_member(counter, member);
> +	if (member == RES_USAGE) {
> +		usage_val = percpu_counter_read_positive(&counter->usage);
> +		val = &usage_val;
> +	} else
> +		val = res_counter_member(counter, member);
>  	if (read_strategy)
>  		s += read_strategy(*val, s);
>  	else
> @@ -143,7 +152,10 @@ ssize_t res_counter_read(struct res_counter *counter, int member,
>  
>  u64 res_counter_read_u64(struct res_counter *counter, int member)
>  {
> -	return *res_counter_member(counter, member);
> +	if (member == RES_USAGE)
> +		return percpu_counter_read_positive(&counter->usage);
> +	else
> +		return *res_counter_member(counter, member);
>  }
>  
>  int res_counter_memparse_write_strategy(const char *buf,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 48a38e1..36d46aa 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -58,6 +58,20 @@ static DEFINE_MUTEX(memcg_tasklist);	/* can be hold under cgroup_mutex */
>  #define SOFTLIMIT_EVENTS_THRESH (1000)
>  
>  /*
> + * To help resource counters scale, we take a step back
> + * and allow the counters to be scalable and set a
> + * batch value such that every addition does not cause
> + * global synchronization. The side-effect will be visible
> + * on limit enforcement, where due to this fuzziness,
> + * we will lose out on inforcing a limit when the usage
> + * exceeds the limit. The plan however in the long run
> + * is to allow this value to be controlled. We will
> + * probably add a new control file for it. This will be
> + * moved to a config option later.
> + */
> +#define MEM_CGROUP_RES_ERR_TOLERANCE (64 * 1024)
> +
> +/*
>   * Statistics for memory cgroup.
>   */
>  enum mem_cgroup_stat_index {
> @@ -2340,7 +2354,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *mem, bool free_all)
>  	if (free_all)
>  		goto try_to_free;
>  move_account:
> -	while (mem->res.usage > 0) {
> +	while (res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
>  		ret = -EBUSY;
>  		if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
>  			goto out;
> @@ -2383,7 +2397,7 @@ try_to_free:
>  	lru_add_drain_all();
>  	/* try to free all pages in this cgroup */
>  	shrink = 1;
> -	while (nr_retries && mem->res.usage > 0) {
> +	while (nr_retries && res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
>  		int progress;
>  
>  		if (signal_pending(current)) {
> @@ -2401,7 +2415,7 @@ try_to_free:
>  	}
>  	lru_add_drain();
>  	/* try move_account...there may be some *locked* pages. */
> -	if (mem->res.usage)
> +	if (res_counter_read_u64(&mem->res, RES_USAGE))
>  		goto move_account;
>  	ret = 0;
>  	goto out;
> @@ -3019,8 +3033,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	}
>  
>  	if (parent && parent->use_hierarchy) {
> -		res_counter_init(&mem->res, &parent->res);
> -		res_counter_init(&mem->memsw, &parent->memsw);
> +		res_counter_init(&mem->res, &parent->res,
> +			MEM_CGROUP_RES_ERR_TOLERANCE);
> +		res_counter_init(&mem->memsw, &parent->memsw,
> +			MEM_CGROUP_RES_ERR_TOLERANCE);
>  		/*
>  		 * We increment refcnt of the parent to ensure that we can
>  		 * safely access it on res_counter_charge/uncharge.
> @@ -3029,8 +3045,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  		 */
>  		mem_cgroup_get(parent);
>  	} else {
> -		res_counter_init(&mem->res, NULL);
> -		res_counter_init(&mem->memsw, NULL);
> +		res_counter_init(&mem->res, NULL,
> +			MEM_CGROUP_RES_ERR_TOLERANCE);
> +		res_counter_init(&mem->memsw, NULL,
> +			MEM_CGROUP_RES_ERR_TOLERANCE);
>  	}
>  	mem->last_scanned_child = 0;
>  	spin_lock_init(&mem->reclaim_param_lock);
> 
> -- 
> 	Balbir
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/