Message-ID: <20090809121530.GA5833@balbir.in.ibm.com>
Date: Sun, 9 Aug 2009 17:45:30 +0530
From: Balbir Singh <balbir@...ux.vnet.ibm.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
Cc: Andrew Morton <akpm@...ux-foundation.org>, andi.kleen@...el.com,
Prarit Bhargava <prarit@...hat.com>,
KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
"lizf@...fujitsu.com" <lizf@...fujitsu.com>,
"menage@...gle.com" <menage@...gle.com>,
Pavel Emelianov <xemul@...nvz.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Help Resource Counters Scale Better (v3)
Hi,
Thanks for the detailed review; here is v3 of the patch against
the mmotm of 6th August. I've documented the TODOs as well. If there are
no major objections, I would like this to be included in mmotm
for more testing. Any test reports on a large machine would be highly
appreciated.
From: Balbir Singh <balbir@...ux.vnet.ibm.com>
Changelog v2->v3
1. Added more documentation and comments
2. Made the check in mem_cgroup_set_limit strict
3. Increased tolerance per cpu to 64KB.
4. Still have the WARN_ON(); I've kept it for debugging
purposes. Maybe we should make it conditional on
DEBUG_VM
Changelog v1->v2
1. Updated Documentation (cgroups.txt and resource_counters.txt)
2. Added the notion of tolerance to resource counter initialization
Enhancement: For scalability, move the resource counter usage field to a percpu counter
This patch changes the usage field of a resource counter to a percpu
counter. The counter is incremented with local irqs disabled. The other
fields are still protected by the spinlock for writes.
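To make the batching behaviour concrete, here is a minimal userspace
sketch (not kernel code: the two-cpu model, the names and the fold
logic are simplified stand-ins for what percpu_counter does, not its
actual implementation) of why a cheap read of such a counter can lag
the true total:

/*
 * Each "cpu" keeps a local delta and only folds it into the shared
 * count once the delta reaches the batch, so a cheap read of the
 * shared count can lag the true total by up to NR_CPUS * BATCH.
 */
#include <stdio.h>

#define NR_CPUS	2
#define BATCH	(64 * 1024)	/* per-cpu tolerance, as in this patch */

static long long shared_count;		/* the "global" usage */
static long long local_delta[NR_CPUS];	/* per-cpu unsynced charges */

static void charge(int cpu, long amount)
{
	local_delta[cpu] += amount;
	if (local_delta[cpu] >= BATCH) {	/* fold into the shared count */
		shared_count += local_delta[cpu];
		local_delta[cpu] = 0;
	}
}

static long long approximate_read(void)
{
	return shared_count;			/* cheap, possibly stale */
}

static long long exact_read(void)
{
	long long sum = shared_count;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++)	/* expensive: touches every cpu */
		sum += local_delta[cpu];
	return sum;
}

int main(void)
{
	charge(0, 40 * 1024);
	charge(1, 40 * 1024);
	printf("approximate: %lld, exact: %lld\n",
	       approximate_read(), exact_read());
	return 0;
}

With the two 40KB charges above, approximate_read() still reports 0
while exact_read() reports 80KB; percpu_counter_read_positive() versus
percpu_counter_sum_positive() in the patch below make the same tradeoff.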
This patch adds a fuzziness factor to the hard limit: since the value we
read can be off from the true value (by up to the batch value), this can
later be addressed by adding a strict/non-strict check. The intention is
to turn off strict checking for root (since we can't set limits on
it anyway).
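As an illustrative example (the cpu count here is made up): with the
64KB per-cpu tolerance used further down and a machine with 16 cpus, a
read of the usage can lag the true value by up to 16 * 64KB = 1MB, so a
charge can transiently push the usage past the hard limit by roughly
that much before the check notices.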
I tested this patch on my x86_64 box with a regular test for hard
limits and a page fault program.
Signed-off-by: Balbir Singh <balbir@...ux.vnet.ibm.com>
---
Documentation/cgroups/memory.txt | 22 ++++++++++++
Documentation/cgroups/resource_counter.txt | 20 +++++++++--
include/linux/res_counter.h | 52 ++++++++++++++++++----------
kernel/res_counter.c | 50 +++++++++++++++++----------
mm/memcontrol.c | 32 +++++++++++++----
5 files changed, 128 insertions(+), 48 deletions(-)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index b871f25..a24dab7 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -13,6 +13,9 @@ c. Provides *zero overhead* for non memory controller users
d. Provides a double LRU: global memory pressure causes reclaim from the
global LRU; a cgroup on hitting a limit, reclaims from the per
cgroup LRU
+ NOTE: One can no longer rely on the exact limit. Since we've moved
+ to using percpu_counters for resource counters, there is always going
+ to be a fuzziness factor depending on the batch value.
Benefits and Purpose of the memory controller
@@ -422,6 +425,25 @@ NOTE2: It is recommended to set the soft limit always below the hard limit,
4. Start reclamation in the background when the limit is
not yet hit but the usage is getting closer
+9. Scalability Tradeoff
+
+As documented in Documentation/cgroups/resource_counter.txt, we've
+moved over to percpu counters for accounting usage. What does this
+mean for the end user?
+
+1. It means better performance
+2. It means that the usage reported does not necessarily reflect
+ reality. Because percpu counters sync only every so often (see
+ batch value in the code), the value reported might be off the
+ real value by an amount proportional to the specified tolerance.
+ The tolerance value is currently stored internally.
+
+TODOs
+
+1. Move tolerance to a config option
+2. Add support for strict/non-strict accounting
+
+
Summary
Overall, the memory controller has been a stable controller and has been
diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
index 95b24d7..a43ea60 100644
--- a/Documentation/cgroups/resource_counter.txt
+++ b/Documentation/cgroups/resource_counter.txt
@@ -12,12 +12,15 @@ to work with it.
1. Crucial parts of the res_counter structure
- a. unsigned long long usage
+ a. percpu_counter usage
The usage value shows the amount of a resource that is consumed
by a group at a given time. The units of measurement should be
determined by the controller that uses this counter. E.g. it can
be bytes, items or any other unit the controller operates on.
+ NOTE: being a percpu_counter, reading the exact value at all
+ times does not scale, and a scalable read makes the
+ value a little unreliable :)
b. unsigned long long max_usage
@@ -39,16 +42,24 @@ to work with it.
The failcnt stands for "failures counter". This is the number of
resource allocation attempts that failed.
- c. spinlock_t lock
+ e. spinlock_t lock
Protects changes of the above values.
+ f. unsigned long tolerance
+
+ This value is used to keep track of the amount of error that might
+ be tolerated by the resource counter. See the NOTE in (a) above.
+ The tolerance value is per cpu, hence the total error at any time
+ can be up to nr_cpu_ids * tolerance.
+
2. Basic accounting routines
a. void res_counter_init(struct res_counter *rc,
- struct res_counter *rc_parent)
+ struct res_counter *rc_parent,
+ unsigned long tolerance)
Initializes the resource counter. As usual, should be the first
routine called for a new counter.
@@ -57,6 +68,9 @@ to work with it.
child -> parent relationship directly in the res_counter structure,
NULL can be used to define no relationship.
+ The tolerance is used to control the batching behaviour of percpu
+ counters. Please see details in Section 1, item f above.
+
c. int res_counter_charge(struct res_counter *rc, unsigned long val,
struct res_counter **limit_fail_at)
diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
index 731af71..3728c0d 100644
--- a/include/linux/res_counter.h
+++ b/include/linux/res_counter.h
@@ -14,6 +14,7 @@
*/
#include <linux/cgroup.h>
+#include <linux/percpu_counter.h>
/*
* The core object. the cgroup that wishes to account for some
@@ -23,10 +24,6 @@
struct res_counter {
/*
- * the current resource consumption level
- */
- unsigned long long usage;
- /*
* the maximal value of the usage from the counter creation
*/
unsigned long long max_usage;
@@ -48,6 +45,14 @@ struct res_counter {
*/
spinlock_t lock;
/*
+ * the current resource consumption level
+ */
+ struct percpu_counter usage;
+ /*
+ * Tolerance for the percpu_counter (usage) above
+ */
+ unsigned long usage_tolerance;
+ /*
* Parent counter, used for hierarchial resource accounting
*/
struct res_counter *parent;
@@ -98,7 +103,8 @@ enum {
* helpers for accounting
*/
-void res_counter_init(struct res_counter *counter, struct res_counter *parent);
+void res_counter_init(struct res_counter *counter, struct res_counter *parent,
+ unsigned long usage_tolerance);
/*
* charge - try to consume more resource.
@@ -133,7 +139,8 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val,
static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
{
- if (cnt->usage < cnt->limit)
+ unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
+ if (usage < cnt->limit)
return true;
return false;
@@ -141,7 +148,8 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
{
- if (cnt->usage < cnt->soft_limit)
+ unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
+ if (usage < cnt->soft_limit)
return true;
return false;
@@ -157,15 +165,19 @@ static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
static inline unsigned long long
res_counter_soft_limit_excess(struct res_counter *cnt)
{
- unsigned long long excess;
- unsigned long flags;
+ unsigned long long excess, usage;
- spin_lock_irqsave(&cnt->lock, flags);
- if (cnt->usage <= cnt->soft_limit)
+ usage = percpu_counter_read_positive(&cnt->usage);
+ /*
+ * Not all callers call with irqs disabled; make
+ * sure we read out something sensible.
+ */
+ preempt_disable();
+ if (usage <= cnt->soft_limit)
excess = 0;
else
- excess = cnt->usage - cnt->soft_limit;
- spin_unlock_irqrestore(&cnt->lock, flags);
+ excess = usage - cnt->soft_limit;
+ preempt_enable();
return excess;
}
@@ -178,9 +190,9 @@ static inline bool res_counter_check_under_limit(struct res_counter *cnt)
bool ret;
unsigned long flags;
- spin_lock_irqsave(&cnt->lock, flags);
+ local_irq_save(flags);
ret = res_counter_limit_check_locked(cnt);
- spin_unlock_irqrestore(&cnt->lock, flags);
+ local_irq_restore(flags);
return ret;
}
@@ -189,18 +201,19 @@ static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
bool ret;
unsigned long flags;
- spin_lock_irqsave(&cnt->lock, flags);
+ local_irq_save(flags);
ret = res_counter_soft_limit_check_locked(cnt);
- spin_unlock_irqrestore(&cnt->lock, flags);
+ local_irq_restore(flags);
return ret;
}
static inline void res_counter_reset_max(struct res_counter *cnt)
{
unsigned long flags;
+ unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
spin_lock_irqsave(&cnt->lock, flags);
- cnt->max_usage = cnt->usage;
+ cnt->max_usage = usage;
spin_unlock_irqrestore(&cnt->lock, flags);
}
@@ -217,10 +230,11 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
unsigned long long limit)
{
unsigned long flags;
+ unsigned long long usage = percpu_counter_sum_positive(&cnt->usage);
int ret = -EBUSY;
spin_lock_irqsave(&cnt->lock, flags);
- if (cnt->usage <= limit) {
+ if (usage <= limit) {
cnt->limit = limit;
ret = 0;
}
diff --git a/kernel/res_counter.c b/kernel/res_counter.c
index 88faec2..ae83168 100644
--- a/kernel/res_counter.c
+++ b/kernel/res_counter.c
@@ -15,24 +15,34 @@
#include <linux/uaccess.h>
#include <linux/mm.h>
-void res_counter_init(struct res_counter *counter, struct res_counter *parent)
+void res_counter_init(struct res_counter *counter, struct res_counter *parent,
+ unsigned long usage_tolerance)
{
spin_lock_init(&counter->lock);
+ percpu_counter_init(&counter->usage, 0);
counter->limit = RESOURCE_MAX;
counter->soft_limit = RESOURCE_MAX;
counter->parent = parent;
+ counter->usage_tolerance = usage_tolerance;
}
int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
{
- if (counter->usage + val > counter->limit) {
+ unsigned long long usage;
+
+ usage = percpu_counter_read_positive(&counter->usage);
+ if (usage + val > counter->limit) {
counter->failcnt++;
return -ENOMEM;
}
- counter->usage += val;
- if (counter->usage > counter->max_usage)
- counter->max_usage = counter->usage;
+ __percpu_counter_add(&counter->usage, val, nr_cpu_ids *
+ counter->usage_tolerance);
+ if (usage + val > counter->max_usage) {
+ spin_lock(&counter->lock);
+ counter->max_usage = (usage + val);
+ spin_unlock(&counter->lock);
+ }
return 0;
}
@@ -49,7 +59,6 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
*soft_limit_fail_at = NULL;
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
- spin_lock(&c->lock);
ret = res_counter_charge_locked(c, val);
/*
* With soft limits, we return the highest ancestor
@@ -58,7 +67,6 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
if (soft_limit_fail_at &&
!res_counter_soft_limit_check_locked(c))
*soft_limit_fail_at = c;
- spin_unlock(&c->lock);
if (ret < 0) {
*limit_fail_at = c;
goto undo;
@@ -68,9 +76,7 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
goto done;
undo:
for (u = counter; u != c; u = u->parent) {
- spin_lock(&u->lock);
res_counter_uncharge_locked(u, val);
- spin_unlock(&u->lock);
}
done:
local_irq_restore(flags);
@@ -79,10 +85,13 @@ done:
void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
{
- if (WARN_ON(counter->usage < val))
- val = counter->usage;
+ unsigned long long usage;
+
+ usage = percpu_counter_read_positive(&counter->usage);
+ if (WARN_ON((usage + counter->usage_tolerance * nr_cpu_ids) < val))
+ val = usage;
- counter->usage -= val;
+ percpu_counter_sub(&counter->usage, val);
}
void res_counter_uncharge(struct res_counter *counter, unsigned long val,
@@ -93,12 +102,10 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val,
local_irq_save(flags);
for (c = counter; c != NULL; c = c->parent) {
- spin_lock(&c->lock);
if (was_soft_limit_excess)
*was_soft_limit_excess =
!res_counter_soft_limit_check_locked(c);
res_counter_uncharge_locked(c, val);
- spin_unlock(&c->lock);
}
local_irq_restore(flags);
}
@@ -108,8 +115,6 @@ static inline unsigned long long *
res_counter_member(struct res_counter *counter, int member)
{
switch (member) {
- case RES_USAGE:
- return &counter->usage;
case RES_MAX_USAGE:
return &counter->max_usage;
case RES_LIMIT:
@@ -128,11 +133,15 @@ ssize_t res_counter_read(struct res_counter *counter, int member,
const char __user *userbuf, size_t nbytes, loff_t *pos,
int (*read_strategy)(unsigned long long val, char *st_buf))
{
- unsigned long long *val;
+ unsigned long long *val, usage_val;
char buf[64], *s;
s = buf;
- val = res_counter_member(counter, member);
+ if (member == RES_USAGE) {
+ usage_val = percpu_counter_read_positive(&counter->usage);
+ val = &usage_val;
+ } else
+ val = res_counter_member(counter, member);
if (read_strategy)
s += read_strategy(*val, s);
else
@@ -143,7 +152,10 @@ ssize_t res_counter_read(struct res_counter *counter, int member,
u64 res_counter_read_u64(struct res_counter *counter, int member)
{
- return *res_counter_member(counter, member);
+ if (member == RES_USAGE)
+ return percpu_counter_read_positive(&counter->usage);
+ else
+ return *res_counter_member(counter, member);
}
int res_counter_memparse_write_strategy(const char *buf,
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 48a38e1..36d46aa 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -58,6 +58,20 @@ static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */
#define SOFTLIMIT_EVENTS_THRESH (1000)
/*
+ * To help resource counters scale, we trade a little
+ * accuracy for scalability: we set a per-cpu
+ * batch value such that every addition does not cause
+ * global synchronization. The side-effect will be visible
+ * on limit enforcement, where due to this fuzziness,
+ * we will lose out on enforcing a limit when the usage
+ * exceeds the limit. The plan however in the long run
+ * is to allow this value to be controlled. We will
+ * probably add a new control file for it. This will be
+ * moved to a config option later.
+ */
+#define MEM_CGROUP_RES_ERR_TOLERANCE (64 * 1024)
+
+/*
* Statistics for memory cgroup.
*/
enum mem_cgroup_stat_index {
@@ -2340,7 +2354,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *mem, bool free_all)
if (free_all)
goto try_to_free;
move_account:
- while (mem->res.usage > 0) {
+ while (res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
ret = -EBUSY;
if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
goto out;
@@ -2383,7 +2397,7 @@ try_to_free:
lru_add_drain_all();
/* try to free all pages in this cgroup */
shrink = 1;
- while (nr_retries && mem->res.usage > 0) {
+ while (nr_retries && res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
int progress;
if (signal_pending(current)) {
@@ -2401,7 +2415,7 @@ try_to_free:
}
lru_add_drain();
/* try move_account...there may be some *locked* pages. */
- if (mem->res.usage)
+ if (res_counter_read_u64(&mem->res, RES_USAGE))
goto move_account;
ret = 0;
goto out;
@@ -3019,8 +3033,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
}
if (parent && parent->use_hierarchy) {
- res_counter_init(&mem->res, &parent->res);
- res_counter_init(&mem->memsw, &parent->memsw);
+ res_counter_init(&mem->res, &parent->res,
+ MEM_CGROUP_RES_ERR_TOLERANCE);
+ res_counter_init(&mem->memsw, &parent->memsw,
+ MEM_CGROUP_RES_ERR_TOLERANCE);
/*
* We increment refcnt of the parent to ensure that we can
* safely access it on res_counter_charge/uncharge.
@@ -3029,8 +3045,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
*/
mem_cgroup_get(parent);
} else {
- res_counter_init(&mem->res, NULL);
- res_counter_init(&mem->memsw, NULL);
+ res_counter_init(&mem->res, NULL,
+ MEM_CGROUP_RES_ERR_TOLERANCE);
+ res_counter_init(&mem->memsw, NULL,
+ MEM_CGROUP_RES_ERR_TOLERANCE);
}
mem->last_scanned_child = 0;
spin_lock_init(&mem->reclaim_param_lock);
--
Balbir