[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20090810093229.10db7185.kamezawa.hiroyu@jp.fujitsu.com>
Date: Mon, 10 Aug 2009 09:32:29 +0900
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
To: balbir@...ux.vnet.ibm.com
Cc: Andrew Morton <akpm@...ux-foundation.org>, andi.kleen@...el.com,
Prarit Bhargava <prarit@...hat.com>,
KOSAKI Motohiro <kosaki.motohiro@...fujitsu.com>,
"lizf@...fujitsu.com" <lizf@...fujitsu.com>,
"menage@...gle.com" <menage@...gle.com>,
Pavel Emelianov <xemul@...nvz.org>,
"linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
"linux-mm@...ck.org" <linux-mm@...ck.org>
Subject: Re: Help Resource Counters Scale Better (v3)
On Sun, 9 Aug 2009 17:45:30 +0530
Balbir Singh <balbir@...ux.vnet.ibm.com> wrote:
> Hi,
>
> Thanks for the detailed review, here is v3 of the patches against
> mmotm 6th August. I've documented the TODOs as well. If there are
> no major objections, I would like this to be included in mmotm
> for more testing. Any test reports on a large machine would be highly
> appreciated.
>
> From: Balbir Singh <balbir@...ux.vnet.ibm.com>
>
> Changelog v3->v2
>
> 1. Added more documentation and comments
> 2. Made the check in mem_cgroup_set_limit strict
> 3. Increased tolerance per cpu to 64KB.
> 4. Still have the WARN_ON(), I've kept it for debugging
> purposes, may be we should make it a conditional with
> DEBUG_VM
>
Because I'll be absent for a while, I don't give any Reviewed-by or Acked-by, now.
Before leaving, I'd like to write some concerns here.
1. you use res_counter_read_positive() in force_empty. It seems force_empty can
go into infinite loop. plz check. (especially when some pages are freed or swapped-in
in other cpu while force_empry runs.)
2. In near future, we'll see 256 or 1024 cpus on a system, anyway.
Assume 1024cpu system, 64k*1024=64M is a tolerance.
Can't we calculate max-tolerane as following ?
tolerance = min(64k * num_online_cpus(), limit_in_bytes/100);
tolerance /= num_online_cpus();
per_cpu_tolerance = min(16k, tolelance);
I think automatic runtine adjusting of tolerance will be finally necessary,
but above will not be very bad because we can guarantee 1% tolerance.
Thx,
-Kame
> Changelog v2->v1
>
> 1. Updated Documentation (cgroups.txt and resource_counters.txt)
> 2. Added the notion of tolerance to resource counter initialization
>
> Enhancement: For scalability move the resource counter to a percpu counter
>
> This patch changes the usage field of a resource counter to a percpu
> counter. The counter is incremented with local irq disabled. The other
> fields are still protected by the spin lock for write.
>
> This patch adds a fuzziness factor to hard limit, since the value we read
> could be off the original value (by batch value), this can be fixed
> by adding a strict/non-strict functionality check. The intention is
> to turn of strict checking for root (since we can't set limits on
> it anyway).
>
> I tested this patch on my x86_64 box with a regular test for hard
> limits and a page fault program.
>
> Signed-off-by: Balbir Singh <balbir@...ux.vnet.ibm.com>
> ---
>
> Documentation/cgroups/memory.txt | 22 ++++++++++++
> Documentation/cgroups/resource_counter.txt | 20 +++++++++--
> include/linux/res_counter.h | 52 ++++++++++++++++++----------
> kernel/res_counter.c | 50 +++++++++++++++++----------
> mm/memcontrol.c | 32 +++++++++++++----
> 5 files changed, 128 insertions(+), 48 deletions(-)
>
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index b871f25..a24dab7 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -13,6 +13,9 @@ c. Provides *zero overhead* for non memory controller users
> d. Provides a double LRU: global memory pressure causes reclaim from the
> global LRU; a cgroup on hitting a limit, reclaims from the per
> cgroup LRU
> + NOTE: One can no longer rely on the exact limit. Since we've moved
> + to using percpu_counters for resource counters, there is always going
> + to be a fuzziness factor depending on the batch value.
>
> Benefits and Purpose of the memory controller
>
> @@ -422,6 +425,25 @@ NOTE2: It is recommended to set the soft limit always below the hard limit,
> 4. Start reclamation in the background when the limit is
> not yet hit but the usage is getting closer
>
> +9. Scalability Tradeoff
> +
> +As documented in Documentation/cgroups/resource_counter.txt, we've
> +moved over to percpu counters for accounting usage. What does this
> +mean for the end user?
> +
> +1. It means better performance
> +2. It means that the usage reported does not necessarily reflect
> + realty. Because percpu counters do a sync only so often (see
> + batch value in the code), the value reported might be off the
> + real value by an amount proportional to the specified tolerenace.
> + The tolerance value is currently stored internally.
> +
> +TODOs
> +
> +1. Move tolerance to a config option
> +2. Add support for strict/non-strict accounting
> +
> +
> Summary
>
> Overall, the memory controller has been a stable controller and has been
> diff --git a/Documentation/cgroups/resource_counter.txt b/Documentation/cgroups/resource_counter.txt
> index 95b24d7..a43ea60 100644
> --- a/Documentation/cgroups/resource_counter.txt
> +++ b/Documentation/cgroups/resource_counter.txt
> @@ -12,12 +12,15 @@ to work with it.
>
> 1. Crucial parts of the res_counter structure
>
> - a. unsigned long long usage
> + a. percpu_counter usage
>
> The usage value shows the amount of a resource that is consumed
> by a group at a given time. The units of measurement should be
> determined by the controller that uses this counter. E.g. it can
> be bytes, items or any other unit the controller operates on.
> + NOTE: being a percpu_counter, the way to read the correct value
> + at all times makes it unscalable and reading it scalably makes
> + the value a little unreliable :)
>
> b. unsigned long long max_usage
>
> @@ -39,16 +42,24 @@ to work with it.
> The failcnt stands for "failures counter". This is the number of
> resource allocation attempts that failed.
>
> - c. spinlock_t lock
> + e. spinlock_t lock
>
> Protects changes of the above values.
>
> + f. unsigned long tolerance
> +
> + This value is used to keep track of the amount of error that might
> + be tolerated by the resource counter. See the NOTE in (a) above.
> + The tolerance value is per cpu, hence the total error at any time
> + can be nr_cpu_ids * tolerance.
> +
>
>
> 2. Basic accounting routines
>
> a. void res_counter_init(struct res_counter *rc,
> - struct res_counter *rc_parent)
> + struct res_counter *rc_parent,
> + unsigned long tolerance)
>
> Initializes the resource counter. As usual, should be the first
> routine called for a new counter.
> @@ -57,6 +68,9 @@ to work with it.
> child -> parent relationship directly in the res_counter structure,
> NULL can be used to define no relationship.
>
> + The tolerance is used to control the batching behaviour of percpu
> + counters. Please see details in Section 1, item f above.
> +
> c. int res_counter_charge(struct res_counter *rc, unsigned long val,
> struct res_counter **limit_fail_at)
>
> diff --git a/include/linux/res_counter.h b/include/linux/res_counter.h
> index 731af71..3728c0d 100644
> --- a/include/linux/res_counter.h
> +++ b/include/linux/res_counter.h
> @@ -14,6 +14,7 @@
> */
>
> #include <linux/cgroup.h>
> +#include <linux/percpu_counter.h>
>
> /*
> * The core object. the cgroup that wishes to account for some
> @@ -23,10 +24,6 @@
>
> struct res_counter {
> /*
> - * the current resource consumption level
> - */
> - unsigned long long usage;
> - /*
> * the maximal value of the usage from the counter creation
> */
> unsigned long long max_usage;
> @@ -48,6 +45,14 @@ struct res_counter {
> */
> spinlock_t lock;
> /*
> + * the current resource consumption level
> + */
> + struct percpu_counter usage;
> + /*
> + * Tolerance for the percpu_counter (usage) above
> + */
> + unsigned long usage_tolerance;
> + /*
> * Parent counter, used for hierarchial resource accounting
> */
> struct res_counter *parent;
> @@ -98,7 +103,8 @@ enum {
> * helpers for accounting
> */
>
> -void res_counter_init(struct res_counter *counter, struct res_counter *parent);
> +void res_counter_init(struct res_counter *counter, struct res_counter *parent,
> + unsigned long usage_tolerance);
>
> /*
> * charge - try to consume more resource.
> @@ -133,7 +139,8 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val,
>
> static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
> {
> - if (cnt->usage < cnt->limit)
> + unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
> + if (usage < cnt->limit)
> return true;
>
> return false;
> @@ -141,7 +148,8 @@ static inline bool res_counter_limit_check_locked(struct res_counter *cnt)
>
> static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
> {
> - if (cnt->usage < cnt->soft_limit)
> + unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
> + if (usage < cnt->soft_limit)
> return true;
>
> return false;
> @@ -157,15 +165,19 @@ static inline bool res_counter_soft_limit_check_locked(struct res_counter *cnt)
> static inline unsigned long long
> res_counter_soft_limit_excess(struct res_counter *cnt)
> {
> - unsigned long long excess;
> - unsigned long flags;
> + unsigned long long excess, usage;
>
> - spin_lock_irqsave(&cnt->lock, flags);
> - if (cnt->usage <= cnt->soft_limit)
> + usage = percpu_counter_read_positive(&cnt->usage);
> + /*
> + * Not all callers call with irq's disabled, make
> + * sure we read out something sensible.
> + */
> + preempt_disable();
> + if (usage <= cnt->soft_limit)
> excess = 0;
> else
> - excess = cnt->usage - cnt->soft_limit;
> - spin_unlock_irqrestore(&cnt->lock, flags);
> + excess = usage - cnt->soft_limit;
> + preempt_enable();
> return excess;
> }
>
> @@ -178,9 +190,9 @@ static inline bool res_counter_check_under_limit(struct res_counter *cnt)
> bool ret;
> unsigned long flags;
>
> - spin_lock_irqsave(&cnt->lock, flags);
> + local_irq_save(flags);
> ret = res_counter_limit_check_locked(cnt);
> - spin_unlock_irqrestore(&cnt->lock, flags);
> + local_irq_restore(flags);
> return ret;
> }
>
> @@ -189,18 +201,19 @@ static inline bool res_counter_check_under_soft_limit(struct res_counter *cnt)
> bool ret;
> unsigned long flags;
>
> - spin_lock_irqsave(&cnt->lock, flags);
> + local_irq_save(flags);
> ret = res_counter_soft_limit_check_locked(cnt);
> - spin_unlock_irqrestore(&cnt->lock, flags);
> + local_irq_restore(flags);
> return ret;
> }
>
> static inline void res_counter_reset_max(struct res_counter *cnt)
> {
> unsigned long flags;
> + unsigned long long usage = percpu_counter_read_positive(&cnt->usage);
>
> spin_lock_irqsave(&cnt->lock, flags);
> - cnt->max_usage = cnt->usage;
> + cnt->max_usage = usage;
> spin_unlock_irqrestore(&cnt->lock, flags);
> }
>
> @@ -217,10 +230,11 @@ static inline int res_counter_set_limit(struct res_counter *cnt,
> unsigned long long limit)
> {
> unsigned long flags;
> + unsigned long long usage = percpu_counter_sum_positive(&cnt->usage);
> int ret = -EBUSY;
>
> spin_lock_irqsave(&cnt->lock, flags);
> - if (cnt->usage <= limit) {
> + if (usage <= limit) {
> cnt->limit = limit;
> ret = 0;
> }
> diff --git a/kernel/res_counter.c b/kernel/res_counter.c
> index 88faec2..ae83168 100644
> --- a/kernel/res_counter.c
> +++ b/kernel/res_counter.c
> @@ -15,24 +15,34 @@
> #include <linux/uaccess.h>
> #include <linux/mm.h>
>
> -void res_counter_init(struct res_counter *counter, struct res_counter *parent)
> +void res_counter_init(struct res_counter *counter, struct res_counter *parent,
> + unsigned long usage_tolerance)
> {
> spin_lock_init(&counter->lock);
> + percpu_counter_init(&counter->usage, 0);
> counter->limit = RESOURCE_MAX;
> counter->soft_limit = RESOURCE_MAX;
> counter->parent = parent;
> + counter->usage_tolerance = usage_tolerance;
> }
>
> int res_counter_charge_locked(struct res_counter *counter, unsigned long val)
> {
> - if (counter->usage + val > counter->limit) {
> + unsigned long long usage;
> +
> + usage = percpu_counter_read_positive(&counter->usage);
> + if (usage + val > counter->limit) {
> counter->failcnt++;
> return -ENOMEM;
> }
>
> - counter->usage += val;
> - if (counter->usage > counter->max_usage)
> - counter->max_usage = counter->usage;
> + __percpu_counter_add(&counter->usage, val, nr_cpu_ids *
> + counter->usage_tolerance);
> + if (usage + val > counter->max_usage) {
> + spin_lock(&counter->lock);
> + counter->max_usage = (usage + val);
> + spin_unlock(&counter->lock);
> + }
> return 0;
> }
>
> @@ -49,7 +59,6 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
> *soft_limit_fail_at = NULL;
> local_irq_save(flags);
> for (c = counter; c != NULL; c = c->parent) {
> - spin_lock(&c->lock);
> ret = res_counter_charge_locked(c, val);
> /*
> * With soft limits, we return the highest ancestor
> @@ -58,7 +67,6 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
> if (soft_limit_fail_at &&
> !res_counter_soft_limit_check_locked(c))
> *soft_limit_fail_at = c;
> - spin_unlock(&c->lock);
> if (ret < 0) {
> *limit_fail_at = c;
> goto undo;
> @@ -68,9 +76,7 @@ int res_counter_charge(struct res_counter *counter, unsigned long val,
> goto done;
> undo:
> for (u = counter; u != c; u = u->parent) {
> - spin_lock(&u->lock);
> res_counter_uncharge_locked(u, val);
> - spin_unlock(&u->lock);
> }
> done:
> local_irq_restore(flags);
> @@ -79,10 +85,13 @@ done:
>
> void res_counter_uncharge_locked(struct res_counter *counter, unsigned long val)
> {
> - if (WARN_ON(counter->usage < val))
> - val = counter->usage;
> + unsigned long long usage;
> +
> + usage = percpu_counter_read_positive(&counter->usage);
> + if (WARN_ON((usage + counter->usage_tolerance * nr_cpu_ids) < val))
> + val = usage;
>
> - counter->usage -= val;
> + percpu_counter_sub(&counter->usage, val);
> }
>
> void res_counter_uncharge(struct res_counter *counter, unsigned long val,
> @@ -93,12 +102,10 @@ void res_counter_uncharge(struct res_counter *counter, unsigned long val,
>
> local_irq_save(flags);
> for (c = counter; c != NULL; c = c->parent) {
> - spin_lock(&c->lock);
> if (was_soft_limit_excess)
> *was_soft_limit_excess =
> !res_counter_soft_limit_check_locked(c);
> res_counter_uncharge_locked(c, val);
> - spin_unlock(&c->lock);
> }
> local_irq_restore(flags);
> }
> @@ -108,8 +115,6 @@ static inline unsigned long long *
> res_counter_member(struct res_counter *counter, int member)
> {
> switch (member) {
> - case RES_USAGE:
> - return &counter->usage;
> case RES_MAX_USAGE:
> return &counter->max_usage;
> case RES_LIMIT:
> @@ -128,11 +133,15 @@ ssize_t res_counter_read(struct res_counter *counter, int member,
> const char __user *userbuf, size_t nbytes, loff_t *pos,
> int (*read_strategy)(unsigned long long val, char *st_buf))
> {
> - unsigned long long *val;
> + unsigned long long *val, usage_val;
> char buf[64], *s;
>
> s = buf;
> - val = res_counter_member(counter, member);
> + if (member == RES_USAGE) {
> + usage_val = percpu_counter_read_positive(&counter->usage);
> + val = &usage_val;
> + } else
> + val = res_counter_member(counter, member);
> if (read_strategy)
> s += read_strategy(*val, s);
> else
> @@ -143,7 +152,10 @@ ssize_t res_counter_read(struct res_counter *counter, int member,
>
> u64 res_counter_read_u64(struct res_counter *counter, int member)
> {
> - return *res_counter_member(counter, member);
> + if (member == RES_USAGE)
> + return percpu_counter_read_positive(&counter->usage);
> + else
> + return *res_counter_member(counter, member);
> }
>
> int res_counter_memparse_write_strategy(const char *buf,
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 48a38e1..36d46aa 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -58,6 +58,20 @@ static DEFINE_MUTEX(memcg_tasklist); /* can be hold under cgroup_mutex */
> #define SOFTLIMIT_EVENTS_THRESH (1000)
>
> /*
> + * To help resource counters scale, we take a step back
> + * and allow the counters to be scalable and set a
> + * batch value such that every addition does not cause
> + * global synchronization. The side-effect will be visible
> + * on limit enforcement, where due to this fuzziness,
> + * we will lose out on inforcing a limit when the usage
> + * exceeds the limit. The plan however in the long run
> + * is to allow this value to be controlled. We will
> + * probably add a new control file for it. This will be
> + * moved to a config option later.
> + */
> +#define MEM_CGROUP_RES_ERR_TOLERANCE (64 * 1024)
> +
> +/*
> * Statistics for memory cgroup.
> */
> enum mem_cgroup_stat_index {
> @@ -2340,7 +2354,7 @@ static int mem_cgroup_force_empty(struct mem_cgroup *mem, bool free_all)
> if (free_all)
> goto try_to_free;
> move_account:
> - while (mem->res.usage > 0) {
> + while (res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
> ret = -EBUSY;
> if (cgroup_task_count(cgrp) || !list_empty(&cgrp->children))
> goto out;
> @@ -2383,7 +2397,7 @@ try_to_free:
> lru_add_drain_all();
> /* try to free all pages in this cgroup */
> shrink = 1;
> - while (nr_retries && mem->res.usage > 0) {
> + while (nr_retries && res_counter_read_u64(&mem->res, RES_USAGE) > 0) {
> int progress;
>
> if (signal_pending(current)) {
> @@ -2401,7 +2415,7 @@ try_to_free:
> }
> lru_add_drain();
> /* try move_account...there may be some *locked* pages. */
> - if (mem->res.usage)
> + if (res_counter_read_u64(&mem->res, RES_USAGE))
> goto move_account;
> ret = 0;
> goto out;
> @@ -3019,8 +3033,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> }
>
> if (parent && parent->use_hierarchy) {
> - res_counter_init(&mem->res, &parent->res);
> - res_counter_init(&mem->memsw, &parent->memsw);
> + res_counter_init(&mem->res, &parent->res,
> + MEM_CGROUP_RES_ERR_TOLERANCE);
> + res_counter_init(&mem->memsw, &parent->memsw,
> + MEM_CGROUP_RES_ERR_TOLERANCE);
> /*
> * We increment refcnt of the parent to ensure that we can
> * safely access it on res_counter_charge/uncharge.
> @@ -3029,8 +3045,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> */
> mem_cgroup_get(parent);
> } else {
> - res_counter_init(&mem->res, NULL);
> - res_counter_init(&mem->memsw, NULL);
> + res_counter_init(&mem->res, NULL,
> + MEM_CGROUP_RES_ERR_TOLERANCE);
> + res_counter_init(&mem->memsw, NULL,
> + MEM_CGROUP_RES_ERR_TOLERANCE);
> }
> mem->last_scanned_child = 0;
> spin_lock_init(&mem->reclaim_param_lock);
>
> --
> Balbir
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists