Message-ID: <20090708035252.GA3215@balbir.in.ibm.com>
Date: Wed, 8 Jul 2009 09:22:53 +0530
From: Balbir Singh <balbir@...ux.vnet.ibm.com>
To: Vladislav Buzov <vbuzov@...eddedalley.com>
Cc: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Linux Containers Mailing List
<containers@...ts.linux-foundation.org>,
Dan Malek <dan@...eddedalley.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Paul Menage <menage@...gle.com>,
KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
Subject: Re: [PATCH 1/1] Memory usage limit notification addition to memcg
* Vladislav Buzov <vbuzov@...eddedalley.com> [2009-07-07 13:25:10]:
> This patch updates the Memory Controller cgroup to add
> a configurable memory usage limit notification. The feature
> was presented at the April 2009 Embedded Linux Conference.
>
> Signed-off-by: Dan Malek <dan@...eddedalley.com>
> Signed-off-by: Vladislav Buzov <vbuzov@...eddedalley.com>
> ---
> Documentation/cgroups/mem_notify.txt | 140 ++++++++++++++++++++++++++
> include/linux/memcontrol.h | 21 ++++
> init/Kconfig | 9 ++
> mm/memcontrol.c | 178 ++++++++++++++++++++++++++++++++++
> 4 files changed, 348 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/cgroups/mem_notify.txt
>
> diff --git a/Documentation/cgroups/mem_notify.txt b/Documentation/cgroups/mem_notify.txt
> new file mode 100644
> index 0000000..b4f20d0
> --- /dev/null
> +++ b/Documentation/cgroups/mem_notify.txt
> @@ -0,0 +1,140 @@
> +
> +Memory Limit Notification
> +
> +Attempts have been made in the past to provide a mechanism for
> +the notification to processes (task, an address space) when memory
> +usage is approaching a high limit. The intention is that it gives
> +the application an opportunity to release some memory and continue
> +operation rather than be OOM killed. The CE Linux Forum requested
> +a more contemporary implementation, and this is the result.
> +
> +The memory threshold notification is a configurable extension to the
> +existing Memory Resource Controller. Please read memory.txt in this
> +directory to understand its operation before continuing here.
> +
> +1. Operation
> +
> +When a kernel is configured with CGROUP_MEM_NOTIFY, three additional
> +files will appear in the memory resource controller:
> +
> + memory.notify_threshold_in_bytes
> + memory.notify_available_in_bytes
> + memory.notify_threshold_lowait
> +
> +The notification is based upon reaching a threshold below the memory
> +resource controller limit (memory.limit_in_bytes). The threshold
> +represents the minimal number of bytes that should be available under
> +the limit. When the controller group is created, the threshold is set
> +to zero which triggers notification when the memory resource controller
> +limit is reached.
> +
> +The threshold may be set by writing to memory.notify_threshold_in_bytes,
> +such as:
> +
> + echo 10M > memory.notify_threshold_in_bytes
> +
> +The current number of available bytes may be read at any time from
> +the memory.notify_available_in_bytes file.
> +
> +The memory.notify_threshold_lowait is a blocking read file. The read will
> +block until one of four conditions occurs:
> +
> + - The amount of available memory is equal to or less than the threshold
> + defined in memory.notify_threshold_in_bytes
> + - The memory.notify_threshold_lowait file is written with any value (debug)
> + - A thread is moved to another controller group
> + - The cgroup is destroyed or forced empty (memory.force_empty)
> +
> +
> +1.1 Example Usage
> +
> +An application must be designed to properly take advantage of this
> +memory threshold notification feature. It is a powerful management component
> +of some operating systems and embedded devices that must provide
> +highly available and reliable computing services. The application works
> +in conjunction with information provided by the operating system to
> +control limited resource usage. Since many programmers still think
> +memory is infinite and never check the return value from malloc(), it
> +may come as a surprise that such mechanisms have been utilized long ago.
> +
> +A typical application will be multithreaded, with one thread either
> +polling or waiting for the notification event. When the event occurs,
> +the thread will take whatever action is appropriate within the application
> +design. This could be actually running a garbage collection algorithm
> +or to simply signal other processing threads they must do something to
> +reduce their memory usage. The notification thread will then be required
> +to poll the actual usage until the low limit of its choosing is met,
> +at which time the reclaim of memory can stop and the notification thread
> +will wait for the next event.
> +
> +Internally, the application only needs to
> +fopen("memory.notify_available_in_bytes" ..) or
> +fopen("memory.notify_threshold_lowait" ...), then either poll the former
> +file or block read on the latter file using fread() or fscanf() as desired.
> +Comparing the value returned from either of these read function with the
> +value obtained by reading memory.notify_threshold_in_bytes will be an
> +indication of the amount of memory used over the threshold limit.
Polling is never good (from a power consumption and efficiency
viewpoint), unless by poll you mean select() and waiting on events.
A blocking read requires a dedicated thread; adding select() support
or some other notification mechanism would allow the software to wait
on several events at the same time.
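To illustrate the point about waiting on several events at once, here
is a minimal userspace sketch built on poll(2). It is only an
illustration: as far as I can tell, the patch as posted offers a
blocking read and no poll support on memory.notify_threshold_lowait, so
the kernel side would first have to provide a pollable descriptor.
Pipes stand in for the event sources, and MAX_SOURCES,
wait_any_readable() and demo_fire() are hypothetical names, not part of
the patch.

```c
#include <assert.h>
#include <poll.h>
#include <unistd.h>

#define MAX_SOURCES 8	/* arbitrary cap for this sketch */

/*
 * Block until one of the descriptors is readable or timeout_ms expires.
 * Returns the index of the first readable descriptor, or -1 on timeout
 * or error.
 */
static int wait_any_readable(const int *fds, int nfds, int timeout_ms)
{
	struct pollfd pfds[MAX_SOURCES];
	int i;

	assert(nfds > 0 && nfds <= MAX_SOURCES);
	for (i = 0; i < nfds; i++) {
		pfds[i].fd = fds[i];
		pfds[i].events = POLLIN;
		pfds[i].revents = 0;
	}
	if (poll(pfds, (nfds_t)nfds, timeout_ms) <= 0)
		return -1;
	for (i = 0; i < nfds; i++)
		if (pfds[i].revents & POLLIN)
			return i;
	return -1;
}

/*
 * Two pipes stand in for two event sources (say, a memory notification
 * and a network socket). 'which' selects the source that fires: 0 or 1,
 * or -1 for none. Returns what wait_any_readable() reports, or -2 if
 * the setup itself failed.
 */
static int demo_fire(int which)
{
	int p[2][2], fds[2], idx;

	if (pipe(p[0]) != 0)
		return -2;
	if (pipe(p[1]) != 0) {
		close(p[0][0]);
		close(p[0][1]);
		return -2;
	}
	fds[0] = p[0][0];
	fds[1] = p[1][0];
	if (which >= 0 && write(p[which][1], "x", 1) != 1)
		idx = -2;
	else
		idx = wait_any_readable(fds, 2, 100);
	close(p[0][0]);
	close(p[0][1]);
	close(p[1][0]);
	close(p[1][1]);
	return idx;
}
```

With something like this, a single thread can service the memory
notification alongside its other descriptors instead of parking a
dedicated thread in read().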
> +
> +2. Configuration
> +
> +Follow the instructions in memory.txt for the configuration and usage of
> +the Memory Resource Controller cgroup. Once this is created and tasks
> +assigned, use the memory threshold notification as described here.
> +
> +The only action that is needed outside of the application waiting or polling
> +is to set the memory.notify_threshold_in_bytes. To set a notification to occur
> +when memory usage of the cgroup reaches or exceeds 1 MByte below the limit
> +can be simply done:
> +
> + echo 1M > memory.notify_threshold_in_bytes
> +
> +This value may be read or changed at any time. Writing a higher value once
> +the Memory Resource Controller is in operation may trigger immediate
> +notification if the usage is above the new threshold.
> +
> +3. Debug and Testing
> +
> +The design of cgroups makes it easier to perform some debugging or
> +monitoring tasks without modification to the application. For example,
> +a write of any value to memory.notify_threshold_lowait will wake up all
> +threads waiting for notifications regardless of current memory usage.
> +
> +Collecting performance data about the cgroup is also simplified, as
> +no application modifications are necessary. A separate task can be
> +created that will open and monitor any necessary files of the cgroup
> +(such as current limits, usage and usage percentages and even when
> +notification occurs). This task can also operate outside of the cgroup,
> +so its memory usage is not charged to the cgroup.
> +
> +4. Design
> +
> +The memory threshold notification is a configurable extension to the
> +existing Memory Resource Controller, which operates as described to
> +track and manage the memory of the Control Group. The Memory Resource
> +Controller will still continue to reclaim memory under pressure
> +of the limits, and may OOM kill tasks within the cgroup according to
> +the OOM Killer configuration.
> +
> +The memory notification threshold was chosen as a number of bytes of the
> +memory not in use so the cgroup parameters may continue to be dynamically
Could you clarify the meaning of "not in use"?
> +modified without the need to modify the notification parameters.
> +Otherwise, the notification threshold would have to also be computed
> +and modified on any Memory Resource Controller operating parameter change.
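For reference, the code later in this patch computes the available
amount as res.limit - res.usage (mem_cgroup_notify_available_read) and
wakes waiters once that value drops to or below the threshold. A minimal
userspace sketch of that arithmetic, assuming this reading of "not in
use" is what the documentation intends, shows why the threshold can stay
fixed while the limit changes. MB(), notify_available() and notify_due()
are names made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define MB(x) ((uint64_t)(x) << 20)

/* Bytes still available under the limit: limit - usage, as in the patch. */
static uint64_t notify_available(uint64_t limit, uint64_t usage)
{
	assert(usage <= limit);	/* this sketch does not model overshoot */
	return limit - usage;
}

/* Nonzero once the available amount has dropped to the threshold or below. */
static int notify_due(uint64_t limit, uint64_t usage, uint64_t threshold)
{
	return notify_available(limit, usage) <= threshold;
}
```

With a 10M threshold, a 64M limit triggers at 54M of usage and a 128M
limit triggers at 118M; the threshold itself is untouched by the limit
change.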
> +
> +The cgroup file semantics are not well suited for this type of notification
> +mechanism. While applications may choose to simply poll the current
> +usage at their convenience, it was also desired to have a notification
> +event that would trigger when the usage attained the threshold. The
> +blocking read() was chosen, as it is the only current useful method.
Could you please elaborate on why other mechanisms would not work?
Hint: please see cgroupstats.
> +This presented the problems of "out of band" notification, when you want
> +to return some exceptional status other than reaching the notification
> +threshold. In the cases listed above, the read() on the
> +memory.notify_threshold_lowait file will not block and return "0" for
> +the remaining size. When this occurs, the thread must determine if the task
> +has moved to a new cgroup or if the cgroup has been destroyed. Due to
> +the usage model of this cgroup, neither is likely to happen during normal
> +operation of a product.
> +
> +Dan Malek <dan@...eddedalley.com>
> +Embedded Alley Solutions, Inc.
> +6 July 2009
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index e46a073..78205a3 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -118,6 +118,27 @@ static inline bool mem_cgroup_disabled(void)
>
> extern bool mem_cgroup_oom_called(struct task_struct *task);
> void mem_cgroup_update_mapped_file_stat(struct page *page, int val);
> +
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +void mem_cgroup_notify_test_and_wakeup(struct mem_cgroup *mcg,
> + unsigned long long usage, unsigned long long limit);
> +void mem_cgroup_notify_new_limit(struct mem_cgroup *mcg,
> + unsigned long long newlimit);
> +void mem_cgroup_notify_move_task(struct cgroup *old_cont);
> +#else
> +static inline void mem_cgroup_notify_test_and_wakeup(struct mem_cgroup *mcg,
> + unsigned long long usage, unsigned long long limit)
> +{
> +}
> +static inline void mem_cgroup_notify_new_limit(struct mem_cgroup *mcg,
> + unsigned long long newlimit)
> +{
> +}
> +static inline void mem_cgroup_notify_move_task(struct cgroup *old_cont)
> +{
> +}
> +#endif
> +
> #else /* CONFIG_CGROUP_MEM_RES_CTLR */
> struct mem_cgroup;
>
> diff --git a/init/Kconfig b/init/Kconfig
> index 1ce05a4..fb2f7d5 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -594,6 +594,15 @@ config CGROUP_MEM_RES_CTLR
> This config option also selects MM_OWNER config option, which
> could in turn add some fork/exit overhead.
>
> +config CGROUP_MEM_NOTIFY
> + bool "Memory Usage Limit Notification"
> + depends on CGROUP_MEM_RES_CTLR
> + help
> + Provides a memory notification when usage reaches a preset limit.
> + It is an extension to the memory resource controller, since it
> + uses the memory usage accounting of the cgroup to test against
> + the notification limit. (See Documentation/cgroups/mem_notify.txt)
> +
> config CGROUP_MEM_RES_CTLR_SWAP
> bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
> depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e2fa20d..cf04279 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6,6 +6,10 @@
> * Copyright 2007 OpenVZ SWsoft Inc
> * Author: Pavel Emelianov <xemul@...nvz.org>
> *
> + * Memory Limit Notification update
> + * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc.
> + * Author: Dan Malek <dan@...eddedalley.com>
> + *
> * This program is free software; you can redistribute it and/or modify
> * it under the terms of the GNU General Public License as published by
> * the Free Software Foundation; either version 2 of the License, or
> @@ -180,6 +184,11 @@ struct mem_cgroup {
> /* set when res.limit == memsw.limit */
> bool memsw_is_minimum;
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + unsigned long long notify_threshold_bytes;
> + wait_queue_head_t notify_threshold_wait;
> +#endif
> +
> /*
> * statistics. This must be placed at the end of memcg.
> */
> @@ -995,6 +1004,13 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>
> VM_BUG_ON(css_is_removed(&mem->css));
>
> + /*
> + * We check on the way in so we don't have to duplicate code
> + * in both the normal and error exit path.
> + */
> + mem_cgroup_notify_test_and_wakeup(mem, mem->res.usage + PAGE_SIZE,
> + mem->res.limit);
> +
I don't think it is a good idea to read mem->res.* directly
without any protection.
> while (1) {
> int ret;
> bool noswap = false;
> @@ -1744,6 +1760,12 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
> u64 curusage, oldusage;
>
> /*
> + * Test and notify ahead of the necessity to free pages, as
> + * applications giving up pages may help this reclaim procedure.
> + */
> + mem_cgroup_notify_new_limit(memcg, val);
> +
> + /*
> * For keeping hierarchical_reclaim simple, how long we should retry
> * is depends on callers. We set our retry-count to be function
> * of # of children which we should visit in this loop.
> @@ -2308,6 +2330,139 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
> return 0;
> }
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +/*
> + * Check if a task exceeded notification threshold set for a memory cgroup.
> + * Wake up waiting notification threads, if any.
> + */
> +void mem_cgroup_notify_test_and_wakeup(struct mem_cgroup *mcg,
Could you please use mem or memcg instead of mcg? That has been the
standard convention in our code.
> + unsigned long long usage,
> + unsigned long long limit)
> +{
> + if (unlikely(usage == RESOURCE_MAX))
I don't think it is a good idea to use unlikely() here, since it is
always likely for root to be at RESOURCE_MAX. Using likely/unlikely on
user parameters is, IMHO, not a good idea.
> + return;
> +
> + if ((limit - usage <= mcg->notify_threshold_bytes) &&
> + waitqueue_active(&mcg->notify_threshold_wait))
> + wake_up(&mcg->notify_threshold_wait);
> +}
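One detail worth checking in the comparison above: limit and usage are
unsigned, and the call in __mem_cgroup_try_charge passes
res.usage + PAGE_SIZE, so usage may exceed limit, in which case
limit - usage wraps to a very large value rather than going negative.
The following userspace sketch contrasts the test as posted with a
hypothetical clamped variant. Both function names are invented here,
and the clamped form is only one possible way to handle the case.

```c
#include <assert.h>
#include <stdint.h>

/* Same comparison as in the patch; wraps when usage > limit. */
static int notify_due_as_posted(uint64_t usage, uint64_t limit,
				uint64_t threshold)
{
	return limit - usage <= threshold;
}

/* Hypothetical variant: treat any overshoot as zero bytes available. */
static int notify_due_clamped(uint64_t usage, uint64_t limit,
			      uint64_t threshold)
{
	uint64_t available = usage < limit ? limit - usage : 0;

	return available <= threshold;
}
```

With a 1M limit, a zero threshold and usage one 4K page past the limit,
the posted comparison reports no notification while the clamped one
does. When usage stays below the limit, both agree.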
> +/*
> + * Check if current notification threshold exceeds new memory usage
> + * limit set for a memory cgroup. If so, set threshold to zero to
> + * notify tasks in the group when maximal memory usage is achieved.
> + */
> +void mem_cgroup_notify_new_limit(struct mem_cgroup *mcg,
> + unsigned long long newlimit)
> +{
> + if (newlimit <= mcg->notify_threshold_bytes)
> + mcg->notify_threshold_bytes = 0;
> +
> + mem_cgroup_notify_test_and_wakeup(mcg, mcg->res.usage, newlimit);
> +}
Again, I am confused about the mutual exclusion: what protects the new
values being written?
> +
> +static u64 mem_cgroup_notify_threshold_read(struct cgroup *cgrp,
> + struct cftype *cft)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + return memcg->notify_threshold_bytes;
> +}
> +
> +static int mem_cgroup_notify_threshold_write(struct cgroup *cgrp,
> + struct cftype *cft,
> + const char *buffer)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + unsigned long long val;
> + int ret;
> +
> + /* This function does all necessary parse...reuse it */
> + ret = res_counter_memparse_write_strategy(buffer, &val);
> + if (ret)
> + return ret;
> +
> + /* Threshold must be lower than usage limit */
> + if (val >= memcg->res.limit)
> + return -EINVAL;
> +
> + memcg->notify_threshold_bytes = val;
> +
> + /* Check to see if the new threshold should cause notification */
> + mem_cgroup_notify_test_and_wakeup(memcg, memcg->res.usage,
> + memcg->res.limit);
> +
> + return 0;
> +}
> +
> +static u64 mem_cgroup_notify_available_read(struct cgroup *cgrp,
> + struct cftype *cft)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> + return memcg->res.limit - memcg->res.usage;
> +}
Please use the res_counter abstractions to read mem->res values.
> +
> +static u64 mem_cgroup_notify_threshold_lowait(struct cgroup *cgrp,
> + struct cftype *cft)
> +{
> + struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> + unsigned long long available_bytes;
> + DEFINE_WAIT(notify_lowait);
> +
> + /*
> + * A memory resource usage of zero is a special case that
> + * causes us not to sleep. It normally happens when the
> + * cgroup is about to be destroyed, and we don't want someone
> + * trying to sleep on a queue that is about to go away. This
> + * condition can also be forced as part of testing.
> + */
> + available_bytes = mem->res.limit - mem->res.usage;
> + if (likely(mem->res.usage != 0)) {
> +
> + prepare_to_wait(&mem->notify_threshold_wait, &notify_lowait,
> + TASK_INTERRUPTIBLE);
> +
> + if (available_bytes > mem->notify_threshold_bytes)
> + schedule();
> +
> + available_bytes = mem->res.limit - mem->res.usage;
> +
> + finish_wait(&mem->notify_threshold_wait, &notify_lowait);
> + }
> +
> + return available_bytes;
> +}
> +
> +/*
> + * This is used to wake up all threads that may be hanging
> + * out waiting for a low memory condition prior to that happening.
> + * Useful for triggering the event to assist with debug of applications.
> + */
> +static int mem_cgroup_notify_threshold_wake_em_up(struct cgroup *cgrp,
> + unsigned int event)
> +{
> + struct mem_cgroup *mem;
> +
> + mem = mem_cgroup_from_cont(cgrp);
> + wake_up(&mem->notify_threshold_wait);
> + return 0;
> +}
> +
> +/*
> + * We wake up all notification threads any time a migration takes
> + * place. They will have to check to see if a move is needed to
> + * a new cgroup file to wait for notification.
> + * This isn't so much a task move as it is an attach. A thread not
> + * a child of an existing task won't have a valid parent, which
> + * is necessary to test because it won't have a valid mem_cgroup
> + * either. Which further means it won't have a proper wait queue
> + * and we can't do a wakeup.
> + */
> +void mem_cgroup_notify_move_task(struct cgroup *old_cont)
> +{
> + if (old_cont->parent != NULL)
> + mem_cgroup_notify_threshold_wake_em_up(old_cont, 0);
> +}
> +#endif /* CONFIG_CGROUP_MEM_NOTIFY */
> +
>
> static struct cftype mem_cgroup_files[] = {
> {
> @@ -2351,6 +2506,22 @@ static struct cftype mem_cgroup_files[] = {
> .read_u64 = mem_cgroup_swappiness_read,
> .write_u64 = mem_cgroup_swappiness_write,
> },
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + {
> + .name = "notify_threshold_in_bytes",
> + .write_string = mem_cgroup_notify_threshold_write,
> + .read_u64 = mem_cgroup_notify_threshold_read,
> + },
> + {
> + .name = "notify_available_in_bytes",
> + .read_u64 = mem_cgroup_notify_available_read,
> + },
> + {
> + .name = "notify_threshold_lowait",
> + .trigger = mem_cgroup_notify_threshold_wake_em_up,
> + .read_u64 = mem_cgroup_notify_threshold_lowait,
> + },
> +#endif
> };
>
> #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> @@ -2554,6 +2725,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
> mem->last_scanned_child = 0;
> spin_lock_init(&mem->reclaim_param_lock);
>
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> + init_waitqueue_head(&mem->notify_threshold_wait);
> + mem->notify_threshold_bytes = 0;
> +#endif
> +
> if (parent)
> mem->swappiness = get_swappiness(parent);
> atomic_set(&mem->refcnt, 1);
> @@ -2597,6 +2773,8 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
> struct cgroup *old_cont,
> struct task_struct *p)
> {
> + mem_cgroup_notify_move_task(old_cont);
> +
> mutex_lock(&memcg_tasklist);
> /*
> * FIXME: It's better to move charges of this process from old
> --
> 1.5.6.3
>
--
Balbir