linux-kernel - Re: [PATCH] Memory usage limit notification addition to memcg

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite for Android: free password hash cracker in your pocket
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20090413160814.1c335d1d.akpm@linux-foundation.org>
Date:	Mon, 13 Apr 2009 16:08:14 -0700
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Dan Malek <dan@...eddedalley.com>
Cc:	linux-kernel@...r.kernel.org, menage@...gle.com,
	dan@...eddedalley.com, containers@...ts.linux-foundation.org
Subject: Re: [PATCH] Memory usage limit notification addition to memcg


Please cc containers@...ts.linux-foundation.org on this sort of thing
so that all the right people get to see it.

On Mon, 13 Apr 2009 18:08:32 -0400
Dan Malek <dan@...eddedalley.com> wrote:

> This patch updates the Memory Controller cgroup to add
> a configurable memory usage limit notification.  The feature
> was presented at the April 2009 Embedded Linux Conference.
> 
> Signed-off-by: Dan Malek <dan@...eddedalley.com>
> ---
>  Documentation/cgroups/mem_notify.txt |  129 +++++++++++++++++++++
>  include/linux/memcontrol.h           |    7 +
>  init/Kconfig                         |    9 ++
>  mm/memcontrol.c                      |  207 ++++++++++++++++++++++++++++++++++
>  4 files changed, 352 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/cgroups/mem_notify.txt
> 
> diff --git a/Documentation/cgroups/mem_notify.txt b/Documentation/cgroups/mem_notify.txt
> new file mode 100644
> index 0000000..72d5c26
> --- /dev/null
> +++ b/Documentation/cgroups/mem_notify.txt
> @@ -0,0 +1,129 @@
> +
> +Memory Limit Notificiation
> +
> +Attempts have been made in the past to provide a mechanism for
> +the notification to processes (task, an address space) when memory
> +usage is approaching a high limit.  The intention is that it gives
> +the application an opportunity to release some memory and continue
> +operation rather than be OOM killed.  The CE Linux Forum requested
> +a more comtemporary implementation, and this is the result.
> +
> +The memory limit notification is a configurable extension to the
> +existing Memory Resource Controller.  Please read memory.txt in this
> +directory to understand its operation before continuing here.
> +
> +1. Operation
> +
> +When a kernel is configured with CGROUP_MEM_NOTIFY, three additional
> +files will appear in the memory resource controller:
> +
> +	memory.notify_limit_percent

We've run into problems in the past where a percentage number is too
coarse on large-memory systems.

Proabably that won't be an issue here, but I invite you to convince us
of this ;)


> +	memory.notify_limit_usage
> +	memory.notify_limit_lowait
> +
> +The notification is based upon reaching a percentage of the memory
> +resource controller limit (memory.limit_in_bytes).  When the controller
> +group is created, the percentage is set to 100.  Any integer percentage
> +may be set by writing to memory.notify_limit_percent, such as:
> +
> +	echo 80 > memory.notify_limit_percent
> +
> +The current integer usage percentage may be read at any time from
> +the memory.notify_limit_usage file.
> +
> +The memory.notify_limit_lowait is a blocking read file.  The read will
> +block until one of four conditions occurs:
> +
> +    - The usage reaches or exceeds the memory.notify_limit_percent
> +    - The memory.notify_limit_lowait file is written with any value (debug)
> +    - A thread is moved to another controller group
> +    - The cgroup is destroyed or forced empty (memory.force_empty)

Does it support select()/poll()/eventfd()/etc?

> +
> +1.1 Example Usage
> +
> +An application must be designed to properly take advantage of this
> +memory limit notification feature.  It is a powerful management component
> +of some operating systems and embedded devices that must provide
> +highly available and reliable computing services.  The application works
> +in conjunction with information provided by the operating system to
> +control limited resource usage.  Since many programmers still think
> +memory is infinite and never check the return value from malloc(), it
> +may come as a surprise that such mechanisms have been utilized long ago.
> +
> +A typical application will be multithreaded, with one thread either
> +polling or waiting for the notification event.  When the event occurs,
> +the thread will take whatever action is appropriate within the application
> +design.  This could be actually running a garbage collection algorithm
> +or to simply signal other processing threads they must do something to
> +reduce their memory usage.  The notification thread will then be required
> +to poll the actual usage until the low limit of its choosing is met,
> +at which time the reclaim of memory can stop and the notification thread
> +will wait for the next event.
> +
> +Internally, the application only needs to fopen("memory.notify_limit_usage" ..)
> +and fopen("memory.notify_limit_lowait" ...), then either poll the former
> +file or block read on the latter file using fread() or fscanf() as desired.
> +
> +2. Configuration
> +
> +Follow the instructions in memory.txt for the configuration and usage of
> +the Memory Resource Controller cgroup.  Once this is created and tasks
> +assigned, use the memory limit notification as described here.
> +
> +The only action that is needed outside of the application waiting or polling
> +is to set the memory.notify_limit_percent.  To set a notification to occur
> +when memory usage of the cgroup reaches or exceeds 80 percent can be
> +simply done:
> +
> +	echo 80 > memory.notify_limit_percent
> +
> +This value may be read or changed at any time.  Writing a lower value once
> +the Memory Resource Controller is in operation may trigger immediate
> +notification if the usage is above the new limit.
> +
> +3. Debug and Testing
> +
> +The design of cgroups makes it easier to perform some debugging or
> +monitoring tasks without modification to the application.  For example,
> +a write of any value to memory.notify_limit_lowait will wake up all
> +threads waiting for notifications regardless of current memory usage.
> +
> +Collecting performance data about the cgroup is also simplified, as
> +no application modifications are necessary.  A separate task can be
> +created that will open and monitor any necessary files of the cgroup
> +(such as current limits, usage and usage percentages and even when
> +notification occurs).  This task can also operate outside of the cgroup,
> +so its memory usage is not charged to the cgroup.
> +
> +4. Design
> +
> +The memory limit notification is a configurable extension to the
> +existing Memory Resource Controller, which operates as described to
> +track and manage the memory of the Control Group.  The Memory Resource
> +Controller will still continue to reclaim memory under pressure
> +of the limits, and may OOM kill tasks within the cgroup according to
> +the OOM Killer configuration.
> +
> +The memory notification limit was chosen as a percentage of the
> +memory in use so the cgroup paramaters may continue to be dynamically
> +modified without the need to modify the notificaton parameters.
> +Otherwise, the notification limit would have to also be computed
> +and modified on any Memory Resource Controller operating parameter change.
> +
> +The cgroup file semantics are not well suited for this type of notificaton
> +mechanism.  While applications may choose to simply poll the current
> +usage at their convenience, it was also desired to have a notification
> +event that would trigger when the usage attained the limit.  The
> +blocking read() was chosen, as it is the only current useful method.
> +This presented the problems of "out of band" notification, when you want
> +to return some exceptional status other than reaching the notification
> +limit.  In the cases listed above, the read() on the memory.notify_limit_lowait
> +file will not block and return "0" for the percentage.  When this occurs,
> +the thread must determine if the task has moved to a new cgroup or if
> +the cgroup has been destroyed.  Due to the usage model of this cgroup,
> +neither is likely to happen during normal operation of a product.
> +
> +Dan Malek <dan@...eddedalley.com>
> +Embedded Alley Solutions, Inc.
> +10 March 2009
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 18146c9..031e5d1 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -117,6 +117,13 @@ static inline bool mem_cgroup_disabled(void)
>  
>  extern bool mem_cgroup_oom_called(struct task_struct *task);
>  
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +extern void test_and_wakeup_notify(struct mem_cgroup *mcg,
> +				unsigned long long newlimit);
> +extern unsigned long compute_usage_percent(unsigned long long usage,
> +				unsigned long long limit);
> +#endif

Stylistic trick: here, please add

#else
static inline void test_and_wakeup_notify(struct mem_cgroup *mcg,
				unsigned long long newlimit)
{
}

static inline unsigned long compute_usage_percent(unsigned long long usage,
				unsigned long long limit)
{
	return 0;		/* ? */
}
#endif

and then remove the ifdefs from the .c files.

>  #else /* CONFIG_CGROUP_MEM_RES_CTLR */
>  struct mem_cgroup;
>  
> diff --git a/init/Kconfig b/init/Kconfig
> index f2f9b53..97138da 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -588,6 +588,15 @@ config CGROUP_MEM_RES_CTLR
>  	  This config option also selects MM_OWNER config option, which
>  	  could in turn add some fork/exit overhead.
>  
> +config CGROUP_MEM_NOTIFY
> +	bool "Memory Usage Limit Notification"
> +	depends on CGROUP_MEM_RES_CTLR
> +	help
> +	  Provides a memory notification when usage reaches a preset limit.
> +	  It is an extenstion to the memory resource controller, since it
> +	  uses the memory usage accounting of the cgroup to test against
> +	  the notification limit.  (See Documentation/cgroups/mem_notify.txt)
> +
>  config CGROUP_MEM_RES_CTLR_SWAP
>  	bool "Memory Resource Controller Swap Extension(EXPERIMENTAL)"
>  	depends on CGROUP_MEM_RES_CTLR && SWAP && EXPERIMENTAL
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 2fc6d6c..d6367ed 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6,6 +6,10 @@
>   * Copyright 2007 OpenVZ SWsoft Inc
>   * Author: Pavel Emelianov <xemul@...nvz.org>
>   *
> + * Memory Limit Notification update
> + * Copyright 2009 CE Linux Forum and Embedded Alley Solutions, Inc.
> + * Author: Dan Malek <dan@...eddedalley.com>
> + *
>   * This program is free software; you can redistribute it and/or modify
>   * it under the terms of the GNU General Public License as published by
>   * the Free Software Foundation; either version 2 of the License, or
> @@ -180,6 +184,11 @@ struct mem_cgroup {
>  	 * statistics. This must be placed at the end of memcg.
>  	 */
>  	struct mem_cgroup_stat stat;
> +
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +	unsigned long notify_limit_percent;
> +	wait_queue_head_t notify_limit_wait;
> +#endif
>  };
>  
>  enum charge_type {
> @@ -934,6 +943,21 @@ static int __mem_cgroup_try_charge(struct mm_struct *mm,
>  
>  	VM_BUG_ON(mem_cgroup_is_obsolete(mem));
>  
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +	/* We check on the way in so we don't have to duplicate code
> +	 * in both the normal and error exit path.
> +	 */
> +	if (likely(mem->res.limit != (unsigned long long)LLONG_MAX)) {
> +		unsigned long usage_pct;
> +
> +		usage_pct = compute_usage_percent(mem->res.usage + PAGE_SIZE,
> +								mem->res.limit);
> +		if ((usage_pct >= mem->notify_limit_percent) &&
> +		    waitqueue_active(&mem->notify_limit_wait))
> +			wake_up(&mem->notify_limit_wait);
> +	}
> +#endif

It would be nicer to pull this out into a separate function, I expect.

>  	while (1) {
>  		int ret;
>  		bool noswap = false;
> @@ -1663,6 +1687,13 @@ static int mem_cgroup_resize_limit(struct mem_cgroup *memcg,
>  	int children = mem_cgroup_count_children(memcg);
>  	u64 curusage, oldusage;
>  
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +	/* Test and notify ahead of the necessity to free pages, as
> +	 * applications giving up pages may help this reclaim procedure.
> +	 */
> +	test_and_wakeup_notify(memcg, val);
> +#endif

ifdefs-in-c make kernel developers sad.

>  	/*
>  	 * For keeping hierarchical_reclaim simple, how long we should retry
>  	 * is depends on callers. We set our retry-count to be function
> @@ -2215,6 +2246,147 @@ static int mem_cgroup_swappiness_write(struct cgroup *cgrp, struct cftype *cft,
>  	return 0;
>  }
>  
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +#define CGROUP_LOCAL_BUFFER_SIZE 64 /* Would be nice if this was in cgroup.h */
> +
> +/* The resource counters are defined as long long, but few processors
> + * handle 64-bit divisor in hardware, and the software to do it isn't
> + * present in the kernel.  It would be nice if the resource counters were
> + * platform specific configurable typedefs, but for now we'll just divide
> + * down the byte counters by the page size to get 32-bit arithmetic.
> + * With a 4K page size, this will work up to about 16384G resource limit.
> + */
> +unsigned long compute_usage_percent(unsigned long long usage,
> +					unsigned long long limit)

This is a poor choice of identifier for a global symbol.  Please prefix
it with some string which identifies its subsystem.


> +{
> +	unsigned long lim;
> +	unsigned long long usage_pct;
> +
> +	usage_pct = (usage / PAGE_SIZE) * 100;
> +	lim = (unsigned long)(limit / PAGE_SIZE);
> +
> +	do_div(usage_pct, lim);
> +
> +	return (unsigned long)usage_pct;
> +}
> +
> +void test_and_wakeup_notify(struct mem_cgroup *mcg, unsigned long long newlimit)

Ditto.

> +{
> +	unsigned long usage_pct;
> +
> +	/* Check to see if the new limit should cause notification.
> +	*/
> +	usage_pct = compute_usage_percent(mcg->res.usage, newlimit);
> +
> +	if ((usage_pct >= mcg->notify_limit_percent) &&
> +	    waitqueue_active(&mcg->notify_limit_wait))
> +		wake_up(&mcg->notify_limit_wait);
> +}

Also, identifiers such as "newlimit" are a bit unclear because they
don't communicate the units, and (less seriously) they don't
communicate what quantity they are measuring.  Something like
memory_newlimit_bytes would be clearer, although a bit silly.

I _assume_ these things are all operating in units of bytes.  Perhaps
it was pages?  That's my point...

> +static ssize_t notify_limit_read(struct cgroup *cgrp, struct cftype *cft,
> +			       struct file *file,
> +			       char __user *buf, size_t nbytes,
> +			       loff_t *ppos)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	char tmp[CGROUP_LOCAL_BUFFER_SIZE];
> +	int len;
> +
> +	len = sprintf(tmp, "%lu\n", memcg->notify_limit_percent);

The reader has to run around the tree to find out if there's a buffer
overflow here.  scnprintf() would set minds at ease.


> +	return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
> +}
> +
> +static int notify_limit_write(struct cgroup *cgrp, struct cftype *cft,
> +			    const char *buffer)
> +{
> +	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
> +	unsigned long val;
> +	char *endptr;
> +
> +	val = simple_strtoul(buffer, &endptr, 0);

strict_strtoul() please.  It's stricter.

> +	if (val > 100)
> +		return -EINVAL;
> +
> +	memcg->notify_limit_percent = val;
> +
> +	/* Check to see if the new percentage limit should cause notification.
> +	*/
> +	test_and_wakeup_notify(memcg, memcg->res.limit);
> +
> +	return 0;
> +}
> +
> +static ssize_t notify_limit_usage_read(struct cgroup *cgrp, struct cftype *cft,
> +			       struct file *file,
> +			       char __user *buf, size_t nbytes,
> +			       loff_t *ppos)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> +	char tmp[CGROUP_LOCAL_BUFFER_SIZE];
> +	unsigned long usage_pct;
> +	int len;
> +
> +	usage_pct = compute_usage_percent(mem->res.usage, mem->res.limit);
> +
> +	len = sprintf(tmp, "%lu\n", usage_pct);

scnprintf()

> +	return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
> +}
> +
> +static ssize_t notify_limit_lowait(struct cgroup *cgrp, struct cftype *cft,

Perhaps this function would benefit from a nice comment explaining its
design.  It's the core thing.

Than again, the .txt file has good coverage.

> +			       struct file *file,
> +			       char __user *buf, size_t nbytes,
> +			       loff_t *ppos)
> +{
> +	struct mem_cgroup *mem = mem_cgroup_from_cont(cgrp);
> +	char tmp[CGROUP_LOCAL_BUFFER_SIZE];
> +	unsigned long usage_pct;
> +	int len;
> +	DEFINE_WAIT(notify_lowait);
> +
> +	/* A memory resource usage of zero is a special case that
> +	 * causes us not to sleep.  It normally happens when the
> +	 * cgroup is about to be destroyed, and we don't want someone
> +	 * trying to sleep on a queue that is about to go away.  This
> +	 * condition can also be forced as part of testing.
> +	 */
> +	usage_pct = compute_usage_percent(mem->res.usage, mem->res.limit);
> +	if (likely(mem->res.usage != 0)) {
> +
> +		prepare_to_wait(&mem->notify_limit_wait, &notify_lowait,
> +							TASK_INTERRUPTIBLE);
> +
> +		if (usage_pct < mem->notify_limit_percent) {
> +			schedule();
> +
> +			/* Compute percentage we have now and return it.
> +			*/
> +			usage_pct = compute_usage_percent(mem->res.usage,
> +							mem->res.limit);
> +		}
> +		finish_wait(&mem->notify_limit_wait, &notify_lowait);
> +	}
> +
> +	len = sprintf(tmp, "%lu\n", usage_pct);

scnprintf()

> +	return simple_read_from_buffer(buf, nbytes, ppos, tmp, len);
> +}

What is the behavior of this read when someone sends the process a
signal?  Seems that it will return early, giving userspace a number
which it didn't expect to see.  I guess that's OK.

> +/* This is used to wake up all threads that may be hanging
> + * out waiting for a low memory condition prior to that happening.
> + * Useful for triggering the event to assist with debug of applications.
> + */
> +static int notify_limit_wake_em_up(struct cgroup *cgrp, unsigned int event)
> +{
> +	struct mem_cgroup *mem;
> +
> +	mem = mem_cgroup_from_cont(cgrp);
> +	wake_up(&mem->notify_limit_wait);
> +	return 0;
> +}
> +#endif /* CONFIG_CGROUP_MEM_NOTIFY */
> +
>  
>  static struct cftype mem_cgroup_files[] = {
>  	{
> @@ -2258,6 +2430,22 @@ static struct cftype mem_cgroup_files[] = {
>  		.read_u64 = mem_cgroup_swappiness_read,
>  		.write_u64 = mem_cgroup_swappiness_write,
>  	},
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +	{
> +		.name = "notify_limit_percent",
> +		.write_string = notify_limit_write,
> +		.read = notify_limit_read,
> +	},
> +	{
> +		.name = "notify_limit_usage",
> +		.read = notify_limit_usage_read,
> +	},
> +	{
> +		.name = "notify_limit_lowait",
> +		.trigger = notify_limit_wake_em_up,
> +		.read = notify_limit_lowait,
> +	},
> +#endif
>  };
>  
>  #ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
> @@ -2461,6 +2649,11 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
>  	mem->last_scanned_child = 0;
>  	spin_lock_init(&mem->reclaim_param_lock);
>  
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +	init_waitqueue_head(&mem->notify_limit_wait);
> +	mem->notify_limit_percent = 100;
> +#endif
> +
>  	if (parent)
>  		mem->swappiness = get_swappiness(parent);
>  	atomic_set(&mem->refcnt, 1);
> @@ -2504,6 +2697,20 @@ static void mem_cgroup_move_task(struct cgroup_subsys *ss,
>  				struct cgroup *old_cont,
>  				struct task_struct *p)
>  {
> +#ifdef CONFIG_CGROUP_MEM_NOTIFY
> +	/* We wake up all notification threads any time a migration takes
> +	 * place.  They will have to check to see if a move is needed to
> +	 * a new cgroup file to wait for notification.
> +	 * This isn't so much a task move as it is an attach.  A thread not
> +	 * a child of an existing task won't have a valid parent, which
> +	 * is necessary to test because it won't have a valid mem_cgroup
> +	 * either.  Which further means it won't have a proper wait queue
> +	 * and we can't do a wakeup.
> +	 */
> +	if (old_cont->parent != NULL)
> +		notify_limit_wake_em_up(old_cont, 0);
> +#endif
> +

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/