[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4793E047.1000602@users.sourceforge.net>
Date: Mon, 21 Jan 2008 00:59:03 +0100 (MET)
From: Andrea Righi <righiandr@...rs.sourceforge.net>
To: Jens Axboe <jens.axboe@...cle.com>
Cc: Naveen Gupta <ngupta@...gle.com>, Paul Menage <menage@...gle.com>,
Dhaval Giani <dhaval@...ux.vnet.ibm.com>,
Balbir Singh <balbir@...ux.vnet.ibm.com>,
LKML <linux-kernel@...r.kernel.org>,
Pavel Emelyanov <xemul@...nvz.org>
Subject: Re: [PATCH] cgroup: limit block I/O bandwidth
Jens Axboe wrote:
> On Sun, Jan 20 2008, Andrea Righi wrote:
>> Jens Axboe wrote:
>>> Your approach is totally flawed, imho. For instance, you don't want a
>>> process to be able to dirty memory at foo mb/sec but only actually
>>> write them out at bar mb/sec.
>> Right. Actually my problem here is that the processes that write out
>> blocks are different respect to the processes that write bytes in
>> memory, and I would be able to add limits on those processes that are
>> dirtying memory.
>
> That's another reason why you cannot do this on a per-process or group
> basis, because you have no way of mapping back from the io queue path
> which process originally dirtied this memory.
>
>>> The noop-iosched changes are also very buggy. The queue back pointer
>>> breaks reference counting and the task pointer storage assumes the task
>>> will also always be around. That's of course not the case.
>> Yes, this really need a lot of fixes. I simply posted the patch to know
>> if such approach (in general) could have sense or not.
>
> It doesn't need fixes, it needs to be redesigned :-). No amount of
> fixing will make the patch you posted correct, since the approach is
> simply not feasible.
>
>>> IOW, you are doing this at the wrong level.
>>>
>>> What problem are you trying to solve?
>> Limiting block I/O bandwidth for tasks that belong to a generic cgroup,
>> in order to provide a sort of a QoS on block I/O.
>>
>> Anyway, I'm quite new in the wonderful land of the I/O scheduling, so
>> any help is appreciated.
>
> For starters, you want to throttle when queuing IO, not dispatching it.
> If you need to modify IO schedulers, then you are already at the wrong
> level. That doesn't solve the write problem, but reads can be done.
>
> If you want to solve for both read/write(2), then move the code to that
> level. That wont work for eg mmap, though...
>
> And as Balbir notes, the openvz group have been looking at some of these
> problems as well. As has lots of other people btw, you probably want to
> search around a bit and acquaint yourself with some of that work.
>
OK, now I understand, the main problem is that pages can be written to
the block device when information about the real process that touched
those pages in memory isn't available anymore. So, the I/O scheduler is
not the right place to do such limitations. Another approach would be to
just limit the I/O requests/sec for read operations and the dirty
memory/sec for write operations (see below). But this is ugly and not
efficient at all, since it just limits the writes in memory and not in
the actual block devices.
AFAIK openvz supports the per-VE I/O priority (via CFQ), that is great, but
this isn't the same as bandwidth limiting. Anyway I'll look closely at the
openvz work to see how they addressed the problem.
Thanks,
-Andrea
Signed-off-by: Andrea Righi <a.righi@...eca.it>
---
diff -urpN linux-2.6.24-rc8/block/io-throttle.c linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c
--- linux-2.6.24-rc8/block/io-throttle.c 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/io-throttle.c 2008-01-21 00:40:25.000000000 +0100
@@ -0,0 +1,221 @@
+/*
+ * io-throttle.c
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ *
+ * Copyright (C) 2008 Andrea Righi <a.righi@...eca.it>
+ */
+
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/slab.h>
+#include <linux/gfp.h>
+#include <linux/err.h>
+#include <linux/sched.h>
+#include <linux/fs.h>
+#include <linux/jiffies.h>
+#include <linux/spinlock.h>
+#include <linux/io-throttle.h>
+
+struct iothrottle {
+ struct cgroup_subsys_state css;
+ spinlock_t lock;
+ unsigned long iorate;
+ unsigned long req;
+ unsigned long last_request;
+};
+
+static inline struct iothrottle *cgroup_to_iothrottle(struct cgroup *cont)
+{
+ return container_of(cgroup_subsys_state(cont, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+static inline struct iothrottle *task_to_iothrottle(struct task_struct *task)
+{
+ return container_of(task_subsys_state(task, iothrottle_subsys_id),
+ struct iothrottle, css);
+}
+
+/*
+ * Rules: you can only create a cgroup if:
+ * 1. you are capable(CAP_SYS_ADMIN)
+ * 2. the target cgroup is a descendant of your own cgroup
+ *
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static struct cgroup_subsys_state *iothrottle_create(
+ struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ struct iothrottle *iot;
+
+ if (!capable(CAP_SYS_ADMIN))
+ return ERR_PTR(-EPERM);
+
+ if (!cgroup_is_descendant(cont))
+ return ERR_PTR(-EPERM);
+
+ iot = kzalloc(sizeof(struct iothrottle), GFP_KERNEL);
+ if (unlikely(!iot))
+ return ERR_PTR(-ENOMEM);
+
+ spin_lock_init(&iot->lock);
+ iot->last_request = jiffies;
+
+ return &iot->css;
+}
+
+/*
+ * Note: called from kernel/cgroup.c with cgroup_lock() held.
+ */
+static void iothrottle_destroy(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ kfree(cgroup_to_iothrottle(cont));
+}
+
+static ssize_t iothrottle_read(struct cgroup *cont, struct cftype *cft,
+ struct file *file, char __user *buf,
+ size_t nbytes, loff_t *ppos)
+{
+ ssize_t count, ret;
+ unsigned long delta, iorate, req, last_request;
+ struct iothrottle *iot;
+ char *page;
+
+ page = (char *)__get_free_page(GFP_TEMPORARY);
+ if (!page)
+ return -ENOMEM;
+
+ cgroup_lock();
+ if (cgroup_is_removed(cont)) {
+ cgroup_unlock();
+ ret = -ENODEV;
+ goto out;
+ }
+
+ iot = cgroup_to_iothrottle(cont);
+ spin_lock_irq(&iot->lock);
+
+ delta = (long)jiffies - (long)iot->last_request;
+ iorate = iot->iorate;
+ req = iot->req;
+ last_request = iot->last_request;
+
+ spin_unlock_irq(&iot->lock);
+ cgroup_unlock();
+
+ /* print additional debugging stuff */
+ count = sprintf(page, "bandwidth-max: %lu KiB/sec\n"
+ " requested: %lu bytes\n"
+ " last request: %lu jiffies\n"
+ " delta: %lu jiffies\n",
+ iorate, req, last_request, delta);
+
+ ret = simple_read_from_buffer(buf, nbytes, ppos, page, count);
+
+out:
+ free_page((unsigned long)page);
+ return ret;
+}
+
+static int iothrottle_write_uint(struct cgroup *cont, struct cftype *cft,
+ u64 val)
+{
+ struct iothrottle *iot;
+ int ret = 0;
+
+ cgroup_lock();
+ if (cgroup_is_removed(cont)) {
+ ret = -ENODEV;
+ goto out;
+ }
+
+ iot = cgroup_to_iothrottle(cont);
+
+ spin_lock_irq(&iot->lock);
+ iot->iorate = (unsigned long)val;
+ iot->req = 0;
+ iot->last_request = jiffies;
+ spin_unlock_irq(&iot->lock);
+
+out:
+ cgroup_unlock();
+ return ret;
+}
+
+static struct cftype files[] = {
+ {
+ .name = "bandwidth",
+ .read = iothrottle_read,
+ .write_uint = iothrottle_write_uint,
+ },
+};
+
+static int iothrottle_populate(struct cgroup_subsys *ss, struct cgroup *cont)
+{
+ return cgroup_add_files(cont, ss, files, ARRAY_SIZE(files));
+}
+
+struct cgroup_subsys iothrottle_subsys = {
+ .name = "blockio",
+ .create = iothrottle_create,
+ .destroy = iothrottle_destroy,
+ .populate = iothrottle_populate,
+ .subsys_id = iothrottle_subsys_id,
+};
+
+void cgroup_io_account(size_t bytes)
+{
+ struct iothrottle *iot;
+
+ iot = task_to_iothrottle(current);
+ if (!iot || !iot->iorate)
+ return;
+
+ iot->req += bytes;
+}
+EXPORT_SYMBOL(cgroup_io_account);
+
+void io_throttle(void)
+{
+ struct iothrottle *iot;
+ unsigned long delta, t;
+ long sleep;
+
+ iot = task_to_iothrottle(current);
+ if (!iot || !iot->iorate)
+ return;
+
+ delta = (long)jiffies - (long)iot->last_request;
+ if (!delta)
+ return;
+
+ t = msecs_to_jiffies(iot->req / iot->iorate);
+ if (!t)
+ return;
+
+ sleep = t - delta;
+ if (sleep > 0) {
+ pr_debug("io-throttle: task %p (%s) must sleep %lu jiffies\n",
+ current, current->comm, sleep);
+ schedule_timeout_uninterruptible(sleep);
+ }
+
+ iot->req = 0;
+ iot->last_request = jiffies;
+}
+EXPORT_SYMBOL(io_throttle);
diff -urpN linux-2.6.24-rc8/block/ll_rw_blk.c linux-2.6.24-rc8-cgroup-io-throttling/block/ll_rw_blk.c
--- linux-2.6.24-rc8/block/ll_rw_blk.c 2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/ll_rw_blk.c 2008-01-21 00:40:09.000000000 +0100
@@ -31,6 +31,7 @@
#include <linux/blktrace_api.h>
#include <linux/fault-inject.h>
#include <linux/scatterlist.h>
+#include <linux/io-throttle.h>
/*
* for max sense size
@@ -3368,6 +3369,8 @@ void submit_bio(int rw, struct bio *bio)
count_vm_events(PGPGOUT, count);
} else {
task_io_account_read(bio->bi_size);
+ cgroup_io_account(bio->bi_size);
+ io_throttle();
count_vm_events(PGPGIN, count);
}
diff -urpN linux-2.6.24-rc8/block/Makefile linux-2.6.24-rc8-cgroup-io-throttling/block/Makefile
--- linux-2.6.24-rc8/block/Makefile 2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/block/Makefile 2008-01-21 00:40:09.000000000 +0100
@@ -12,3 +12,5 @@ obj-$(CONFIG_IOSCHED_CFQ) += cfq-iosched
obj-$(CONFIG_BLK_DEV_IO_TRACE) += blktrace.o
obj-$(CONFIG_BLOCK_COMPAT) += compat_ioctl.o
+
+obj-$(CONFIG_CGROUP_IO_THROTTLE) += io-throttle.o
diff -urpN linux-2.6.24-rc8/fs/buffer.c linux-2.6.24-rc8-cgroup-io-throttling/fs/buffer.c
--- linux-2.6.24-rc8/fs/buffer.c 2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/fs/buffer.c 2008-01-21 00:40:09.000000000 +0100
@@ -41,6 +41,7 @@
#include <linux/bitops.h>
#include <linux/mpage.h>
#include <linux/bit_spinlock.h>
+#include <linux/io-throttle.h>
static int fsync_buffers_list(spinlock_t *lock, struct list_head *list);
@@ -713,12 +714,14 @@ static int __set_page_dirty(struct page
__inc_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
+ cgroup_io_account(PAGE_CACHE_SIZE);
}
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
}
write_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+ io_throttle();
return 1;
}
diff -urpN linux-2.6.24-rc8/fs/direct-io.c linux-2.6.24-rc8-cgroup-io-throttling/fs/direct-io.c
--- linux-2.6.24-rc8/fs/direct-io.c 2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/fs/direct-io.c 2008-01-21 00:40:09.000000000 +0100
@@ -35,6 +35,7 @@
#include <linux/buffer_head.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
+#include <linux/io-throttle.h>
#include <asm/atomic.h>
/*
@@ -667,6 +668,8 @@ submit_page_section(struct dio *dio, str
* Read accounting is performed in submit_bio()
*/
task_io_account_write(len);
+ cgroup_io_account(len);
+ io_throttle();
}
/*
diff -urpN linux-2.6.24-rc8/include/linux/cgroup_subsys.h linux-2.6.24-rc8-cgroup-io-throttling/include/linux/cgroup_subsys.h
--- linux-2.6.24-rc8/include/linux/cgroup_subsys.h 2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/include/linux/cgroup_subsys.h 2008-01-21 00:40:09.000000000 +0100
@@ -37,3 +37,9 @@ SUBSYS(cpuacct)
/* */
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+SUBSYS(iothrottle)
+#endif
+
+/* */
+
diff -urpN linux-2.6.24-rc8/include/linux/io-throttle.h linux-2.6.24-rc8-cgroup-io-throttling/include/linux/io-throttle.h
--- linux-2.6.24-rc8/include/linux/io-throttle.h 1970-01-01 01:00:00.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/include/linux/io-throttle.h 2008-01-21 00:40:09.000000000 +0100
@@ -0,0 +1,12 @@
+#ifndef IO_THROTTLE_H
+#define IO_THROTTLE_H
+
+#ifdef CONFIG_CGROUP_IO_THROTTLE
+extern void io_throttle(void);
+extern void cgroup_io_account(size_t bytes);
+#else
+static inline void io_throttle(void) { }
+static inline void cgroup_io_account(size_t bytes) { }
+#endif /* CONFIG_CGROUP_IO_THROTTLE */
+
+#endif
diff -urpN linux-2.6.24-rc8/init/Kconfig linux-2.6.24-rc8-cgroup-io-throttling/init/Kconfig
--- linux-2.6.24-rc8/init/Kconfig 2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/init/Kconfig 2008-01-21 00:40:09.000000000 +0100
@@ -313,6 +313,16 @@ config CGROUP_NS
for instance virtual servers and checkpoint/restart
jobs.
+config CGROUP_IO_THROTTLE
+ bool "Enable cgroup I/O throttling (EXPERIMENTAL)"
+ depends on EXPERIMENTAL && CGROUPS
+ select TASK_IO_ACCOUNTING
+ help
+ This allows to limit the maximum I/O bandwidth for specific
+ cgroup(s).
+
+ Say N if unsure.
+
config CPUSETS
bool "Cpuset support"
depends on SMP && CGROUPS
diff -urpN linux-2.6.24-rc8/mm/page-writeback.c linux-2.6.24-rc8-cgroup-io-throttling/mm/page-writeback.c
--- linux-2.6.24-rc8/mm/page-writeback.c 2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/mm/page-writeback.c 2008-01-21 00:40:09.000000000 +0100
@@ -34,6 +34,7 @@
#include <linux/syscalls.h>
#include <linux/buffer_head.h>
#include <linux/pagevec.h>
+#include <linux/io-throttle.h>
/*
* The maximum number of pages to writeout in a single bdflush/kupdate
@@ -1014,6 +1015,7 @@ int __set_page_dirty_nobuffers(struct pa
__inc_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
task_io_account_write(PAGE_CACHE_SIZE);
+ cgroup_io_account(PAGE_CACHE_SIZE);
}
radix_tree_tag_set(&mapping->page_tree,
page_index(page), PAGECACHE_TAG_DIRTY);
@@ -1023,6 +1025,7 @@ int __set_page_dirty_nobuffers(struct pa
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
+ io_throttle();
return 1;
}
return 0;
diff -urpN linux-2.6.24-rc8/mm/readahead.c linux-2.6.24-rc8-cgroup-io-throttling/mm/readahead.c
--- linux-2.6.24-rc8/mm/readahead.c 2008-01-16 05:22:48.000000000 +0100
+++ linux-2.6.24-rc8-cgroup-io-throttling/mm/readahead.c 2008-01-21 00:40:09.000000000 +0100
@@ -16,6 +16,7 @@
#include <linux/task_io_accounting_ops.h>
#include <linux/pagevec.h>
#include <linux/pagemap.h>
+#include <linux/io-throttle.h>
void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
{
@@ -76,6 +77,8 @@ int read_cache_pages(struct address_spac
break;
}
task_io_account_read(PAGE_CACHE_SIZE);
+ cgroup_io_account(PAGE_CACHE_SIZE);
+ io_throttle();
}
return ret;
}
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists