Signed-off-by: Vivek Goyal
Index: linux17/Documentation/controllers/io-controller.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux17/Documentation/controllers/io-controller.txt 2008-11-06 09:12:44.000000000 -0500
@@ -0,0 +1,172 @@
+                                IO Controller
+                                =============
+
+Design
+======
+This patchset implements a basic version of a proportional weight IO
+controller. It is heavily derived from the dm-ioband IO controller, with one
+key difference: there is no separate device mapper driver, and there is no
+need to create a dm-ioband device on top of every block device that needs
+IO control. In this implementation, all the control logic has been
+internalized and made per request queue. Enabling or disabling IO control
+on a block device is just a matter of writing a 1 or 0 to the appropriate
+sysfs file.
+
+This is a proportional weight controller, which means that cgroups are
+assigned shares and tasks in those cgroups get to dispatch bios in
+proportion to their cgroup's share.
+
+All the contending cgroups are assigned tokens in proportion to their
+weights. One token is charged for one sector of IO. Once all the contending
+cgroups have consumed their tokens, a fresh token allocation takes place.
+This is how disk bandwidth is allocated in proportion to weight.
+
+The bigger picture is that all the bios submitted to a block device are
+first inspected by the IO controller logic (bio_group_controller()), but only
+if the IO controller has been enabled on that device. The cgroup of the bio
+is determined and the controller checks whether this cgroup has sufficient
+tokens to dispatch the bio. If it does, the submitting thread continues to
+dispatch the bio through the normal path; otherwise the IO controller
+buffers the bio and the submitting thread returns.
+These buffered bios are dispatched to lower layers later, once the associated
+group (bio group) has sufficient tokens to dispatch them. This delayed
+dispatching is done by a worker thread (biogroup).
+
+IO control can be enabled or disabled dynamically on any block device
+through the sysfs file system. For example, to enable IO control on a device,
+do the following:
+
+echo 1 > /sys/block/sda/biogroup
+
+To disable IO control, write 0:
+
+echo 0 > /sys/block/sda/biogroup
+
+This should be doable for any block device in the stack. Currently this
+patch places the hooks only for the device mapper driver; md still needs
+to be tweaked.
+
+For example, assume there are two cgroups A and B with weights 1024 and 2048
+in the system. Tasks in both cgroups are doing IO to two disks, sda and sdb,
+and a user has enabled IO control on both disks. Now on both sda and sdb,
+tasks in cgroup B will get to use 2/3 of the disk bandwidth and tasks in
+cgroup A will get to use 1/3 of the disk bandwidth, but only in case of
+contention. If tasks in one of the groups stop doing IO to a particular disk,
+tasks in the other group get the full disk bandwidth for that duration.
+
+
+HOWTO
+=====
+- Enable cgroup, memory controller and block IO controller in the kernel
+  config file.
+
+- Boot into the kernel and mount the IO controller.
+
+        mount -t cgroup -o bio none /cgroup/bio/
+
+- Create two cgroups, test1 and test2.
+
+        cd /cgroup/bio
+        mkdir test1 test2
+
+- Allocate weight 4096 to test1 and weight 2048 to test2.
+
+        echo 4096 > /cgroup/bio/test1/bio.shares
+        echo 2048 > /cgroup/bio/test2/bio.shares
+
+- Launch "dd" operations in cgroup test1 and test2.
+
+        echo $$ > /cgroup/bio/test1/tasks
+        dd if=/somefile1 of=/dev/null &
+        echo $$ > /cgroup/bio/test2/tasks
+        dd if=/somefile2 of=/dev/null &
+
+The job in cgroup test1 should finish before the job in cgroup test2.
+To verify that "dd" in cgroup test1 dispatched more bios than "dd" in test2,
+look at "bio.aggregate_tokens" in both cgroups at the same time. At any point
+while both dd's are running, aggregate_tokens in cgroup test1 should be
+approximately double the aggregate_tokens in cgroup test2 (because the weight
+of cgroup test1 is double the weight of cgroup test2).
+
+Some Tunables
+=============
+Some tunables appear in the cgroup file system and in sysfs for the respective
+device, for debugging and for configuration. Here is a brief description.
+
+Cgroup Files
+============
+bio.shares
+        Specifies the weight of the cgroup.
+
+bio.aggregate_tokens
+        Specifies the total number of tokens dispatched by this cgroup. One
+        token represents one sector of IO.
+
+bio.jiffies
+        The jiffies value when the last bio from this cgroup was released.
+
+bio.nr_token_slices
+        How many times this cgroup received a token allocation from a token
+        slice. A token slice is created and every contending cgroup gets its
+        piece of the slice based on its share.
+
+bio.nr_off_the_tree
+        How many times this bio group went off the list of contending groups.
+        An rb-tree of the biogroups contending for IO is maintained, and
+        tokens are allocated to these groups regularly. If a group stops
+        doing IO, it is considered idle and removed from the tree; it is
+        added back later when the group has IO to perform. This file just
+        counts how many times this bio group went off the tree.
+
+Sysfs Tunables
+==============
+/sys/block/{device name}/biogroup
+        Whether the IO controller (bio groups) is active on this device.
+
+/sys/block/{device name}/deftoken
+        Default number of tokens given to a bio group upon start if all the
+        bio groups were of the same weight. The token slice is of dynamic
+        length.
+        With 3 contending cgroups and deftoken of 100, the token slice length
+        is 100*3 = 300, divided among the three groups by weight.
+
+/sys/block/{device name}/idletime
+        The time after which a bio group that has not generated any bio is
+        considered idle and removed from the rb-tree. The current default
+        is 8ms.
+
+/sys/block/{device name}/newslice_count
+        How many times a new token allocation took place on this queue.
+
+TODO
+====
+- Do extensive testing in various scenarios, optimize performance, and fix
+  whatever is broken.
+
+- IO schedulers derive context information from "current". This assumption
+  breaks if bios are submitted by a worker thread (biogroup). We probably
+  need to put an io context pointer in the bio itself to get rid of this
+  dependency.
+
+- Allocating tokens per sector of IO is a crude approximation and will lead
+  to unfair bandwidth allocation when a task in one cgroup is doing sequential
+  IO and a task in another group is doing random IO. Rik van Riel suggested
+  that we should probably switch to a time-based scheme: keep track of the
+  average time it takes to complete IO from a cgroup and allocate accordingly.
+
+- Currently this controller depends on the memory controller being enabled.
+  Try to reduce this coupling.
+
+ISSUES
+======
+- The IO controller can buffer bios if sufficient tokens were not available
+  at the time of bio submission. Once the tokens are available, these bios
+  are dispatched to the elevator/lower layers in first-come, first-served
+  order. This has the potential to break CFQ, where an RT task should be able
+  to dispatch its bios first, or a high priority task should be able to
+  release more bios than a low priority task in the same cgroup.
+
+  Not sure how to fix it. Maybe we need to maintain another rb-tree, keep
+  track of RT tasks and task priorities, and dispatch accordingly.
+  This amounts to duplicating a lot of CFQ logic, and I am not sure how
+  it would impact AS behaviour.
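
The deftoken arithmetic described under the sysfs tunables can be sketched
in a few lines of shell. This is only an illustration, not code from the
patch: token_split is a hypothetical helper, and the assumption that each
group's allocation is slice * weight / total weight, rounded down, is inferred
from the description of deftoken rather than taken from the implementation.

```shell
#!/bin/sh
# Hypothetical helper, not part of the patch: split one token slice among
# contending cgroups in proportion to their weights.
# Usage: token_split DEFTOKEN WEIGHT...
token_split() {
    deftoken=$1
    shift
    # The slice length grows with the number of contending groups.
    slice=$((deftoken * $#))
    total=0
    for w in "$@"; do
        total=$((total + w))
    done
    # Integer division; rounding down is an assumption of this sketch.
    for w in "$@"; do
        echo $((slice * w / total))
    done
}

# The HOWTO weights (test1 = 4096, test2 = 2048) with deftoken = 100
# print 133 and 66, one per line.
token_split 100 4096 2048
```

With equal weights every group receives exactly deftoken tokens, which matches
the description of deftoken as the per-group default.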