Message-ID: <48DD1C32.60207@gmail.com>
Date:	Fri, 26 Sep 2008 19:30:26 +0200
From:	Andrea Righi <righi.andrea@...il.com>
To:	Vivek Goyal <vgoyal@...hat.com>
CC:	Hirokazu Takahashi <taka@...inux.co.jp>, ryov@...inux.co.jp,
	linux-kernel@...r.kernel.org, dm-devel@...hat.com,
	containers@...ts.linux-foundation.org,
	virtualization@...ts.linux-foundation.org,
	xen-devel@...ts.xensource.com, fernando@....ntt.co.jp,
	balbir@...ux.vnet.ibm.com, xemul@...nvz.org, agk@...rceware.org,
	jens.axboe@...cle.com
Subject: Re: dm-ioband + bio-cgroup benchmarks

Andrea Righi wrote:
> Andrea Righi wrote:
>> Vivek Goyal wrote:
>> [snip]
>>> Ok, I will give more details of the thought process.
>>>
>>> I was thinking of maintaining an rb-tree per request queue and not an
>>> rb-tree per cgroup. This tree can contain all the bios submitted to that
>>> request queue through __make_request(). Every node in the tree will represent
>>> one cgroup and will contain a list of bios issued by the tasks of that
>>> cgroup.
>>>
>>> Every bio entering the request queue through the __make_request() function
>>> will first be queued in one of the nodes of this rb-tree, depending on which
>>> cgroup that bio belongs to.
>>>
>>> Once the bios are buffered in the rb-tree, we release them to the underlying
>>> elevator depending on the proportionate weight of the nodes/cgroups.
>>>
>>> Some more details which I was trying to implement yesterday.
>>>
>>> There will be one bio_cgroup object per cgroup. This object will contain
>>> many bio_group objects, one created for each request queue where a bio from
>>> that bio_cgroup is queued. Essentially the idea is that bios belonging to a
>>> cgroup can be on various request queues in the system, so a single object
>>> cannot serve the purpose as it cannot be on many rb-trees at the same time.
>>> Hence we create one sub-object which keeps track of the bios belonging to
>>> one cgroup on a particular request queue.
>>>
>>> Each bio_group will contain a list of bios, and this bio_group object will
>>> be a node in the rb-tree of the request queue. For example, let's say there
>>> are two request queues in the system, q1 and q2 (say they belong to /dev/sda
>>> and /dev/sdb), and that a task t1 in /cgroup/io/test1 is issuing IO both
>>> to /dev/sda and /dev/sdb.
>>>
>>> The bio_cgroup belonging to /cgroup/io/test1 will have two sub bio_group
>>> objects, say bio_group1 and bio_group2. bio_group1 will be in q1's rb-tree
>>> and bio_group2 will be in q2's rb-tree. bio_group1 will contain a list of
>>> bios issued by task t1 for /dev/sda and bio_group2 will contain a list of
>>> bios issued by task t1 for /dev/sdb. I thought the same could be extended
>>> to stacked devices as well.
>>>   
>>> I am still trying to implement it and hopefully this is a doable idea.
>>> I think at the end of the day it will be something very close to the
>>> dm-ioband algorithm, just that there will be no lvm driver and no notion
>>> of a separate dm-ioband device.
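
Just to check that I'm parsing the layout correctly, here is how I picture
the objects you describe (type and field names are mine and purely
illustrative, not code from any existing tree):

#include <linux/rbtree.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/blkdev.h>
#include <linux/cgroup.h>

/*
 * One bio_cgroup per cgroup; one bio_group per (cgroup, request queue)
 * pair; each request queue keeps an rb-tree whose nodes are bio_groups.
 */
struct bio_group {
	struct rb_node		rb_node;	/* node in the queue's rb-tree */
	struct request_queue	*q;		/* the queue this group feeds */
	struct bio_cgroup	*biocg;		/* owning cgroup */
	struct list_head	bios;		/* bios buffered for this cgroup on
						 * this queue (via a small wrapper
						 * or bi_next chaining) */
	unsigned int		weight;		/* copied from the cgroup */
};

struct bio_cgroup {
	struct cgroup_subsys_state css;
	unsigned int		weight;		/* proportional share */
	struct list_head	groups;		/* its bio_groups, one per request
						 * queue it has touched */
	spinlock_t		lock;
};

/*
 * In __make_request(): find or create the bio_group for (bio's cgroup, q),
 * buffer the bio there, and later release buffered bios to the elevator in
 * proportion to the weights of the nodes in the queue's rb-tree.
 */

Is that roughly what you have in mind?
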
>> Vivek, thanks for the detailed explanation. Just one comment: if we don't
>> also adapt the per-process optimizations/improvements made by some IO
>> schedulers, I think we can end up with undesirable behaviours.
>>
>> For example, CFQ uses the per-process iocontext to improve fairness
>> between *all* the processes in a system, but it has no notion of a
>> cgroup context sitting on top of the processes.
>>
>> So, some optimizations made to guarantee fairness among processes could
>> conflict with the algorithms implemented at the cgroup layer, and
>> potentially lead to undesirable behaviours.
>>
>> For example, an issue I'm experiencing with my cgroup-io-throttle
>> patchset is that a cgroup can consistently increase its IO rate (while
>> always respecting the max limits) simply by increasing its number of IO
>> worker tasks with respect to another cgroup that has fewer IO workers.
>> This is probably due to the fact that CFQ tries to give the same amount
>> of "IO time" to all the tasks, without considering that they are
>> organized in cgroups.
> 
> BTW, this is why I proposed using a single shared iocontext for all the
> processes running in the same cgroup. Anyway, this is not the best
> solution, because in this way all the IO requests coming from a cgroup
> are queued to the same cfq queue. If I'm not wrong, this way we would
> effectively get noop (FIFO) scheduling between tasks belonging to the
> same cgroup and CFQ between cgroups. But, at least for this particular
> case, we would be able to provide fairness among cgroups.
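
To make the shared-iocontext point more concrete: CLONE_IO (available since
2.6.25, if I remember correctly) already lets a cloned task share the
parent's io_context, so CFQ sees the two tasks as a single IO owner; a
per-cgroup shared iocontext would basically extend the same idea to every
task in the cgroup. A purely illustrative userspace sketch (not part of any
patchset):

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef CLONE_IO
#define CLONE_IO 0x80000000	/* child shares the parent's io_context */
#endif

#define STACK_SIZE	(256 * 1024)

static int reader(void *arg)
{
	/* IO issued here is charged to the shared io_context, so CFQ sees
	 * parent and child as one IO owner instead of two. */
	char buf[4096];
	int fd = open(arg, O_RDONLY);

	if (fd < 0)
		return 1;
	while (read(fd, buf, sizeof(buf)) > 0)
		;
	close(fd);
	return 0;
}

int main(int argc, char **argv)
{
	char *stack;
	pid_t pid;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	stack = malloc(STACK_SIZE);
	if (!stack)
		return 1;

	/* The child is created sharing the caller's io_context. */
	pid = clone(reader, stack + STACK_SIZE, CLONE_IO | SIGCHLD, argv[1]);
	if (pid < 0) {
		perror("clone");
		return 1;
	}
	waitpid(pid, NULL, 0);
	free(stack);
	return 0;
}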

Ah! Also have a look at this:

http://download.systemimager.org/~arighi/linux/patches/io-throttle/benchmark/graph/effect-of-per-process-cfq-fairness-on-the-cgroup-context.png

The graph highlights the dependency between the IO rate and the number
of tasks running in a cgroup. For this testcase I've used 2 cgroups:

- cgroup A, with a single task doing IO (large O_DIRECT read stream)
- cgroup B, with a variable number of tasks ranging from 1 to 16 doing
  IO in parallel

If we want to be "fair", the gap in IO performance between the two cgroups
should be close to 0.
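
For reference, the reader in cgroup A is just a large sequential O_DIRECT
stream, something along these lines (an illustrative sketch, not the exact
program used for the test):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK_SIZE	(1024 * 1024)	/* 1 MiB per read */

int main(int argc, char **argv)
{
	void *buf;
	int fd;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file-or-device>\n", argv[0]);
		return 1;
	}

	/* O_DIRECT requires a suitably aligned buffer (and sizes/offsets). */
	if (posix_memalign(&buf, 4096, BLOCK_SIZE))
		return 1;

	fd = open(argv[1], O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Large sequential reads, bypassing the page cache. */
	while (read(fd, buf, BLOCK_SIZE) > 0)
		;

	close(fd);
	free(buf);
	return 0;
}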

Using "plain" cfq (red line), the performance gap increases as the number
of tasks in cgroup B increases.

Using cgroup-io-throttle on top of cfq (green line), the performance gap
is smaller (the asymptotic curve is due to the bandwidth capping provided
by cgroup-io-throttle).

Using cgroup-io-throttle with a single shared iocontext per cgroup
(blue line), the performance gap is really close to 0.

Anyway, I repeat, I don't think this is a wonderful solution; it is just
meant to highlight this issue and share with you the results of some tests
I did.

-Andrea
