Message-ID: <20090312180126.GI10919@redhat.com>
Date: Thu, 12 Mar 2009 14:01:26 -0400
From: Vivek Goyal <vgoyal@...hat.com>
To: Andrew Morton <akpm@...ux-foundation.org>
Cc: nauman@...gle.com, dpshah@...gle.com, lizf@...fujitsu.com,
mikew@...gle.com, fchecconi@...il.com, paolo.valente@...more.it,
jens.axboe@...cle.com, ryov@...inux.co.jp,
fernando@...ellilink.co.jp, s-uchida@...jp.nec.com,
taka@...inux.co.jp, guijianfeng@...fujitsu.com,
arozansk@...hat.com, jmoyer@...hat.com, oz-kernel@...hat.com,
dhaval@...ux.vnet.ibm.com, balbir@...ux.vnet.ibm.com,
linux-kernel@...r.kernel.org,
containers@...ts.linux-foundation.org, menage@...gle.com,
peterz@...radead.org, Andrea Righi <righi.andrea@...il.com>
Subject: Re: [PATCH 01/10] Documentation
On Thu, Mar 12, 2009 at 12:11:46AM -0700, Andrew Morton wrote:
> On Wed, 11 Mar 2009 21:56:46 -0400 Vivek Goyal <vgoyal@...hat.com> wrote:
>
> > +Currently "current" task
> > +is used to determine the cgroup (hence io group) of the request. Down the
> > +line we need to make use of bio-cgroup patches to map delayed writes to
> > +right group.
>
> You handled this problem pretty neatly!
>
> It's always been a BIG problem for all the io-controlling schemes, and
> most of them seem to have "handled" it in the above way :(
>
> But for many workloads, writeback is the majority of the IO and it has
> always been the form of IO which has caused us the worst contention and
> latency problems. So I don't think that we can proceed with _anything_
> until we at least have a convincing plan here.
>
Hi Andrew,

Nauman is already maintaining the bio-cgroup patches (originally from the
valinux folks) on top of this patchset to attribute write requests to the
correct cgroup. We did not include them in the initial posting because we
thought the patchset would bloat further.

We can also pull the bio-cgroup patches into this series to attribute writes
to the right cgroup.

>
> Also.. there are so many IO controller implementations that I've lost
> track of who is doing what. I do have one private report here that
> Andreas's controller "is incredibly productive for us and has allowed
> us to put twice as many users per server with faster times for all
> users". Which is pretty stunning, although it should be viewed as a
> condemnation of the current code, I'm afraid.
>
I had looked briefly at Andrea's implementation in the past. I will look
again. I had thought that this approach did not get much traction.
Some quick thoughts about this approach though.
- It is not a proportional weight controller. It is more about limiting
  bandwidth in absolute numbers for each cgroup on each disk.

  So each cgroup will define a rule for each disk in the system, stating
  the maximum rate at which that cgroup can issue IO to that disk, and the
  IO from that cgroup is throttled if the rate is exceeded.

  The above requirement can create configuration problems.

    - If there are a large number of disks in the system, one has to
      create rules for each disk in every cgroup, unless the admin knows
      which applications are in which cgroup and exactly which disks
      these applications do IO to, and creates rules for only those
      disks.

    - I think the problem gets compounded if there is a hierarchy of
      logical devices. I think in that case one has to create
      rules for the logical devices and not the actual physical devices.

- Because it is not proportional weight distribution, if some
  cgroup is not using its planned BW, the other groups sharing the
  disk can not make use of the spare BW.

- I think one should know in advance the throughput rate of the underlying
  media and also know the competing applications so that one can statically
  define the BW assigned to each cgroup on each disk.

  This will be difficult. The effective BW extracted out of rotational media
  depends on the seek pattern, so one has to either make some conservative
  estimates and divide the BW (we will not utilize the disk fully) or take
  some peak numbers and divide the BW (a cgroup might not get the
  maximum rate configured).

- The above problems will compound when one goes for deeper hierarchical
  configurations.

I think for renewable resources like disk time, it might be a good idea
to do a proportional weight controller to ensure fairness and at the same
time achieve the best throughput possible.
Andrea, please correct me if I have misunderstood the things.
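
The difference between absolute limits and proportional weights discussed
above can be illustrated with a toy model. This is only a sketch under loud
assumptions: the disk is treated as having a fixed capacity in MB/s (which,
as noted above, is not really true for rotational media), and the function
names, group names and numbers are hypothetical, not taken from any of the
patchsets.

```python
def absolute_caps(caps, demand):
    """Max-bandwidth model: each group gets min(demand, cap).

    Bandwidth a group leaves unused is NOT handed to other groups.
    """
    return {g: min(demand[g], caps[g]) for g in caps}


def proportional_weights(weights, demand, capacity):
    """Proportional-weight model (toy water-filling, hypothetical).

    Capacity is split by weight among groups that still want more;
    whatever a satisfied group does not need is redistributed.
    """
    alloc = {g: 0.0 for g in weights}
    active = {g for g in weights if demand[g] > 0}
    remaining = float(capacity)
    while active and remaining > 1e-9:
        total_w = sum(weights[g] for g in active)
        share = {g: remaining * weights[g] / total_w for g in active}
        # Groups whose outstanding demand fits inside their share.
        satisfied = {g for g in active if demand[g] - alloc[g] <= share[g]}
        if not satisfied:
            # Everybody wants more than their share: hand out shares, done.
            for g in active:
                alloc[g] += share[g]
            break
        for g in satisfied:
            remaining -= demand[g] - alloc[g]
            alloc[g] = float(demand[g])
        active -= satisfied
    return alloc


# Hypothetical disk of 100 MB/s; group A is mostly idle, group B is busy.
demand = {"A": 10, "B": 100}
print(absolute_caps({"A": 50, "B": 50}, demand))
print(proportional_weights({"A": 1, "B": 1}, demand, 100))
```

With equal caps of 50 MB/s, the busy group stays at its cap while the spare
bandwidth of the idle group is left unused; with equal weights, the busy
group picks up whatever the idle group does not consume.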
> So my question is: what is the definitive list of
> proposed-io-controller-implementations and how do I cunningly get all
> you guys to check each others homework? :)
I will try to summarize some of the proposals I am aware of.
- Elevator/IO scheduler modification based IO controllers
    - This proposal
    - cfq io scheduler based control (Satoshi Uchida, NEC)
    - One more cfq based io control (vasily, OpenVZ)
    - AS io scheduler based control (Naveen Gupta, Google)

- Io-throttling (Andrea Righi)
    - Max Bandwidth Controller

- dm-ioband (valinux)
    - Proportional weight IO controller.

- Generic IO controller (Vivek Goyal, RedHat)
    - My initial attempt to do proportional division of amount of bio
      per cgroup at request queue level. This was inspired from
      dm-ioband.

I think this proposal should hopefully meet the requirements as envisioned
by other elevator based IO controller solutions.
dm-ioband
---------
I have briefly looked at dm-ioband also and following were some of the
concerns I had raised in the past.
- Need of a dm device for every device we want to control

    - This requirement looks odd. It forces everybody to use dm-tools,
      and if there are lots of disks in the system, configuration is a
      pain.

- It does not support hierarchical grouping.

- Possibly can break the assumptions of underlying IO schedulers.

    - There is no notion of task classes. So tasks of all the classes
      are at the same level from a resource contention point of view.
      The only thing which differentiates them is cgroup weight, which
      does not address the expectation that an RT task or RT cgroup should
      starve the peer cgroup if need be, since an RT cgroup should get
      priority access.

    - Because of FIFO release of buffered bios, it is possible that a
      task of lower priority gets more IO done than a task of higher
      priority.

    - Buffering at multiple levels and FIFO dispatch can lead to more
      interesting, hard-to-solve issues.

        - Assume there is a sequential reader and an aggressive
          writer in the cgroup. It might happen that the writer
          pushes a lot of write requests into the FIFO queue first
          and then a read request from the reader comes. Now it might
          happen that cfq does not see this read request for a long
          time (if the cgroup weight is low) and the writer will
          starve the reader in this cgroup.

          Even cfq anticipation logic will not help here because
          when that first read request actually gets to cfq, cfq might
          choose to idle for more read requests to come, but the
          aggressive writer might have again flooded the FIFO queue
          in the group, and cfq will not see the subsequent read request
          for a long time and will unnecessarily idle for reads.

- Task grouping logic

    - We already have the notion of cgroup where tasks can be grouped
      in a hierarchical manner. dm-ioband does not make full use of that
      and comes up with its own mechanism of grouping tasks (apart from
      cgroup). And there are odd ways of specifying the cgroup id while
      configuring the dm-ioband device.

      IMHO, once somebody has created the cgroup hierarchy, any IO
      controller logic should be able to internally read that hierarchy
      and provide control. There should not be a need for any other
      configuration utility on top of cgroup.

      My RFC patches had tried to get rid of this external
      configuration requirement.

- Tasks and groups can not be treated at the same level.

    - Because at any second level solution we are controlling bio
      per cgroup and don't have any notion of which task queue a bio
      belongs to, one can not treat tasks and groups at the same level.

      What I meant is the following.

                  root
                / |  \
               1  2   A
                     / \
                    3   4

      In the dm-ioband approach, at the top level tasks 1 and 2 will get
      50% of BW together and group A will get 50%. Ideally, along the lines
      of the cpu controller, I would expect it to be 33% each for task 1,
      task 2 and group A.

      This can create interesting scenarios. Assume task 1 is
      an RT class task. Now one would expect task 1 to get all the BW
      possible, starving task 2 and group A, but that will not be the
      case and task 1 will get 50% of BW.

      Not that it is critically important, but it would probably be
      nice if we can maintain the same semantics as the cpu controller. In
      an elevator layer solution we can do it at least for the CFQ scheduler
      as it maintains a separate io queue per io context.

      This is in general an issue for any 2nd level IO controller which
      only accounts for io groups and not for io queues per process.

- We will end up copying a lot of code/logic from cfq

    - To address many of the concerns, like a multi-class scheduler,
      we will end up duplicating IO scheduler code. Why can't
      we have one-point hierarchical IO scheduling (this patchset)?

Thanks
Vivek