linux-kernel - Re: [RFC] [PATCH] cfq-iosched: add cfq group hierarchical scheduling support

Message-ID: <AANLkTin8zv9GdBnSsU-a2XjeqXrvr0fTNY2ZMTbGxiVd@mail.gmail.com>
Date:	Wed, 1 Sep 2010 10:15:31 -0700
From:	Nauman Rafique <nauman@...gle.com>
To:	Vivek Goyal <vgoyal@...hat.com>
Cc:	Gui Jianfeng <guijianfeng@...fujitsu.com>,
	Jens Axboe <axboe@...nel.dk>, Jeff Moyer <jmoyer@...hat.com>,
	Divyesh Shah <dpshah@...gle.com>,
	Corrado Zoccolo <czoccolo@...il.com>,
	linux kernel mailing list <linux-kernel@...r.kernel.org>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>
Subject: Re: [RFC] [PATCH] cfq-iosched: add cfq group hierarchical scheduling support

On Wed, Sep 1, 2010 at 10:10 AM, Vivek Goyal <vgoyal@...hat.com> wrote:
> On Wed, Sep 01, 2010 at 08:49:26AM -0700, Nauman Rafique wrote:
>> On Wed, Sep 1, 2010 at 1:50 AM, Gui Jianfeng <guijianfeng@...fujitsu.com> wrote:
>> > Vivek Goyal wrote:
>> >> On Tue, Aug 31, 2010 at 08:40:19AM -0700, Nauman Rafique wrote:
>> >>> On Tue, Aug 31, 2010 at 5:57 AM, Vivek Goyal <vgoyal@...hat.com> wrote:
>> >>>> On Tue, Aug 31, 2010 at 08:29:20AM +0800, Gui Jianfeng wrote:
>> >>>>> Vivek Goyal wrote:
>> >>>>>> On Mon, Aug 30, 2010 at 02:50:40PM +0800, Gui Jianfeng wrote:
>> >>>>>>> Hi All,
>> >>>>>>>
>> >>>>>>> This patch enables cfq group hierarchical scheduling.
>> >>>>>>>
>> >>>>>>> With this patch, you can create a cgroup directory deeper than level 1.
>> >>>>>>> I/O bandwidth is now distributed hierarchically. For example,
>> >>>>>>> we create cgroup directories as follows (the numbers are weights):
>> >>>>>>>
>> >>>>>>>             Root grp
>> >>>>>>>            /       \
>> >>>>>>>        grp_1(100) grp_2(400)
>> >>>>>>>        /    \
>> >>>>>>>   grp_3(200) grp_4(300)
>> >>>>>>>
>> >>>>>>> If grp_2, grp_3 and grp_4 are contending for I/O bandwidth,
>> >>>>>>> grp_2 gets 80% of the total bandwidth.
>> >>>>>>> Among the subgroups, grp_3 gets 8% (20% * 40%) and grp_4 gets 12% (20% * 60%).
>> >>>>>>>
>> >>>>>>> Design:
>> >>>>>>>   o Each cfq group has its own group service tree.
>> >>>>>>>   o Each cfq group contains a "group schedule entity" (gse) that
>> >>>>>>>     schedules on parent cfq group's service tree.
>> >>>>>>>   o Each cfq group contains a "queue schedule entity" (qse) that
>> >>>>>>>     represents all cfqqs located in this cfq group. It schedules
>> >>>>>>>     on this group's service tree. For the time being, the root group
>> >>>>>>>     qse's weight is 1000, and a subgroup qse's weight is 500.
>> >>>>>>>   o All gses and the qse that belong to the same cfq group schedule
>> >>>>>>>     on the same group service tree.
>> >>>>>> Hi Gui,
>> >>>>>>
>> >>>>>> Thanks for the patch. I have a few questions.
>> >>>>>>
>> >>>>>> - So what does the hierarchy look like w.r.t. the root group? Something as
>> >>>>>>   follows?
>> >>>>>>
>> >>>>>>
>> >>>>>>                     root
>> >>>>>>                    / | \
>> >>>>>>                  q1  q2 G1
>> >>>>>>
>> >>>>>> Assume there are two processes doing IO in the root group, q1 and q2 are
>> >>>>>> the cfqq queues for those processes, and G1 is the cgroup created by the user.
>> >>>>>>
>> >>>>>> If yes, then what algorithm do you use to schedule between q1, q2
>> >>>>>> and G1? IOW, currently we have two algorithms operating in CFQ: one for
>> >>>>>> cfqqs and the other for groups. The group algorithm does not use the logic of
>> >>>>>> cfq_slice_offset().
>> >>>>> Hi Vivek,
>> >>>>>
>> >>>>> This patch doesn't break the original scheduling logic, that is, cfqg => st => cfqq.
>> >>>>> If q1 and q2 are in the root group, I bundle q1 and q2 into one queue sched entity, and
>> >>>>> it is scheduled on the root group's service tree alongside G1, as follows:
>> >>>>>
>> >>>>>                          root group
>> >>>>>                         /         \
>> >>>>>                     qse(q1,q2)    gse(G1)
>> >>>>>
>> >>>> Ok. That's interesting. That raises another question: how should the
>> >>>> hierarchy look? IOW, how should queues and groups be treated in the
>> >>>> hierarchy?
>> >>>>
>> >>>> The CFS cpu scheduler treats queues and groups at the same level, as
>> >>>> follows.
>> >>>>
>> >>>>                        root
>> >>>>                        / | \
>> >>>>                       q1 q2 G1
>> >>>>
>> >>>> In the past I had raised this question and Jens and Corrado liked treating
>> >>>> queues and groups at the same level.
>> >>>>
>> >>>> Logically, q1, q2 and G1 are all children of root, so it makes sense to
>> >>>> treat them at the same level and not bundle q1 and q2 into a single entity
>> >>>> or group.
>> >>>>
>> >>>> One possible way forward could be this.
>> >>>>
>> >>>> - Treat queues and groups at the same level (like CFS).
>> >>>>
>> >>>> - Get rid of the cfq_slice_offset() logic. That means that without idling on,
>> >>>>  there will be no ioprio difference between cfq queues. I think that as of
>> >>>>  today that logic helps in so few situations that I would not mind
>> >>>>  getting rid of it. Just that Jens should agree to it.
>> >>>>
>> >>>> - This new scheme will break the existing semantics of the root group
>> >>>>  being at the same level as child groups. To avoid that, we can probably
>> >>>>  implement two modes (flat and hierarchical), similar to what the
>> >>>>  memory cgroup controller has done. Maybe one tunable in the root cgroup of
>> >>>>  blkio, "use_hierarchy". By default everything will be in flat mode, and
>> >>>>  if the user wants hierarchical control, he needs to set use_hierarchy in
>> >>>>  the root group.
>> >>> Vivek, maybe I am reading you wrong here. But you are first
>> >>> suggesting to add more complexity to treat queues and groups at the
>> >>> same level. Then you are suggesting adding even more complexity to fix
>> >>> the problems caused by that approach.
>> >>>
>> >>> Why do we need to treat queues and group at the same level? "CFS does
>> >>> it" is not a good argument.
>> >>
>> >> Sure, it is not a very good argument, but at the same time one would need
>> >> a very good argument for why we should do things differently.
>> >>
>> >> - If a user has mounted the cpu and blkio controllers together and the two
>> >>   controllers view the same hierarchy differently, then it is
>> >>   odd. We need a good reason why a different arrangement makes sense.
>> >
>> > Hi Vivek,
>> >
>> > Even if we mount cpu and blkio together, to me it's ok for cpu and blkio
>> > to have their own logic, since they are totally different cgroup subsystems.
>> >
>> >>
>> >> - To me, both groups and cfq queues are children of the root group and it
>> >>   makes sense to treat them as independent children instead of putting
>> >>   all the queues in one logical group which inherits the weight of
>> >>   the parent.
>> >>
>> >> - With this new scheme, I am finding it hard to visualize the hierarchy.
>> >>   How do you assign the weights to the queue entities of a group? It is more
>> >>   like an invisible group within a group. We shall have to create a new
>> >>   tunable which can specify the weight for this hidden group.
>> >
>> > For the time being, the root "qse" weight is 1000 and the others' is 500; they don't
>> > inherit the weight of the parent. I was thinking that maybe we can determine the qse
>> > weight in terms of the number of queues and the weights in this group and its subgroups.
>> >
>> > Thanks,
>> > Gui
>> >
>> >>
>> >>
>> >> So in summary I am liking the "queue at same level as group" scheme for
>> >> three reasons.
>> >>
>> >> - It is more intuitive to visualize and implement. It follows the true
>> >>   hierarchy as seen by the cgroup file system.
>> >>
>> >> - CFS has already implemented this scheme. So we need a strong argument
>> >>   to justify why we should not follow the same thing. Especially for
>> >>   the case where the user has co-mounted the cpu and blkio controllers.
>> >>
>> >> - It can achieve the same goal as the "hidden group" proposal just by
>> >>   creating a cgroup explicitly and moving all threads into that group.
>> >>
>> >> Why do you think that the "hidden group" proposal is better than "treating
>> >> queues at the same level as groups"?
>>
>> There are multiple reasons why the "hidden group" proposal is the better approach.
>>
>> - A "hidden group" would allow us to keep scheduling queues using the
>> CFQ queue scheduling logic, and it does not require any major changes in
>> CFQ. Aren't we already using that approach to deal with queues in the
>> root group?
>
> Currently we are operating in flat mode, where all the groups are at the
> same level (irrespective of their position in the cgroup hierarchy).
>
>>
>> - If queues and groups are treated at the same level, queues can end
>> up in the root cgroup, and we cannot put an upper bound on the number of
>> those queues. Those queues can consume system resources in proportion
>> to their number, causing the performance of groups to suffer. If we
>> have a "hidden group", we can configure it with a small weight, and that
>> would limit the impact these queues in the root group can have.
>
> To limit the impact of other queues in a cgroup, one can use libcgroup to
> automatically place new threads or tasks into a subgroup.
>
> I understand that the kernel doing it by default should help, though. It is
> less work in terms of configuration. But I am not sure that's a good
> argument for designing kernel functionality. Kernel functionality should be
> pretty generic.
>
> Anyway, how would you assign the weight to the hidden group? What's the
> interface for that? A new cgroup file inside each cgroup? Personally
> I think that's a slightly odd interface: every group has one hidden group
> where all the queues in that group go, and the weight of that group can be
> specified by a cgroup file.

I think picking a reasonable default weight at compile time is not
that bad an option, given that threads showing up in the "hidden
group" is an uncommon case.
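
As a rough illustration of the bound argued for above, the sketch below
compares the share left to a single child group G1 under the two schemes.
It is plain proportional-share arithmetic, not CFQ code. It assumes every
entity is continuously backlogged and that all root queues carry the same
weight. The function names and the three weights are hypothetical, picked
only for the example.

```python
def group_share_same_level(group_weight, queue_weight, nr_queues):
    """Share of G1 when every root queue competes directly with G1."""
    total = group_weight + queue_weight * nr_queues
    return group_weight / total


def group_share_hidden_group(group_weight, hidden_weight):
    """Share of G1 when all root queues sit behind one hidden entity."""
    return group_weight / (group_weight + hidden_weight)


# Hypothetical weights, chosen only for illustration.
GROUP_WEIGHT = 500    # weight of the child group G1
QUEUE_WEIGHT = 500    # weight of each queue in the root group
HIDDEN_WEIGHT = 100   # small fixed weight for the hidden group

for nr_queues in (1, 4, 16):
    same = group_share_same_level(GROUP_WEIGHT, QUEUE_WEIGHT, nr_queues)
    hidden = group_share_hidden_group(GROUP_WEIGHT, HIDDEN_WEIGHT)
    print(f"{nr_queues:2d} root queues: same level {same:.1%}, "
          f"hidden group {hidden:.1%}")
```

With queues at the same level, G1's share shrinks as root queues are
added. With a fixed hidden-group weight it does not depend on their number.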

>
> But anyway, I am not tied to either approach. I am just trying to
> make sure that we have put enough thought into it, as changing it later
> will be hard.
>
> Vivek
>
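
For reference, the split quoted in the example at the top of the thread
(grp_1=100, grp_2=400, grp_3=200, grp_4=300) can be reproduced with a
short sketch. It assumes only grp_2, grp_3 and grp_4 have active queues,
so the per-group qse weights (1000 and 500) play no part. The tuple layout
and the function name are hypothetical and are not taken from the patch.

```python
def leaf_shares(node, parent_share=1.0):
    """Return {name: fraction of total bandwidth} for every leaf group.

    node is (name, weight, children). A child's share of its parent is
    its weight divided by the sum of the weights of all siblings.
    """
    name, _weight, children = node
    if not children:
        return {name: parent_share}
    total = sum(child[1] for child in children)
    shares = {}
    for child in children:
        shares.update(leaf_shares(child, parent_share * child[1] / total))
    return shares


# Hierarchy from the patch description; the root itself has no weight.
tree = ("root", None, [
    ("grp_1", 100, [
        ("grp_3", 200, []),
        ("grp_4", 300, []),
    ]),
    ("grp_2", 400, []),
])

for name, share in leaf_shares(tree).items():
    print(f"{name}: {share:.0%}")
```

This gives 8% for grp_3, 12% for grp_4 and 80% for grp_2, matching the
figures in the patch description.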