Message-ID: <20121211153725.GD5580@redhat.com>
Date: Tue, 11 Dec 2012 10:37:25 -0500
From: Vivek Goyal <vgoyal@...hat.com>
To: Tejun Heo <tj@...nel.org>
Cc: Zhao Shuai <zhaoshuai@...ebsd.org>, axboe@...nel.dk,
ctalbott@...gle.com, rni@...gle.com, linux-kernel@...r.kernel.org,
cgroups@...r.kernel.org, containers@...ts.linux-foundation.org
Subject: Re: performance drop after using blkcg
On Tue, Dec 11, 2012 at 07:14:12AM -0800, Tejun Heo wrote:
> Hello, Vivek.
>
> On Tue, Dec 11, 2012 at 10:02:34AM -0500, Vivek Goyal wrote:
> > cfq_group_served() {
> >         if (iops_mode(cfqd))
> >                 charge = cfqq->slice_dispatch;
> >         cfqg->vdisktime += cfq_scale_slice(charge, cfqg);
> > }
> >
> > Isn't that effectively IOPS scheduling? Each group should get an IOPS
> > rate in proportion to its weight (as long as it can throw enough traffic
> > at the device to keep it busy). If not, can you please give more details
> > about your proposal.
>
> The problem is that we lose a lot of isolation w/o idling between
> queues or groups. This is because we switch between slices, and while
> a slice is in progress only IOs belonging to that slice can be issued.
> I.e. higher priority cfqgs / cfqqs, after dispatching the IOs they have
> ready, lose their slice immediately. The lower priority slice takes over,
> and when the higher priority ones get ready, they have to wait for the
> lower priority one before submitting the new IOs. In many cases, they
> end up not being able to generate IOs any faster than the ones in
> lower priority cfqqs/cfqgs.
>
> This is because we switch slices rather than iops.
I am not sure how any of the above problems will go away if we start
scheduling iops.
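
For reference, in iops_mode() the accounting quoted above already hands
out IOs in proportion to weight. A rough userspace sketch of that idea
(made-up weights and per-round dispatch counts, not actual cfq-iosched.c
code):

	#include <stdio.h>

	#define NR_GROUPS	2
	#define BASE_WEIGHT	500	/* analogous to a default weight */

	struct group {
		unsigned int weight;
		unsigned long long vdisktime;
		unsigned long ios_done;
	};

	/* iops-style charge: number of requests dispatched this round */
	static unsigned long long scale_charge(unsigned long charge,
					       unsigned int weight)
	{
		/* higher weight => vdisktime grows more slowly */
		return charge * BASE_WEIGHT / weight;
	}

	int main(void)
	{
		struct group grp[NR_GROUPS] = {
			{ .weight = 1000 },	/* "high priority" group */
			{ .weight =  500 },	/* "low priority" group */
		};
		unsigned long rounds, i;

		for (rounds = 0; rounds < 100000; rounds++) {
			/* serve the group with the smallest vdisktime */
			struct group *g = &grp[0];

			for (i = 1; i < NR_GROUPS; i++)
				if (grp[i].vdisktime < g->vdisktime)
					g = &grp[i];

			/* pretend the group dispatched 8 requests */
			g->ios_done += 8;
			g->vdisktime += scale_charge(8, g->weight);
		}

		for (i = 0; i < NR_GROUPS; i++)
			printf("group %lu: weight %u, IOs %lu\n",
			       i, grp[i].weight, grp[i].ios_done);
		/* Expect roughly a 2:1 IO split, matching 1000:500 weights. */
		return 0;
	}
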
> We can make cfq
> essentially switch iops by implementing very aggressive preemption but
> I really don't see much point in that.
Yes, this should be easily doable. When a queue/group is removed from
service and loses its share, just keep track of its last vdisktime. When
more IO comes in for this group, the current group is preempted if its
vdisktime is greater than that of the group being queued, and the new
group is probably queued at the front. Something along these lines is
what I mean, with hypothetical names and helpers (a sketch, not the
actual cfq-iosched.c interfaces):
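
	#include <stdbool.h>
	#include <stdio.h>

	/*
	 * Hypothetical sketch of the preemption idea above; struct and
	 * function names are illustrative, not the real cfq ones.
	 */
	struct sched_group {
		unsigned long long vdisktime;
		unsigned long long saved_vdisktime; /* snapshot on removal */
		bool on_tree;
	};

	/* called when a group runs out of requests and leaves the tree */
	static void group_del_from_service(struct sched_group *grp)
	{
		grp->saved_vdisktime = grp->vdisktime;
		grp->on_tree = false;
	}

	/* called when new IO arrives for a group that was idle */
	static bool group_should_preempt(struct sched_group *incoming,
					 struct sched_group *active)
	{
		/* restore the old position instead of jumping to min_vdisktime */
		incoming->vdisktime = incoming->saved_vdisktime;
		incoming->on_tree = true;

		/* preempt only if the active group has used more virtual time */
		return active && active->vdisktime > incoming->vdisktime;
	}

	int main(void)
	{
		struct sched_group reader = { .vdisktime = 400 };
		struct sched_group writer = { .vdisktime = 900, .on_tree = true };

		/* reader finishes its dependent IO and is removed from service */
		group_del_from_service(&reader);

		/* ... writer keeps running, reader's next IO arrives ... */
		if (group_should_preempt(&reader, &writer))
			printf("reader preempts writer and goes to the front\n");
		return 0;
	}
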
I have experimented with schemes like that but did not see very promising
results. Assume the device supports a queue depth of 128, and there is
one dependent reader and one writer. If the reader goes away, comes back
and preempts the low-priority writer, in that small time window the writer
has already dispatched enough requests to introduce read delays (rough
numbers below). So preemption helps only so much. I am curious to know
how an iops-based scheduler solves these issues.
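
As a back-of-envelope illustration (all numbers assumed, nothing
measured), requests already sitting in the device queue cannot be
recalled by a scheduler-level preemption:

	#include <stdio.h>

	int main(void)
	{
		unsigned int queue_depth = 128;	/* writes already dispatched */
		double write_service_ms = 0.5;	/* assumed per-write service time */

		/* worst case: the returning read waits behind everything queued */
		printf("worst-case added read latency: %.1f ms\n",
		       queue_depth * write_service_ms);
		return 0;
	}
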
The only way to provide effective isolation seemed to be idling, and the
moment we idle we kill the performance. It does not matter whether we
are scheduling time or iops.
> cfq is way too heavy and
> ill-suited for high speed non-rot devices which are becoming more and
> more consistent in terms of iops they can handle.
>
> I think we need something better suited for the maturing non-rot
> devices. They're becoming very different from what cfq was built for
> and we really shouldn't be maintaining several rb trees which need
> full synchronization for each IO. We're doing way too much and it
> just isn't scalable.
I am fine with doing things differently in a different scheduler. But
what I am arguing here is that at least with CFQ we should be able to
experiment and figure out what works. In CFQ all the code is there, and
if this iops-based scheduling has merit, one should be able to quickly
experiment and demonstrate how one would do things differently.
I have not yet been able to understand what iops-based scheduling would
do differently. Will we idle there or not? If we idle, we again have
performance problems.
So doing things outside of CFQ is fine. I am only after understanding the
technical idea that will solve the problem of providing isolation as well
as fairness without losing throughput, and I have not been able to get
the hang of it yet.
Thanks
Vivek