Date:	Wed, 4 Apr 2012 05:35:49 -0700
From:	Shaohua Li <shli@...nel.org>
To:	Tao Ma <tm@....ma>
Cc:	Vivek Goyal <vgoyal@...hat.com>, Tejun Heo <tj@...nel.org>,
	axboe@...nel.dk, ctalbott@...gle.com, rni@...gle.com,
	linux-kernel@...r.kernel.org, cgroups@...r.kernel.org,
	containers@...ts.linux-foundation.org
Subject: Re: IOPS based scheduler (Was: Re: [PATCH 18/21] blkcg: move
 blkio_group_conf->weight to cfq)

2012/4/3 Tao Ma <tm@....ma>:
> On 04/04/2012 12:50 AM, Vivek Goyal wrote:
>> On Wed, Apr 04, 2012 at 12:36:24AM +0800, Tao Ma wrote:
>>
>> [..]
>>>> - Can't we just set slice_idle=0 and "quantum" to some high value, say
>>>>   "64" or "128", and achieve results similar to an iops based scheduler?
>>> Yes, I should say cfq with slice_idle = 0 works well in most cases. But
>>> when it comes to blkcg on an SSD, it is really a disaster. You know, cfq
>>> has to choose between different cgroups, so even if you choose 1ms as the
>>> service time for each cgroup (actually, in my test only >2ms works
>>> reliably), the latency for some requests (which have been issued by the
>>> user but not yet submitted to the driver) is really too much for the
>>> application. I don't think there is a way to resolve it in cfq.
>>
>> Ok, so now you are saying that CFQ as such is not a problem but blkcg
>> logic in CFQ is an issue.
>>
>> What's the issue there? I think the issue there is also group idling.
>> If you set group_idle=0, that idling will be cut down and switching
>> between groups will be fast. It's a separate matter that, in the process,
>> you will most likely also lose service differentiation for most of the
>> workloads.
> No, group_idle=0 doesn't help. We don't have a problem with idling; the
> disk is kept busy by all the tasks. We just want the service to be
> proportional and the latency to stay tolerable.
>>
>>>
>>>>
>>>> In theory, the above will cut down on idling and try to provide fairness
>>>> in terms of time. I thought fairness in terms of time is the most fair.
>>>> The most common problem is that time measurement is not attributable to
>>>> an individual queue on NCQ hardware. I guess that throws time measurement
>>>> out of the window unless and until we have a better algorithm to measure
>>>> time in an NCQ environment.
>>>>
>>>> I guess then we can just replace time with the number of requests
>>>> dispatched from a process queue: allow it to dispatch requests for some
>>>> time, then schedule it out, put it back on the service tree, and charge
>>>> it according to its weight.
>>> As I have said, in this case the minimal time (1ms) multiplied by the
>>> number of groups is too much for an SSD.
>>>
>>> If we can use an iops based scheduler, we can set an iops_weight for each
>>> cgroup and switch between cgroups according to that number. Then all the
>>> applications can get a moderate response time that can be estimated.
>>
>> How are iops_weight and this switching different from CFQ's group
>> scheduling logic? I think Shaohua was talking about using similar logic.
>> What would you do fundamentally differently so that you get service
>> differentiation without idling?
> I am thinking of differentiating groups by iops, so if there are 3 groups
> (with weights 100, 200 and 300) we can let them submit 1 io, 2 ios and 3
> ios respectively in a round-robin way. With an Intel SSD, every io can be
> finished within 100us, so the maximum latency for one io is about 600us,
> still less than 1ms. But with cfq, if all the cgroups are busy, we have to
> switch between these groups on a millisecond scale, which means the maximum
> latency will be 6ms. That is terrible for some applications, since they use
> SSDs now.
Yes, with iops based scheduling we do queue switching for every request.
Doing the same thing between groups is quite straightforward. The only issue
I found is that this introduces more process context switches; that isn't a
big issue for an io bound application, but it depends. It cuts latency a lot,
which I guess is more important for web 2.0 applications.
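
For concreteness, here is a minimal userspace sketch of the iops-weighted
round-robin group dispatch described above, under the numbers Tao used
(weights 100/200/300, roughly 100us per io on the SSD). The structure and
function names are made up for illustration only and are not the actual cfq
or blk-cgroup code.

/*
 * Hypothetical sketch of iops-weighted round-robin dispatch between groups.
 * Each group gets a per-round quota proportional to its weight; the scheduler
 * services the groups in round-robin order.  With weights 100:200:300 and
 * ~100us per request, the worst-case wait within one round is roughly
 * (1 + 2 + 3) * 100us = 600us.
 */
#include <stdio.h>

struct io_group {
    const char *name;
    unsigned int weight;   /* proportional iops weight, e.g. 100/200/300 */
    unsigned int queued;   /* requests currently waiting in this group */
};

/* Requests a group may dispatch per round; assumes weights in units of 100. */
static unsigned int group_quota(const struct io_group *g)
{
    return g->weight / 100;
}

int main(void)
{
    struct io_group groups[] = {
        { "grp_a", 100, 10 },
        { "grp_b", 200, 10 },
        { "grp_c", 300, 10 },
    };
    const int ngroups = sizeof(groups) / sizeof(groups[0]);
    const unsigned int us_per_io = 100;   /* assumed SSD completion time */
    unsigned int elapsed_us = 0;
    int round, i;

    /* Two rounds of round-robin dispatch, one group at a time. */
    for (round = 0; round < 2; round++) {
        for (i = 0; i < ngroups; i++) {
            struct io_group *g = &groups[i];
            unsigned int n = group_quota(g);

            if (n > g->queued)
                n = g->queued;
            g->queued -= n;
            elapsed_us += n * us_per_io;
            printf("round %d: %s dispatches %u io(s), total elapsed %uus\n",
                   round, g->name, n, elapsed_us);
        }
    }
    return 0;
}

In a real scheduler the per-group quota would be charged as requests are
actually dispatched and completed, but the proportional round-robin and the
roughly 600us worst-case wait per round are the point.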