linux-kernel - Re: Block IO Controller V4

Message-ID: <4B1CBFA7.5090603@cn.fujitsu.com>
Date:	Mon, 07 Dec 2009 16:41:11 +0800
From:	Gui Jianfeng <guijianfeng@...fujitsu.com>
To:	Vivek Goyal <vgoyal@...hat.com>
CC:	linux-kernel@...r.kernel.org, jens.axboe@...cle.com,
	nauman@...gle.com, dpshah@...gle.com, lizf@...fujitsu.com,
	ryov@...inux.co.jp, fernando@....ntt.co.jp, s-uchida@...jp.nec.com,
	taka@...inux.co.jp, jmoyer@...hat.com, righi.andrea@...il.com,
	m-ikeda@...jp.nec.com, czoccolo@...il.com, Alan.Brunelle@...com
Subject: Re: Block IO Controller V4

Gui Jianfeng wrote:
> Vivek Goyal wrote:
>> On Thu, Dec 03, 2009 at 04:41:50PM +0800, Gui Jianfeng wrote:
>>> Vivek Goyal wrote:
>>>> On Wed, Dec 02, 2009 at 09:51:36AM +0800, Gui Jianfeng wrote:
>>>>> Vivek Goyal wrote:
>>>>>> Hi Jens,
>>>>>>
>>>>>> This is V4 of the Block IO controller patches on top of "for-2.6.33" branch
>>>>>> of block tree.
>>>>>>
>>>>>> A consolidated patch can be found here:
>>>>>>
>>>>>> http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v4.patch
>>>>>>
>>>>> Hi Vivek,
>>>>>
>>>>> It seems this version doesn't work very well for "direct (O_DIRECT) sequential read" mode.
>>>>> For example, create group A and group B, assign weight 100 to group A and
>>>>> weight 400 to group B, and run a "direct sequential read" workload in both groups
>>>>> simultaneously. Ideally, we should see 1:4 disk time differentiation between group A and B.
>>>>> But actually, I see almost 1:2 disk time differentiation. I'm looking
>>>>> into this issue.
>>>>> BTW, V3 works well for this case.
>>>> Hi Gui,
>>>>
>>>> In my testing of 8 fio jobs in 8 cgroups, direct sequential reads seem to
>>>> be working fine.
>>>>
>>>> http://lkml.org/lkml/2009/12/1/367
>>>>
>>>> I suspect that in some cases we choose not to idle on the group and it gets
>>>> deleted from the service tree, hence we lose share. Can you have a look at
>>>> the blkio.dequeue files? If there are excessive deletions, that will signify
>>>> that we are losing share because we chose not to idle.
>>>>
>>>> If yes, please also run blktrace to see in what cases we chose not to
>>>> idle.
>>>>
>>>> In V3, I had a stronger check to idle on the group if it is empty, using the
>>>> wait_busy() function. In V4 I have removed that and am trying to wait busy
>>>> on a queue by extending its slice if it has consumed its allocated slice.
>>> Hi Vivek,
>>>
>>> I checked the blktrace output; it seems the io group was deleted all the time,
>>> because we don't have group idle any more. I pulled the wait_busy code back into
>>> V4 and retested, and the problem seems to have disappeared.
>>>
>>> So I suggest that we retain the wait_busy code.
>> Hi Gui,
>>
>> We need to figure out why the existing code is not working on your system.
>> In V4, I introduced the functionality to extend the slice by slice_idle
>> so that we will arm slice idle timer and wait for new request to come in
>> and then expire the queue. Following is the code to extend the slice.
>>
>>                 /*
>>                  * If this queue consumed its slice and this is last queue
>>                  * in the group, wait for next request before we expire
>>                  * the queue
>>                  */
>>                 if (cfq_slice_used(cfqq) && cfqq->cfqg->nr_cfqq == 1) {
>>                         cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
>>                         cfq_mark_cfqq_wait_busy(cfqq);
>>                 }
>>
>> One loophole I see is that I extend the slice only if the current slice has
>> been used. If we are on the boundary and the slice has not been used yet, then
>> I will not extend the slice. We also might not arm the timer, thinking that
>> the remaining slice is less than the think time of the process, and that can
>> lead to expiry of the queue. To rule out this possibility, can you remove the
>> following code in arm_slice_timer() and try again?
>>
>>         /*
>>          * If our average think time is larger than the remaining time
>>          * slice, then don't idle. This avoids overrunning the allotted
>>          * time slice.
>>          */
>>         if (sample_valid(cic->ttime_samples) &&
>>             (cfqq->slice_end - jiffies < cic->ttime_mean))
>>                 return;
>>
>> The other possibility is that at request completion time the slice has not
>> expired, hence we don't extend the slice and arm the timer. But then
>> select_queue() hits, and by that time the slice has expired and we expire the
>> queue. I thought this would not happen very frequently.
>>
>> Can you figure out what is happening on your system? Why are we not doing
>> wait busy on the queue/group (new queue wait_busy and wait_busy_done
>> flags) and instead expiring the queue and hence the group?
> 
> Hi Vivek,
> 
> Sorry for the late reply.
> In V4, we don't have wait_busy() in select_queue(), so if there isn't any
> request on this queue and no cooperator queue is available, this queue will
> expire immediately. We don't have a chance to get that queue backlogged
> again. So the group will get removed frequently.

  Please ignore the above.
  I confirm that the cfqq is expired because it has used up its time slice.

Thanks
Gui
