Message-ID: <4B1C5BC9.3010001@cn.fujitsu.com>
Date: Mon, 07 Dec 2009 09:35:05 +0800
From: Gui Jianfeng <guijianfeng@...fujitsu.com>
To: Vivek Goyal <vgoyal@...hat.com>
CC: linux-kernel@...r.kernel.org, jens.axboe@...cle.com,
nauman@...gle.com, dpshah@...gle.com, lizf@...fujitsu.com,
ryov@...inux.co.jp, fernando@....ntt.co.jp, s-uchida@...jp.nec.com,
taka@...inux.co.jp, jmoyer@...hat.com, righi.andrea@...il.com,
m-ikeda@...jp.nec.com, czoccolo@...il.com, Alan.Brunelle@...com
Subject: Re: Block IO Controller V4
Vivek Goyal wrote:
> On Thu, Dec 03, 2009 at 04:41:50PM +0800, Gui Jianfeng wrote:
>> Vivek Goyal wrote:
>>> On Wed, Dec 02, 2009 at 09:51:36AM +0800, Gui Jianfeng wrote:
>>>> Vivek Goyal wrote:
>>>>> Hi Jens,
>>>>>
>>>>> This is V4 of the Block IO controller patches on top of "for-2.6.33" branch
>>>>> of block tree.
>>>>>
>>>>> A consolidated patch can be found here:
>>>>>
>>>>> http://people.redhat.com/vgoyal/io-controller/blkio-controller/blkio-controller-v4.patch
>>>>>
>>>> Hi Vivek,
>>>>
>>>> It seems this version doesn't work very well for "direct (O_DIRECT) sequential read"
>>>> mode. For example, create group A and group B, assign weight 100 to group A and
>>>> weight 400 to group B, and run a "direct sequential read" workload in both groups
>>>> simultaneously. Ideally, we should see a 1:4 disk time ratio between group A and B,
>>>> but I actually see a ratio of almost 1:2. I'm looking into this issue.
>>>> BTW, V3 works well for this case.
>>> Hi Gui,
>>>
>>> In my testing of 8 fio jobs in 8 cgroups, direct sequential reads seem to
>>> be working fine.
>>>
>>> http://lkml.org/lkml/2009/12/1/367
>>>
>>> I suspect that in some cases we choose not to idle on the group, so it gets
>>> deleted from the service tree and hence we lose share. Can you have a look at
>>> the blkio.dequeue files? If there are excessive deletions, that will signify
>>> that we are losing share because we chose not to idle.
>>>
>>> If yes, please also run blktrace to see in what cases we chose not to
>>> idle.
>>>
>>> In V3, I had a stronger check to idle on the group if it is empty, using the
>>> wait_busy() function. In V4 I have removed that and am trying to wait busy
>>> on a queue by extending its slice if it has consumed its allocated slice.
>> Hi Vivek,
>>
>> I checked the blktrace output. It seems that the io group was deleted all the time,
>> because we don't have group idle any more. I pulled the wait_busy code back into
>> V4 and retested it, and the problem seems to have disappeared.
>>
>> So I suggest that we retain the wait_busy code.
>
> Hi Gui,
>
> We need to figure out why the existing code is not working on your system.
> In V4, I introduced the functionality to extend the slice by slice_idle
> so that we will arm slice idle timer and wait for new request to come in
> and then expire the queue. Following is the code to extend the slice.
>
> 	/*
> 	 * If this queue consumed its slice and this is last queue
> 	 * in the group, wait for next request before we expire
> 	 * the queue
> 	 */
> 	if (cfq_slice_used(cfqq) && cfqq->cfqg->nr_cfqq == 1) {
> 		cfqq->slice_end = jiffies + cfqd->cfq_slice_idle;
> 		cfq_mark_cfqq_wait_busy(cfqq);
> 	}
>
> One loophole I see is that I extend the slice only if the current slice has
> been used. If we are on the boundary and the slice has not been used yet, then
> I will not extend the slice. We also might not arm the timer, thinking that
> the remaining slice is less than the think time of the process, and that can lead
> to expiry of the queue. To rule out this possibility, can you remove the following
> code in arm_slice_timer() and try it again?
>
> 	/*
> 	 * If our average think time is larger than the remaining time
> 	 * slice, then don't idle. This avoids overrunning the allotted
> 	 * time slice.
> 	 */
> 	if (sample_valid(cic->ttime_samples) &&
> 	    (cfqq->slice_end - jiffies < cic->ttime_mean))
> 		return;
>
> The other possibility is that at request completion time the slice has not
> expired, hence we don't extend the slice and arm the timer. But then
> select_queue() hits, and by that time the slice has expired and we expire the
> queue. I thought this will not happen very frequently.
>
> Can you figure out what is happening on your system? Why are we not doing
> wait busy on the queue/group (new queue wait_busy and wait_busy_done
> flags) and instead expiring the queue and hence the group?
Hi Vivek,

Sorry for the late reply.

In V4, we don't have wait_busy() in select_queue(), so if there isn't any
request on this queue and no cooperator queue is available, the queue
expires immediately. We never get a chance to see that queue backlogged
again, so the group gets removed from the service tree frequently.
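
To make the point concrete, here is a minimal userspace sketch of the decision
described above. It is only a toy model, not kernel code: struct toy_queue,
enum toy_action and toy_select_action() are made-up names, slice accounting is
ignored, and the real select_queue() handles many more cases. The
wait_busy_check flag stands for a V3-style wait_busy check being present.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified queue state; NOT the kernel's struct cfq_queue. */
struct toy_queue {
	int  nr_queued;        /* requests currently pending on the queue */
	bool has_cooperator;   /* a cooperating queue is available */
	int  group_nr_queues;  /* busy queues in this queue's group */
};

enum toy_action {
	TOY_DISPATCH,   /* queue still has requests, keep serving it */
	TOY_SWITCH,     /* hand over to the cooperating queue */
	TOY_WAIT_BUSY,  /* idle briefly so the queue can get backlogged again */
	TOY_EXPIRE      /* expire the queue; an empty group leaves the service tree */
};

/*
 * Toy model of the choice for the active queue. wait_busy_check models
 * whether a V3-style wait_busy check exists in the selection path.
 */
static enum toy_action toy_select_action(const struct toy_queue *q,
					 bool wait_busy_check)
{
	assert(q != NULL);

	if (q->nr_queued > 0)
		return TOY_DISPATCH;
	if (q->has_cooperator)
		return TOY_SWITCH;
	/* Empty last queue of the group: wait only if the check exists. */
	if (wait_busy_check && q->group_nr_queues == 1)
		return TOY_WAIT_BUSY;
	return TOY_EXPIRE;
}
```

In this model, an empty queue with no cooperator that is the only queue in its
group yields TOY_EXPIRE without the check and TOY_WAIT_BUSY with it, which is
the difference I believe I am seeing between V4 and V3.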
> You can send your blktrace logs to me also. I can also try figuring out
> what is happening.
I think this is the most significant part of the blktrace output for this issue:
8,16 0 4024 0.642072068 3924 Q R 320708977 + 8 [rwio]
8,16 0 4025 0.642078523 3924 G R 320708977 + 8 [rwio]
8,16 0 4026 0.642082632 3924 I R 320708977 + 8 [rwio]
8,16 0 0 0.642084075 0 m N cfq3924S /test1 insert_request
8,16 0 0 0.642087062 0 m N cfq3924S /test1 dispatch_insert
8,16 0 0 0.642088250 0 m N cfq3924S /test1 dispatched a request
8,16 0 0 0.642089242 0 m N cfq3924S /test1 activate rq, drv=1
8,16 0 4027 0.642089573 3924 D R 320708977 + 8 [rwio]
8,16 0 0 0.642185679 0 m N cfq3924S /test1 slice expired t=0 <= I think this happens in select_queue()
8,16 0 0 0.642187132 0 m N cfq3924S /test1 sl_used=60 sect=2056
8,16 0 0 0.642189007 0 m N /test1 served: vt=276536888 min_vt=275308088
8,16 0 0 0.642190265 0 m N cfq3924S /test1 del_from_rr
8,16 0 0 0.642190941 0 m N /test1 del_from_rr group
8,16 0 0 0.642192600 0 m N cfq3925S /test2 set_active
8,16 0 0 0.642194414 0 m N cfq3925S /test2 fifo=(null)
8,16 0 0 0.642195296 0 m N cfq3925S /test2 dispatch_insert
8,16 0 0 0.642196709 0 m N cfq3925S /test2 dispatched a request
8,16 0 0 0.642197737 0 m N cfq3925S /test2 activate rq, drv=2
8,16 0 4028 0.642198102 3924 D R 324900545 + 8 [rwio]
8,16 0 4029 0.642204612 3924 U N [rwio] 2
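
To get a rough idea of how often the group is removed over a longer trace, the
messages can simply be counted. The helper below is a small userspace sketch
with a made-up name (count_trace_msgs); it assumes the trace has already been
dumped as text and that the message strings look like the ones in the excerpt
above. It does a plain substring match per line and nothing more.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Count the lines of a text trace dump that contain msg at least once,
 * e.g. msg = "del_from_rr group". Hypothetical helper, plain substring
 * matching only; it does not parse the blktrace columns.
 */
static size_t count_trace_msgs(const char *log, const char *msg)
{
	size_t count = 0;
	size_t msg_len;

	assert(log != NULL && msg != NULL);

	msg_len = strlen(msg);
	if (msg_len == 0)
		return 0;

	while (*log != '\0') {
		const char *eol = strchr(log, '\n');
		size_t len = eol ? (size_t)(eol - log) : strlen(log);

		/* Scan this line only, never across the newline. */
		for (size_t i = 0; i + msg_len <= len; i++) {
			if (memcmp(log + i, msg, msg_len) == 0) {
				count++;
				break;
			}
		}

		log += len;
		if (*log == '\n')
			log++;
	}
	return count;
}
```

Note that searching for "del_from_rr" also matches the "del_from_rr group"
lines, so the group message should be searched for with its full text.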
>
> Thanks
> Vivek
>
>
>
--
Regards
Gui Jianfeng