Message-ID: <49EE0DF1.6000502@vlnb.net>
Date: Tue, 21 Apr 2009 22:18:25 +0400
From: Vladislav Bolkhovitin <vst@...b.net>
To: Wu Fengguang <wfg@...ux.intel.com>
CC: Jens Axboe <jens.axboe@...cle.com>, Jeff Moyer <jmoyer@...hat.com>,
"Vitaly V. Bursov" <vitalyb@...enet.dn.ua>,
linux-kernel@...r.kernel.org, linux-nfs@...r.kernel.org,
lukasz.jurewicz@...il.com
Subject: Re: Slow file transfer speeds with CFQ IO scheduler in some cases
Wu Fengguang, on 03/23/2009 04:42 AM wrote:
>> Here are the conclusions from tests:
>>
>> 1. Making all IO threads work in the same IO context with CFQ (vanilla
>> RA and default RA size) brings near 100% link utilization on single
>> stream reads (100MB/s), while deadline gives about 50% (50MB/s). I.e.,
>> CFQ is a 100% improvement over deadline. With 2 read streams CFQ has an
>> even bigger advantage: >400% (23MB/s vs 5MB/s).
>
> The ideal 2-stream throughput should be >60MB/s, so I guess there is
> still room for improvement over CFQ's 23MB/s?
Yes, plenty. But, I think, not in CFQ, rather in readahead. With 4096K RA
we were able to get ~40MB/s, see the previous e-mail and below.
> The one fact I cannot understand is that SCST seems to be breaking up the
> client-side 64K reads into server-side 4K reads (above the readahead layer).
> But I remember you told me that SCST doesn't do NFS rsize style split-ups.
> Is this a bug? The 4K read size is too small to be CPU/network friendly...
> Where are the split-up and re-assembly done? On the client side or
> internal to the server?
This is on the client's side. See the target's log in the attachment.
Here is a summary of the command data sizes that came to the server for
"dd if=/dev/sdb of=/dev/null bs=64K count=200" run on the client:
Size    Count
4K         11
8K          0
16K         0
32K         0
64K         0
128K       81
256K        8
512K        0
1024K       0
2048K       0
4096K       0
There are way too many 4K requests. Apparently, the request submission
path isn't optimal.
Actually, this is another question I wanted to raise from the very
beginning.
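(In case it helps to reproduce this without target-side tracing, here is a
rough sketch of how a similar per-request size histogram can be collected
on the client with blktrace, assuming /dev/sdb is the imported device;
exact blkparse field positions may differ between versions:)

# blktrace -d /dev/sdb -o - | blkparse -i - | \
    awk '$6 == "D" { print $10 }' | sort -n | uniq -c

The "D" events are the requests actually issued to the device; sizes are
in 512-byte sectors, so 8 means 4K and 256 means 128K.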
>> 6. Unexpected result. When all IO threads work in the same IO context
>> with CFQ, increasing the RA size *decreases* throughput. I think this
>> is because RA requests are performed as single big READ requests, while
>> requests coming from remote clients are much smaller in size (up to
>> 256K), so, while the data read by RA is transferred to the remote client
>> at 100MB/s, the backing storage media rotates a bit, and the next read
>> request must wait out the rotational latency (~0.1ms on 7200RPM). This
>> conforms well with (3) above, where context RA has a 40% advantage over
>> vanilla RA at the default RA size, but a much smaller one at higher RA.
>
> Maybe. But the readahead IOs (as shown by the trace) are _async_ ones...
That doesn't matter, because a new request from the client won't come
until all data for the previous one has been transferred to it. And that
transfer is done at a very *finite* speed.
>> Bottom line IMHO conclusions:
>>
>> 1. Context RA should be considered, after additional examination, as a
>> replacement for the current RA algorithm in the kernel
>
> That's my plan to push context RA to mainline. And thank you very much
> for providing and testing out a real world application for it!
You're welcome!
>> 2. It would be better to increase the default RA size to 1024K
>
> That's a long-standing wish, to increase the default RA size. However, I
> have a vague feeling that it would be better to first make the lower
> layers smarter about max_sectors_kb-granularity request splitting and
> batching.
Can you elaborate more on that, please?
>> *AND* one of the following:
>>
>> 3.1. All RA requests should be split into smaller requests of up to
>> 256K, which should not be merged with any other request
>
> Are you referring to max_sectors_kb?
Yes
> What's your max_sectors_kb and nr_requests? Something like
>
> grep -r . /sys/block/sda/queue/
Defaults: 512 and 128, respectively.
>> OR
>>
>> 3.2. New RA requests should be sent before the previous one has
>> completed, so the storage device doesn't rotate too far and need a full
>> rotation to serve the next request.
>
> Linus has an mmap readahead cleanup patch that can do this. It
> basically replaces a {find_lock_page(); readahead();} sequence with
> {find_get_page(); readahead(); lock_page();}.
>
> I'll try to push that patch into mainline.
Good!
>> I like suggestion 3.1 a lot more, since it should be simple to implement
>> and has the following 2 positive side effects:
>>
>> 1. It would minimize the negative effect of a higher RA size on I/O
>> latency by allowing CFQ to switch to requests that have been waiting
>> too long, when necessary.
>>
>> 2. It would allow better request pipelining, which is very important for
>> minimizing uplink latency for synchronous requests (i.e. with only one IO
>> request at a time, the next request is issued when the previous one has
>> completed). You can see in http://www.3ware.com/kb/article.aspx?id=11050
>> that 3ware recommends setting max_sectors_kb as low as *64K* with 16MB RA
>> for maximum performance. That maximizes command pipelining. And this
>> suggestion really works, improving throughput by 50-100%!
It seems I should elaborate more on this. The case when the client is
remote is fundamentally different from the case when the client is local,
which is what Linux is currently optimized for. When the client is local,
data is delivered to it from the page cache at virtually infinite speed.
But when the client is remote, data is delivered to it from the server's
cache at a *finite* speed. In our case this speed is about the same as the
speed of reading data into the cache from the storage. This has the
following consequences:
1. Data for any READ request is first transferred from the storage to
the cache, then from the cache to the client. If those transfers are
done purely sequentially without overlapping, i.e. without any
readahead, the resulting throughput T can be found from the equation
1/T = 1/Tlocal + 1/Tremote, where Tlocal and Tremote are the throughputs
of the local (i.e. from the storage) and remote links. In the case when
Tlocal ~= Tremote, T ~= Tremote/2. Quite an unexpected result, right? ;)
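(Concrete example: with Tlocal ~= Tremote ~= 100MB/s, 1/T = 1/100 + 1/100,
so T ~= 50MB/s, i.e. only half of either link's bandwidth.)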
2. If data transfers on the local and remote links aren't coordinated,
it is possible that only one link is transferring data at any given time.
From (1) above you can calculate that the percentage of this "idle" time
equals the percentage of lost throughput. I.e., to get the maximum
throughput, both links should transfer data as simultaneously as possible.
In our case, when Tlocal ~= Tremote, both links should be busy all the
time. Moreover, it is possible that the local transfer has finished, but
during the remote transfer the storage media rotated too far, so the next
request will have to wait for a full rotation to complete (i.e. several ms
of lost bandwidth).
Thus, to get the maximum possible throughput, we need to maximize the
simultaneous load on both the local and remote links. This can be done
using the well-known pipelining technique. For that, the client should
still read the same amount of data at once, but those reads should be
split into smaller chunks, like 64K at a time. This approach seems to go
against the "conventional wisdom" that a bigger request means bigger
throughput, but in fact it doesn't, because the same (big) amount of data
is read at a time. A bigger count of smaller requests puts a more
simultaneous load on both links participating in the data transfer. In
fact, even if the client is local, in most cases there is a second data
transfer link: inside the storage. This is especially true for RAID
controllers. Guess why 3ware recommends setting max_sectors_kb to 64K and
increasing RA in the above link? ;)
Of course, max_sectors_kb should be decreased only for smart devices,
which allow >1 outstanding request at a time, i.e. for all modern
SCSI/SAS/SATA/iSCSI/FC/etc. drives.
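(A quick, hedged sanity check before lowering max_sectors_kb; sdb here is
just a placeholder for the backend device, and the queue_depth attribute
is only present for SCSI-class devices:)

# cat /sys/block/sdb/device/queue_depth
# cat /sys/block/sdb/queue/nr_requests

A queue_depth > 1 means the drive can really keep several of the smaller
requests in flight at once, so shrinking max_sectors_kb won't serialize it.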
There is an objection against having too many outstanding requests at a
time: latency. But, since the overall size of all requests remains
unchanged, this objection isn't relevant to this proposal. There is the
same latency-related objection against increasing RA, but with many small
individual RA requests it isn't relevant either.
We did some measurements to support this proposal. They were done only
with the deadline scheduler, to make the picture clearer, and with
context RA. The tests were the same as before.
--- Baseline, all default:
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 51.1 MB/s
b) 51.4 MB/s
c) 51.1 MB/s
Run at the same time:
# while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 4.7 MB/s
b) 4.6 MB/s
c) 4.8 MB/s
--- Client: all default, on the server max_sectors_kb set to 64K:
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 100 MB/s
b) 100 MB/s
c) 102 MB/s
Run at the same time:
# while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 5.2 MB/s
b) 5.3 MB/s
c) 4.2 MB/s
That is a 100% (single stream) and 8% (two streams) improvement compared
to the baseline.
From the previous e-mail you can see that with 4096K RA
# while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 39.9 MB/s
b) 39.5 MB/s
c) 38.4 MB/s
I.e., there is a 760% improvement over the baseline.
Thus, I believe that for all devices supporting queue depths >1,
max_sectors_kb should be set by default to 64K (or maybe to 128K, but not
more), and the default RA increased to at least 1M, better 2-4M.
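(For reference, a minimal sketch of one way to apply such settings per
device via sysfs; sdb is just an example name, and the values don't
persist across reboots:)

# echo 64 > /sys/block/sdb/queue/max_sectors_kb
# echo 4096 > /sys/block/sdb/queue/read_ahead_kb

read_ahead_kb is in KB, so 4096 = 4M; "blockdev --setra 8192 /dev/sdb"
sets the same 4M RA expressed in 512-byte sectors.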
> (Can I wish a CONFIG_PRINTK_TIME=y next time? :-)
Sure
Thanks,
Vlad
Download attachment "req_split.log.bz2" of type "application/x-bzip" (5683 bytes)