Message-ID: <49EE0DF1.6000502@vlnb.net>
Date:	Tue, 21 Apr 2009 22:18:25 +0400
From:	Vladislav Bolkhovitin <vst@...b.net>
To:	Wu Fengguang <wfg@...ux.intel.com>
CC:	Jens Axboe <jens.axboe@...cle.com>, Jeff Moyer <jmoyer@...hat.com>,
	"Vitaly V. Bursov" <vitalyb@...enet.dn.ua>,
	linux-kernel@...r.kernel.org, linux-nfs@...r.kernel.org,
	lukasz.jurewicz@...il.com
Subject: Re: Slow file transfer speeds with CFQ IO scheduler in some cases

Wu Fengguang, on 03/23/2009 04:42 AM wrote:
>> Here are the conclusions from the tests:
>>
>>  1. Making all IO threads work in the same IO context with CFQ (vanilla
>> RA and default RA size) brings near 100% link utilization on single
>> stream reads (100MB/s), versus about 50% (50MB/s) with deadline, i.e.
>> CFQ gives a 100% improvement over deadline. With 2 read streams CFQ has
>> an even bigger advantage: >400% (23MB/s vs 5MB/s).
> 
> The ideal 2-stream throughput should be >60MB/s, so I guess there is
> still room for improvement over CFQ's 23MB/s?

Yes, plenty. But, I think, the room is not in CFQ, but in readahead. With
RA 4096K we were able to get ~40MB/s; see the previous e-mail and below.

> The one fact I cannot understand is that SCST seems to be breaking up the
> client side 64K reads into server side 4K reads (above the readahead layer).
> But I remember you told me that SCST doesn't do NFS rsize style split-ups.
> Is this a bug? The 4K read size is too small to be CPU/network friendly...
> Where are the split-up and re-assembly done? On the client side or
> internally to the server?

This happens on the client's side; see the target's log in the attachment.
Here is a summary of the command data sizes that came to the server for
"dd if=/dev/sdb of=/dev/null bs=64K count=200" run on the client:

Size                     Count
4K                       11
8K                       0
16K                      0
32K                      0
64K                      0
128K                     81
256K                     8
512K                     0
1024K                    0
2048K                    0
4096K                    0

There are way too many 4K requests. Apparently, the request submission
path isn't optimal.
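
If it helps, the per-request sizes reaching the server's block layer can
be watched directly with blktrace (a generic sketch, not how the attached
log was produced; /dev/sdb as the exported device is an assumption). Each
queued (Q) event in the output ends with "sector + length", the length
being the request size in 512-byte sectors:

# blktrace -d /dev/sdb -o - | blkparse -i -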

Actually, this is another question I wanted to raise from the very
beginning.

>>  6. Unexpected result: in the case when all IO threads work in the same
>> IO context with CFQ, increasing the RA size *decreases* throughput. I
>> think this is because RA requests are performed as single big READ
>> requests, while requests coming from remote clients are much smaller in
>> size (up to 256K), so while the data read by RA is transferred to the
>> remote client at 100MB/s, the backstorage media rotates a bit, and the
>> next read request must wait out that rotation latency (~0.1ms on
>> 7200RPM). This conforms well with (3) above, where context RA has a 40%
>> advantage over vanilla RA at the default RA size, but a much smaller
>> one at higher RA sizes.
> 
> Maybe. But the readahead IOs (as shown by the trace) are _async_ ones...

That doesn't matter, because a new request from the client won't come
until all the data for the previous one has been transferred to it. And
that transfer is done at a very *finite* speed.

>> Bottom line IMHO conclusions:
>>
>> 1. Context RA should be considered, after additional examination, as a
>> replacement for the current RA algorithm in the kernel
> 
> That's my plan to push context RA to mainline. And thank you very much
> for providing and testing out a real world application for it!

You're welcome!

>> 2. It would be better to increase the default RA size to 1024K
> 
> That's a long-standing wish, to increase the default RA size. However I
> have a vague feeling that it would be better to first make the lower
> layers smarter about max_sectors_kb granularity request splitting and
> batching.

Can you elaborate more on that, please?

>> *AND* one of the following:
>>
>> 3.1. All RA requests should be split into smaller requests of up to
>> 256K, which should not be merged with any other requests
> 
> Are you referring to max_sectors_kb?

Yes

> What's your max_sectors_kb and nr_requests? Something like
> 
>         grep -r . /sys/block/sda/queue/

The defaults: 512 and 128 respectively.
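
For the record, these knobs live in sysfs and can be read individually at
runtime; a minimal sketch, assuming the exported device is sdb:

# cat /sys/block/sdb/queue/max_sectors_kb
# cat /sys/block/sdb/queue/nr_requests
# cat /sys/block/sdb/queue/read_ahead_kb

The last one is the readahead size discussed above, in kilobytes.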

>> OR
>>
>> 3.2. New RA requests should be sent before the previous one completes,
>> so the storage device doesn't rotate too far and need a full rotation
>> to serve the next request.
> 
> Linus has a mmap readahead cleanup patch that can do this. It
> basically replaces a {find_lock_page(); readahead();} sequence with
> {find_get_page(); readahead(); lock_page();}.
> 
> I'll try to push that patch into mainline.

Good!

>> I like suggestion 3.1 a lot more, since it should be simple to implement
>> and has the following 2 positive side effects:
>>
>> 1. It would allow minimizing the negative effect of a higher RA size on
>> I/O latency, by allowing CFQ to switch to requests that have been
>> waiting too long, when necessary.
>>
>> 2. It would allow better request pipelining, which is very important to
>> minimize uplink latency for synchronous requests (i.e. with only one IO
>> request at a time, where the next request is issued when the previous
>> one completes). You can see in
>> http://www.3ware.com/kb/article.aspx?id=11050 that 3ware recommends,
>> for maximum performance, setting max_sectors_kb as low as *64K* with
>> 16MB RA. This maximizes command pipelining, and it really works,
>> improving throughput by 50-100%!

It seems I should elaborate more on this. The case when the client is
remote differs fundamentally from the case when the client is local, for
which Linux is currently optimized. When the client is local, data is
delivered to it from the page cache at a virtually infinite speed. But
when the client is remote, data is delivered to it from the server's
cache at a *finite* speed. In our case this speed is about the same as
the speed of reading data into the cache from the storage. This has the
following consequences:

1. Data for any READ request is at first transferred from the storage to
the cache, then from the cache to the client. If those transfers are done
purely sequentially, without overlapping, i.e. without any readahead, the
resulting throughput T can be found from the equation: 1/T = 1/Tlocal +
1/Tremote, where Tlocal and Tremote are the throughputs of the local
(i.e. from the storage) and remote links. In the case when Tlocal ~=
Tremote, T ~= Tremote/2. Quite an unexpected result, right? ;)
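
To put numbers on it: with Tlocal = Tremote = 100MB/s, 1/T = 1/100 +
1/100 = 2/100, i.e. T = 50MB/s, half the link speed, which is just what
the ~51MB/s single-stream baseline below shows.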

2. If the data transfers on the local and remote links aren't coordinated,
it is possible that only one link is transferring data at any given time.
From (1) above you can see that the percentage of this "idle" time is the
percentage of lost throughput, i.e. to get the maximum throughput both
links should transfer data as simultaneously as possible. For our case,
when Tlocal ~= Tremote, both links should be busy all the time. Moreover,
it is possible that the local transfer has finished, but during the remote
transfer the storage media rotated too far, so the next request will have
to wait for a full rotation to finish (i.e. several ms of lost bandwidth).

Thus, to get the maximum possible throughput, we need to maximize the
simultaneous load on both the local and remote links. That can be done by
using the well known pipelining technique. For that, the client should
read the same overall amount of data at once, but the reads should be
split into smaller chunks, like 64K at a time. This approach looks like it
goes against the "conventional wisdom" that a bigger request means bigger
throughput, but in fact it doesn't, because the same (big) amount of data
is still read at a time. A bigger count of smaller requests puts a more
simultaneous load on both links participating in the data transfers. In
fact, even if the client is local, in most cases there is a second data
transfer link: it's inside the storage. This is especially true for RAID
controllers. Guess why 3ware recommends setting max_sectors_kb to 64K and
increasing RA in the above link? ;)

Of course, max_sectors_kb should be decreased only for smart devices,
which allow >1 outstanding request at a time, i.e. for all modern
SCSI/SAS/SATA/iSCSI/FC/etc. drives.

There is an objection against having too many outstanding requests at a
time: latency. But, since the overall size of all the requests remains
unchanged, this objection isn't relevant to this proposal. There is the
same latency-related objection against increasing RA, but with many small
individual RA requests it isn't relevant either.

We did some measurements to support this proposal. They were done with
the deadline scheduler only, to make the picture clearer, and with
context RA. The tests were the same as before.

--- Baseline, all default:

# dd if=/dev/sdb of=/dev/null bs=64K count=80000
          a) 51,1 MB/s
          b) 51,4 MB/s
          c) 51,1 MB/s

Run at the same time:
# while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
          a) 4,7 MB/s
          b) 4,6 MB/s
          c) 4,8 MB/s

--- Client - all default, on the server max_sectors_kb set to 64K:

# dd if=/dev/sdb of=/dev/null bs=64K count=80000
     - 100 MB/s
     - 100 MB/s
     - 102 MB/s

# while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
     - 5,2 MB/s
     - 5,3 MB/s
     - 4,2 MB/s

That is a 100% and an 8% improvement, respectively, compared to the baseline.

From the previous e-mail you can see that with 4096K RA:

# while true; do dd if=/dev/sdc  of=/dev/null bs=64K; done	
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
	a) 39,9 MB/s
	b) 39,5 MB/s
	c) 38,4 MB/s

I.e. there is a 760% improvement over the baseline.

Thus, I believe that for all devices supporting queue depths >1,
max_sectors_kb should be set by default to 64K (or maybe to 128K, but
not more), and the default RA increased to at least 1M, better 2-4M.
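
Applied by hand on a running system, that tuning would look something
like the sketch below; the device name is an assumption, and whether the
device really queues >1 request should be checked first:

# cat /sys/block/sdb/device/queue_depth
# echo 64 > /sys/block/sdb/queue/max_sectors_kb
# echo 2048 > /sys/block/sdb/queue/read_ahead_kb

The first command verifies the queue depth, the second caps individual
requests at 64K, and the third raises readahead to 2M (read_ahead_kb is
in kilobytes).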

> (Can I wish a CONFIG_PRINTK_TIME=y next time? :-)

Sure

Thanks,
Vlad


Download attachment "req_split.log.bz2" of type "application/x-bzip" (5683 bytes)
