Message-ID: <49946BE6.1040005@vlnb.net>
Date: Thu, 12 Feb 2009 21:35:18 +0300
From: Vladislav Bolkhovitin <vst@...b.net>
To: Wu Fengguang <wfg@...ux.intel.com>,
Jens Axboe <jens.axboe@...cle.com>
CC: Jeff Moyer <jmoyer@...hat.com>,
"Vitaly V. Bursov" <vitalyb@...enet.dn.ua>,
linux-kernel@...r.kernel.org, linux-nfs@...r.kernel.org
Subject: Re: Slow file transfer speeds with CFQ IO scheduler in some cases
Wu Fengguang, on 11/28/2008 03:48 AM wrote:
>> Actually, there's one more thing which should have been mentioned. It
>> is possible that remote clients have several sequential read streams at
>> a time, together with some "noise" of random requests. A good read-ahead
>> subsystem should handle such a case by keeping big read-ahead windows for
>> the sequential streams while doing no read-ahead for the random
>> requests. And all at the same time.
>>
>> Currently on such workloads read-ahead will be completely disabled for
>> all the requests. Hence, there is a possibility here to improve
>> performance by a factor of 3-5 or even more by making the workload more linear.
>
> Are you sure? I'd expect such mixed-sequential-random pattern to be
> handled by the current readahead code pretty well: sequential ones
> will get large readahead and random ones won't get readahead at all.
No, sorry, my data was outdated. I rechecked and it works quite well now.
> Attached is the context readahead patch plus a kernel module for
> readahead tracing and accounting, which will hopefully help clarify the
> read patterns and readahead behaviors on production workloads. It is
> based on 2.6.27 for your convenience, but also applies to 2.6.28.
>
> The patch is not targeted for code review, but if anyone is interested,
> you can take a look at try_context_readahead(). This is the only newly
> introduced readahead policy; the rest is code refactoring
> and tracing facilities.
>
> The newly introduced context readahead policy is disabled by default.
> To enable it:
> echo 1 > /sys/block/sda/queue/context_readahead
> I'm not sure for now whether this parameter will be a long term one, or
> whether the context readahead policy should be enabled unconditionally.
>
> The readahead accounting stats can be viewed by
> mount -t debugfs none /sys/kernel/debug
> cat /sys/kernel/debug/readahead/stats
> The numbers can be reset by
> echo > /sys/kernel/debug/readahead/stats
>
> Here is a sample output from my desktop:
>
> % cat /sys/kernel/debug/readahead/stats
> pattern        count  sync_count  eof_count  size  async_size  actual
> none               0           0          0     0           0       0
> initial0        3009        3009       2033     5           4       2
> initial           35          35          0     5           4       3
> subsequent      1294         240        827    52          49      26
> marker           220           0        109    54          53      29
> trail              0           0          0     0           0       0
> oversize           0           0          0     0           0       0
> reverse            0           0          0     0           0       0
> stride             0           0          0     0           0       0
> thrash             0           0          0     0           0       0
> mmap            2833        2833       1379   142           0      47
> fadvise            7           7          7     0           0      40
> random          7621        7621         69     1           0       1
> all            15019       13745       4424    33           5      12
>
> The readahead/read tracing messages are disabled by default.
> To enable them:
> echo 1 > /sys/kernel/debug/readahead/trace_enable
> echo 1 > /sys/kernel/debug/readahead/read_jprobes
> They (especially the latter) will generate a lot of printk messages like:
>
> [ 828.151013] readahead-initial0(pid=4644(zsh), dev=00:10(0:10), ino=351452(whoami), req=0+1, ra=0+4-3, async=0) = 4
> [ 828.167853] readahead-mmap(pid=4644(whoami), dev=00:10(0:10), ino=351452(whoami), req=0+0, ra=0+60-0, async=0) = 3
> [ 828.195652] readahead-initial0(pid=4629(zsh), dev=00:10(0:10), ino=115569(zsh_prompt), req=0+128, ra=0+120-60, async=0) = 3
> [ 828.225081] readahead-initial0(pid=4629(zsh), dev=00:10(0:10), ino=342086(.zsh_history), req=0+128, ra=0+120-60, async=0) = 4
>
> [ 964.471450] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=351445(wc), pos=0, count=128)
> [ 964.471544] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=351445(wc), pos=64, count=448)
> [ 964.471575] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=351445(wc), pos=512, count=28)
> [ 964.472659] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=383002(ld-2.7.so), pos=0, count=128)
> [ 964.473431] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=383002(ld-2.7.so), pos=64, count=336)
> [ 964.475639] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=383010(libc-2.7.so), pos=0, count=832)
> [ 964.479037] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=196085(locale.alias), pos=0, count=524288)
> [ 964.479166] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=196085(locale.alias), pos=2586, count=524288)
>
> So please enable them only when necessary.
>
> My recommendation for the double readahead in NFS clients and NFS servers
> is to keep the client-side readahead size small and the server-side one large:
> for example, 128K-512K on the client and 1M-2M on the server (more for RAID).
> The NFS client-side readahead size is not directly tunable, but setting
> rsize to a small value does the trick.
> Currently the NFS magic is readahead_size=N*rsize. The default numbers in my
> 2.6.28 kernel are rsize=512k, N=15, readahead_size=7680k. The latter is
> obviously way too large.
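Under the readahead_size = N*rsize relation quoted above, the effect of
shrinking rsize can be sketched in shell (N=15 matches the 2.6.28 default
mentioned in the thread; the smaller rsize values are only illustrative
tuning candidates, not recommendations from the thread):

```shell
# Effective NFS client readahead under the readahead_size = N * rsize
# relation quoted above. N=15 is the 2.6.28 default mentioned in the thread.
N=15
for rsize_kb in 512 128 32; do          # 512k is the default; others are guesses
    echo "rsize=${rsize_kb}k -> client readahead=$((N * rsize_kb))k"
done

# Server side, the block device readahead can be enlarged instead, e.g.:
# blockdev --setra 4096 /dev/md0       # 4096 sectors * 512B = 2MB
```

The first iteration reproduces the 7680k figure called "obviously way too
large" above; dropping rsize to 128k brings the client readahead down to
1920k.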
Sorry for such a huge delay. There were many other activities I had to
deal with first, and I wanted to be sure I hadn't missed anything.
We didn't use NFS; we used SCST (http://scst.sourceforge.net) with the
iSCSI-SCST target driver. Its architecture is similar to NFS's: N threads
(N=5 in this case) handle IO from remote initiators (clients) arriving
over the wire via the iSCSI protocol. In addition, SCST has a patch
called export_alloc_io_context (see
http://lkml.org/lkml/2008/12/10/282), which allows the IO threads to
queue IO using a single IO context, so we could see whether context RA
can replace grouping the IO threads into a single IO context.
Unfortunately, the results are negative. We found neither any advantage
of context RA over the current RA implementation, nor any possibility for
context RA to replace grouping the IO threads into a single IO context.
Setup on the target (server) was the following. Two SATA drives were
grouped into an md RAID-0 with an average local read throughput of
~120MB/s ("dd if=/dev/md0 of=/dev/null bs=1M count=20000" outputs
"20971520000 bytes (21 GB) copied, 177.742 s, 118 MB/s"). The md device
was partitioned into 3 partitions: the first was 10% of the space at the
beginning of the device, the last was 10% of the space at the end of the
device, and the middle one occupied the remaining space between them. The
first and the last partitions were then exported to the initiator
(client), where they appeared as /dev/sdb and /dev/sdc respectively.
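The 10%/80%/10% split above can be sketched numerically (the total device
size here is an assumption for illustration; the thread does not state the
actual md0 size):

```shell
# Hypothetical sketch of the partition layout described above:
# first 10% (outer tracks, exported as sdb), middle 80% (unused),
# last 10% (inner tracks, exported as sdc).
total_mb=500000                           # assumed device size, not from the thread
p1_end=$((total_mb / 10))                 # end of first partition
p3_start=$((total_mb - total_mb / 10))    # start of last partition
echo "part1: 0 - ${p1_end} MB      (exported, /dev/sdb on initiator)"
echo "part2: ${p1_end} - ${p3_start} MB  (unused middle)"
echo "part3: ${p3_start} - ${total_mb} MB (exported, /dev/sdc on initiator)"
```

Placing the two exported partitions at opposite ends of the disk maximizes
the seek distance between the two simultaneous streams in test 3 below.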
Then four 2.6.27.12 kernels were built:
1. With all SCST patches
2. With all SCST patches, except export_alloc_io_context
3. With all SCST patches + context RA patch
4. With all SCST patches, except export_alloc_io_context, but with
context RA patch.
Memory on both the initiator and the target was limited to 512MB. The
link was 1GbE. For each of those kernels, the following tests were run:
1. dd if=/dev/sdb of=/dev/null bs=64K count=80000
2. dd if=/dev/sdc of=/dev/null bs=64K count=80000
3. while true; do dd if=/dev/sdc of=/dev/null bs=64K; done, running
simultaneously with dd if=/dev/sdb of=/dev/null bs=64K count=80000. The
results from the latter dd were recorded. This test showed how well
simultaneous reads are handled.
You can find the results in the attachment.
You can see that context RA doesn't improve anything, while grouping IO
into a single IO context provides an almost 100% improvement in
throughput. An additional interesting observation is how badly
simultaneous read IO streams are handled if they aren't grouped into the
corresponding IO contexts. In test 3 the result was as low as 4(!) MB/s.
Wu, Jens, do you have any explanation for this? Why do the inner tracks
get such a strong preference?
Another thing looks suspicious to me. If simultaneous read IO streams are
sent and they are grouped into the corresponding IO contexts, dd from sdb
achieves only 20MB/s. Considering that a single stream from it gets about
100MB/s, shouldn't that value be at least 30-35MB/s? Is it the same issue
as above, but with a smaller impact?
Thanks,
Vlad
View attachment "cfq-scheduler.txt" of type "text/plain" (14160 bytes)