[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20081128004830.GA8874@localhost>
Date: Fri, 28 Nov 2008 08:48:30 +0800
From: Wu Fengguang <wfg@...ux.intel.com>
To: Vladislav Bolkhovitin <vst@...b.net>
Cc: Jens Axboe <jens.axboe@...cle.com>, Jeff Moyer <jmoyer@...hat.com>,
"Vitaly V. Bursov" <vitalyb@...enet.dn.ua>,
linux-kernel@...r.kernel.org, linux-nfs@...r.kernel.org
Subject: Re: Slow file transfer speeds with CFQ IO scheduler in some cases
On Thu, Nov 27, 2008 at 08:46:35PM +0300, Vladislav Bolkhovitin wrote:
>
> Wu Fengguang wrote:
>> On Tue, Nov 25, 2008 at 03:09:12PM +0300, Vladislav Bolkhovitin wrote:
>>> Vladislav Bolkhovitin wrote:
>>>> Wu Fengguang wrote:
>>>>> On Tue, Nov 25, 2008 at 02:41:47PM +0300, Vladislav Bolkhovitin wrote:
>>>>>> Wu Fengguang wrote:
>>>>>>> On Tue, Nov 25, 2008 at 01:59:53PM +0300, Vladislav Bolkhovitin wrote:
>>>>>>>> Wu Fengguang wrote:
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> //Sorry for being late.
>>>>>>>>>
>>>>>>>>> On Wed, Nov 12, 2008 at 08:02:28PM +0100, Jens Axboe wrote:
>>>>>>>>> [...]
>>>>>>>>>> I already talked about this with Jeff on irc, but I guess should post it
>>>>>>>>>> here as well.
>>>>>>>>>>
>>>>>>>>>> nfsd aside (which does seem to have some different behaviour skewing the
>>>>>>>>>> results), the original patch came about because dump(8) has a really
>>>>>>>>>> stupid design that offloads IO to a number of processes. This basically
>>>>>>>>>> makes fairly sequential IO more random with CFQ, since each process gets
>>>>>>>>>> its own io context. My feeling is that we should fix dump instead of
>>>>>>>>>> introducing a fair bit of complexity (and slowdown) in CFQ. I'm not
>>>>>>>>>> aware of any other good programs out there that would do something
>>>>>>>>>> similar, so I don't think there's a lot of merrit to spending cycles on
>>>>>>>>>> detecting cooperating processes.
>>>>>>>>>>
>>>>>>>>>> Jeff will take a look at fixing dump instead, and I may have promised
>>>>>>>>>> him that santa will bring him something nice this year if he does (since
>>>>>>>>>> I'm sure it'll be painful on the eyes).
>>>>>>>>> This could also be fixed at the VFS readahead level.
>>>>>>>>>
>>>>>>>>> In fact I've seen many kinds of interleaved accesses:
>>>>>>>>> - concurrently reading 40 files that are in fact hard links of one single file
>>>>>>>>> - a backup tool that splits a big file into 8k chunks, and serve the
>>>>>>>>> {1, 3, 5, 7, ...} chunks in one process and the {0, 2, 4, 6, ...}
>>>>>>>>> chunks in another one
>>>>>>>>> - a pool of NFSDs randomly serving some originally
>>>>>>>>> sequential read requests - now dump(8) seems to have
>>>>>>>>> some similar problem.
>>>>>>>>>
>>>>>>>>> In summary there have been all kinds of efforts on trying to
>>>>>>>>> parallelize I/O tasks, but unfortunately they can easily screw up the
>>>>>>>>> sequential pattern. It may not be easily fixable for many of them.
>>>>>>>>>
>>>>>>>>> It is however possible to detect most of these patterns at the
>>>>>>>>> readahead layer and restore sequential I/Os, before they propagate
>>>>>>>>> into the block layer and hurt performance.
>>>>>>>> I believe this would be the most effective way to go,
>>>>>>>> especially in case if data delivery path to the original
>>>>>>>> client has its own latency depended from the amount of
>>>>>>>> transferred data as it is in the case of remote NFS mount,
>>>>>>>> which does synchronous sequential reads. In this case it
>>>>>>>> is essential for performance to make both links (local to
>>>>>>>> the storage and network to the client) be always busy and
>>>>>>>> transfer data simultaneously. Since the reads are
>>>>>>>> synchronous, the only way to achieve that is perform read
>>>>>>>> ahead on the server sufficient to cover the network link
>>>>>>>> latency. Otherwise you would end up with only half of
>>>>>>>> possible throughput.
>>>>>>>>
>>>>>>>> However, from one side, server has to have a pool of
>>>>>>>> threads/processes to perform well, but, from other side,
>>>>>>>> current read ahead code doesn't detect too well that those
>>>>>>>> threads/processes are doing joint sequential read, so the
>>>>>>>> read ahead window gets smaller, hence the overall read
>>>>>>>> performance gets considerably smaller too.
>>>>>>>>
>>>>>>>>> Vitaly, if that's what you need, I can try to prepare a patch for testing out.
>>>>>>>> I can test it with SCST SCSI target sybsystem
>>>>>>>> (http://scst.sf.net). SCST needs such feature very much,
>>>>>>>> otherwise it can't get full backstorage read speed. The
>>>>>>>> maximum I can see is about ~80MB/s from ~130MB/s 15K RPM
>>>>>>>> disk over 1Gbps iSCSI link (maximum possible is ~110MB/s).
>>>>>>> Thank you very much!
>>>>>>>
>>>>>>> BTW, do you implicate that the SCSI system (or its applications) has
>>>>>>> similar behaviors that the current readahead code cannot handle well?
>>>>>> No. SCSI target subsystem is not the same as SCSI initiator
>>>>>> subsystem, which usually called simply SCSI (sub)system. SCSI
>>>>>> target is a SCSI server. It has the same amount of common with
>>>>>> SCSI initiator as there is, e.g., between Apache (HTTP server)
>>>>>> and Firefox (HTTP client).
>>>>> Got it. So the SCSI server will split&spread sequential IO of one
>>>>> single file to cooperative threads?
>>>> Yes. It has to do so, because Linux doesn't have async. cached IO
>>>> and a client can queue several tens of commands at time. Then, on
>>>> the sequential IO with 1 command at time, CPU scheduler comes to
>>>> play and spreads those commands over those threads, so read ahead
>>>> gets too small to cover the external link latency and fill both
>>>> links with data, so that uncovered latency kills throughput.
>>> Additionally, if the uncovered external link latency is too large,
>>> one more factor is getting noticeable: storage rotation latency. If
>>> the next unread sector is missed to be read at time, server has to
>>> wait a full rotation to start receiving data for the next block,
>>> which even more decreases the resulting throughput.
>
>> Thank you for the details. I've been working slowly on the idea, and
>> should be able to send you a patch in the next one or two days.
>
> Actually, there's one more thing, which should have been mentioned. It
> is possible that remote clients have several sequential read streams at
> time together with some "noise" of random requests. A good read-ahead
> subsystem should handle such case by keeping big read-ahead windows for
> the sequential streams and don't do any read ahead for the random
> requests. And all at the same time.
>
> Currently on such workloads read ahead will be completely disabled for
> all the requests. Hence, there is a possibility here to improve
> performance in 3-5 times or even more by making the workload more linear.
Are you sure? I'd expect such mixed-sequential-random pattern to be
handled by the current readahead code pretty well: sequential ones
will get large readahead and random ones won't get readahead at all.
Attached is the context readahead patch plus a kernel module for
readahead tracing and accounting, which will hopefully help clarify the
read patterns and readahead behaviors on production workloads. It is
based on 2.6.27 for your convenience, but also applies to 2.6.28.
The patch is not targeted for code review, but if anyone are interested,
you can take a look at try_context_readahead(). This is the only newly
introduced readahead policy, the other majorities are code refactor
and tracing facilities.
The newly introduced context readahead policy is disabled by default.
To enable it:
echo 1 > /sys/block/sda/queue/context_readahead
I'm not sure for now whether this parameter will be a long term one, or
whether the context readahead policy should be enabled unconditionally.
The readahead accounting stats can be viewed by
mount -t debugfs none /sys/kernel/debug
cat /sys/kernel/debug/readahead/stats
The numbers can be reset by
echo > /sys/kernel/debug/readahead/stats
Here is a sample output from my desktop:
% cat /sys/kernel/debug/readahead/stats
pattern count sync_count eof_count size async_size actual
none 0 0 0 0 0 0
initial0 3009 3009 2033 5 4 2
initial 35 35 0 5 4 3
subsequent 1294 240 827 52 49 26
marker 220 0 109 54 53 29
trail 0 0 0 0 0 0
oversize 0 0 0 0 0 0
reverse 0 0 0 0 0 0
stride 0 0 0 0 0 0
thrash 0 0 0 0 0 0
mmap 2833 2833 1379 142 0 47
fadvise 7 7 7 0 0 40
random 7621 7621 69 1 0 1
all 15019 13745 4424 33 5 12
The readahead/read tracing messages are disabled by default.
To enable them:
echo 1 > /sys/kernel/debug/readahead/trace_enable
echo 1 > /sys/kernel/debug/readahead/read_jprobes
They(especially the latter one) will generate a lot of printk messages like:
[ 828.151013] readahead-initial0(pid=4644(zsh), dev=00:10(0:10), ino=351452(whoami), req=0+1, ra=0+4-3, async=0) = 4
[ 828.167853] readahead-mmap(pid=4644(whoami), dev=00:10(0:10), ino=351452(whoami), req=0+0, ra=0+60-0, async=0) = 3
[ 828.195652] readahead-initial0(pid=4629(zsh), dev=00:10(0:10), ino=115569(zsh_prompt), req=0+128, ra=0+120-60, async=0) = 3
[ 828.225081] readahead-initial0(pid=4629(zsh), dev=00:10(0:10), ino=342086(.zsh_history), req=0+128, ra=0+120-60, async=0) = 4
[ 964.471450] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=351445(wc), pos=0, count=128)
[ 964.471544] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=351445(wc), pos=64, count=448)
[ 964.471575] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=351445(wc), pos=512, count=28)
[ 964.472659] do_generic_file_read(pid=4685(zsh), dev=00:10(0:10), ino=383002(ld-2.7.so), pos=0, count=128)
[ 964.473431] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=383002(ld-2.7.so), pos=64, count=336)
[ 964.475639] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=383010(libc-2.7.so), pos=0, count=832)
[ 964.479037] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=196085(locale.alias), pos=0, count=524288)
[ 964.479166] do_generic_file_read(pid=4685(wc), dev=00:10(0:10), ino=196085(locale.alias), pos=2586, count=524288)
So please enable them only when necessary.
My recommendation for the double readahead in NFS client and NFS servers,
is to keep client side readahead size small and the server side one large.
For example, 128K-512K/1M-2M(more for RAID). The NFS client side readahead size
is not directly tunable, but setting rsize to a small value does the trick.
Currently the NFS magic is readahead_size=N*rsize. The default numbers in my
2.6.28 kernel are rsize=512k, N=15, readahead_size=7680k. The latter is
obviously way too large.
Thanks,
Fengguang
> I guess, such functionality can be done by keeping several read-ahead
> contexts not in file handle as now, but in inode, or ever in task_struct
> similarly to io_context. Or even as a part of struct io_context. Then
> those contexts would be rotated in, e.g., round robin manner. I have
> some basic thoughts in this area and would be glad to share them, if you
> are interested.
>
> Going further, ultimately, it would be great to provide somehow a
> capability to allow to assign for each read-ahead stream own IO context,
> because on the remote side in most cases such streams would correspond
> to different programs reading data in parallel. This should allow to
> increase performance even more.
>
>> Thanks,
>> Fengguang
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to majordomo@...r.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>>
View attachment "readahead-context-plus-tracing-module-2.6.27.patch" of type "text/x-diff" (25610 bytes)
Powered by blists - more mailing lists