[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <4875C45C.2010901@fastmq.com>
Date: Thu, 10 Jul 2008 10:12:12 +0200
From: Martin Sustrik <sustrik@...tmq.com>
To: Andrew Morton <akpm@...ux-foundation.org>
CC: Martin Lucina <mato@...elna.sk>, linux-kernel@...r.kernel.org
Subject: Re: Higher than expected disk write(2) latency
Hi Andrew,
>> we're getting some rather high figures for write(2) latency when testing
>> synchronous writing to disk. The test I'm running writes 2000 blocks of
>> contiguous data to a raw device, using O_DIRECT and various block sizes
>> down to a minimum of 512 bytes.
>>
>> The disk is a Seagate ST380817AS SATA connected to an Intel ICH7
>> using ata_piix. Write caching has been explicitly disabled on the
>> drive, and there is no other activity that should affect the test
>> results (all system filesystems are on a separate drive). The system is
>> running Debian etch, with a 2.6.24 kernel.
>>
>> Observed results:
>>
>> size=1024, N=2000, took=4.450788 s, thput=3 mb/s seekc=1
>> write: avg=8.388851 max=24.998846 min=8.335624 ms
>> 8 ms: 1992 cases
>> 9 ms: 2 cases
>> 10 ms: 1 cases
>> 14 ms: 1 cases
>> 16 ms: 3 cases
>> 24 ms: 1 cases
>
> stoopid question 1: are you writing to a regular file, or to /dev/sda? If
> the former then metadata fetches will introduce glitches.
Not a file, just a raw device.
> stoopid question 2: does the same effect happen with reads?
Dunno. The read is not critical for us. However, I would expect the same
behaviour (see below).
We've got a satisfying explansation of the behaviour from Roger Heflin:
"You write sector n and n+1, it takes some amount of time for that first
set of sectors to come under the head, when it does you write it and
immediately return. Immediately after that you attempt write sector
n+2 and n+3 which just a bit ago passed under the head, so you have to
wait an *ENTIRE* revolution for those sectors to again come under the
head to be written, another ~8.3ms, and you continue to repeat this with
each block being written. If the sector was randomly placed in the
rotation (ie 50% chance of the disk being off by 1/2 a rotation or
less-you would have a 4.15 ms average seek time for your test)-but the
case of sequential sync writes this leaves the sector about as far as
possible from the head (it just passed under the head)."
Now, the obvious solution was to use AIO to be able to enqueue write
requests even before the head reaches the end of the sector - thus there
would be no need for superfluous disk revolvings.
We've actually measured this scenario with kernel AIO (libaio1) and this
is what we'vew got (see attached graph).
The x axis represents individual write operations, y axis represents
time. Crosses are operations enqueue times (when write requests were
issues), circles are times of notifications (when the app was notified
that the write request was processed).
What we see is that AIO performs rather bad while we are still
enqueueing more writes (it misses right position on the disk and has to
do superfluous disk revolvings), however, once we stop enqueueing new
write request, those already in the queue are processed swiftly.
My guess (I am not a kernel hacker) would be that sync operations on the
AIO queue are slowing down the retrieval from the queue and thus we miss
the right place on the disk almost all the time. Once app stops
enqueueing new write requests there's no contention on the queue and we
are able to catch up with the speed of disk rotation.
If this is the case, the solution would be straightforward: When
dequeueing from AIO queue, dequeue *all* the requests in the queue and
place them into another non-synchronised queue. Getting an element from
a non-sync queue is matter of few nanoseconds, thus we should be able to
process it before head missis the right point on the disk. Once the
non-sync queue is empty, we get *all* the requests from the AIO queue
again. Etc.
Anyone any opinion on this matter?
Thanks.
Martin
Download attachment "aio.png" of type "image/png" (3628 bytes)
Powered by blists - more mailing lists