Message-ID: <4FA8063F.5080505@pocock.com.au>
Date: Mon, 07 May 2012 19:28:31 +0200
From: Daniel Pocock <daniel@...ock.com.au>
To: Andreas Dilger <adilger@...ger.ca>
CC: Martin Steigerwald <Martin@...htvoll.de>,
linux-ext4@...r.kernel.org
Subject: Re: ext4, barrier, md/RAID1 and write cache
On 07/05/12 18:54, Andreas Dilger wrote:
> On 2012-05-07, at 10:44 AM, Daniel Pocock wrote:
>
>> On 07/05/12 18:25, Martin Steigerwald wrote:
>>
>>> Am Montag, 7. Mai 2012 schrieb Daniel Pocock:
>>>
>>>> 2x SATA drive (NCQ, 32MB cache, no hardware RAID)
>>>> md RAID1
>>>> LVM
>>>> ext4
>>>>
>>>> a) If I use data=ordered,barrier=1 and `hdparm -W 1' on the drive,
>>>> I observe write performance over NFS of 1MB/sec (unpacking a
>>>> big source tarball)
>>>>
>>>> b) If I use data=writeback,barrier=0 and `hdparm -W 1' on the drive,
>>>> I observe write performance over NFS of 10MB/sec
>>>>
>>>> c) If I just use the async option on NFS, I observe up to 30MB/sec
>>>>
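(For clarity, the three configurations above correspond roughly to the
commands below; the device names and export path are only placeholders
for whatever the real setup uses:)

  # (a) ordered data journalling with barriers, drive write cache on
  mount -o data=ordered,barrier=1 /dev/vg0/export /srv/export
  hdparm -W 1 /dev/sda /dev/sdb

  # (b) writeback journalling with barriers disabled, write cache still on
  mount -o data=writeback,barrier=0 /dev/vg0/export /srv/export

  # (c) as (a) on the server, but the NFS export itself uses async,
  # e.g. a line like this in /etc/exports:
  #   /srv/export  *(rw,async,no_subtree_check)
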
> The only proper way to isolate the cause of performance problems is to test each layer separately.
>
> What is the performance running this workload against the same ext4
> filesystem locally (i.e. without NFS)? How big are the files? If
> you run some kind of low-level benchmark against the underlying MD
> RAID array, with synchronous IOPS of the average file size, what is
> the performance?
>
>
- the test file is a 5MB compressed tarball (over 100MB uncompressed),
containing many C++ files of varying sizes
- testing locally is definitely faster, but local disk writes can be
cached more aggressively than writes arriving from an NFS client, so
the numbers are not strictly comparable
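(A layer-by-layer comparison along the lines Andreas suggests could be
done with something like the dd runs below - the sizes, paths and device
names are only illustrative, and writing to the raw MD device is
destructive:)

  # synchronous 64k writes straight to the MD array (overwrites data!)
  dd if=/dev/zero of=/dev/md0 bs=65536 count=1024 oflag=dsync
  # the same workload on the local ext4 filesystem
  dd if=/dev/zero of=/srv/export/ddtest bs=65536 count=1024 oflag=dsync
  # and again from the NFS client against the mounted export
  dd if=/dev/zero of=/mnt/export/ddtest bs=65536 count=1024 oflag=dsync
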
> Do you have something like the MD RAID resync bitmaps enabled? That
> can kill performance, though it improves the rebuild time after a
> crash. Putting these bitmaps onto a small SSD, or e.g. a separate
> boot disk (if you have one) can improve performance significantly.
>
>
I've checked /proc/mdstat and it doesn't report any bitmap at all.
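(That was a simple `grep bitmap /proc/mdstat', which prints nothing
here. If a write-intent bitmap were ever wanted, it could be added or
removed with mdadm - the array name below is only a placeholder:)

  mdadm --grow /dev/md0 --bitmap=internal   # add an internal write-intent bitmap
  mdadm --grow /dev/md0 --bitmap=none       # remove it again
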
>>> c) won't harm local filesystem consistency, but should the NFS server
>>> break down, any data that the NFS clients sent to the server for
>>> writing which has not yet been committed to disk is gone.
>>>
>> Most of the access is from NFS, so (c) is not a good solution either.
>>
> Well, this behaviour is not significantly worse than applications
> writing to a local filesystem, and the node crashing and losing the
> dirty data in memory that has not been written to disk.
>
>
A lot of the documents I've seen about NFS performance suggest it is
slightly worse, though, because the applications on the client have
already received positive responses from fsync() for data that is then
lost.
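(Whether an export is really running sync or async can be confirmed on
the server with `exportfs -v', which lists the effective options for
each export; async is the flag that lets the server acknowledge writes
before they reach disk:)

  exportfs -v
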
>>>> - or must I just use option (b) but make it safer with battery-backed
>>>> write cache?
>>>>
>>> If you want performance and safety that is the best option from the
>>> ones you mentioned, if the workload is really I/O bound on the local filesystem.
>>>
>>> Of course you can try the usual tricks like noatime, remove the rsize
>>> and wsize options on the NFS client if it has a new enough kernel (they
>>> autotune to much higher than the often recommended 8192 or 32768 bytes,
>>> look at /proc/mounts), put the ext4 journal onto an extra disk to reduce
>>> head seeks, check whether enough NFS server threads are running, try a
>>> different filesystem and so on.
>>>
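(The client- and server-side values Martin mentions can be inspected
with something like the commands below; the nfsd procfs path assumes
the usual location:)

  grep nfs /proc/mounts        # on the client: the negotiated rsize/wsize
  cat /proc/fs/nfsd/threads    # on the server: number of nfsd threads
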
>> One further discovery I made: I decided to eliminate md and LVM. I had
>> enough space to create a 256MB partition on one of the disks, and
>> formatted it directly with ext4.
>>
>> Writing to that partition from the NFSv3 client:
>> - less than 500kBytes/sec (for unpacking a tarball of source code)
>> - around 50MB/sec (dd if=/dev/zero conv=fsync bs=65536)
>>
>> and I then connected an old 5400rpm USB disk to the machine, ran the
>> same test from the NFS client:
>> - 5MBytes/sec (for unpacking a tarball of source code) - 10x faster than
>> the 7200rpm SATA disk
>>
> Possibly the older disk is lying about doing cache flushes. The
> wonderful disk manufacturers do that with commodity drives to make
> their benchmark numbers look better. If you run some random IOPS
> test against this disk, and it has performance much over 100 IOPS
> then it is definitely not doing real cache flushes.
>
>
I would agree that is possible - I actually tried using hdparm and
sdparm to check the cache status, but they don't work with the USB drive.
I've tried the following directly against the raw device:
dd if=/dev/zero of=/dev/sdc1 bs=4096 count=65536 conv=fsync
29.2MB/s
and iostat reported avg 250 write/sec, avgrq-sz = 237, wkB/s = 30MB/sec
I tried a smaller write as well (just count=1024, i.e. 4MB of data) and
it also reported a low speed rather than completing at cache speed,
which suggests that the drive really is writing the data out to disk and
not just caching it.
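(A random-write IOPS test of the kind Andreas describes could be run
with something like the fio job below - the device name and runtime are
only illustrative, and the run overwrites the device. Much more than
roughly 100 IOPS from a 5400rpm disk doing synchronous 4k writes would
suggest the cache flushes are not really reaching the platters.)

  fio --name=rand-iops --filename=/dev/sdc --rw=randwrite --bs=4k \
      --direct=1 --sync=1 --ioengine=psync --runtime=30 --time_based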