Message-ID: <4FA8063F.5080505@pocock.com.au>
Date: Mon, 07 May 2012 19:28:31 +0200
From: Daniel Pocock <daniel@...ock.com.au>
To: Andreas Dilger <adilger@...ger.ca>
CC: Martin Steigerwald <Martin@...htvoll.de>,
linux-ext4@...r.kernel.org
Subject: Re: ext4, barrier, md/RAID1 and write cache
On 07/05/12 18:54, Andreas Dilger wrote:
> On 2012-05-07, at 10:44 AM, Daniel Pocock wrote:
>
>> On 07/05/12 18:25, Martin Steigerwald wrote:
>>
>>> Am Montag, 7. Mai 2012 schrieb Daniel Pocock:
>>>
>>>> 2x SATA drive (NCQ, 32MB cache, no hardware RAID)
>>>> md RAID1
>>>> LVM
>>>> ext4
>>>>
>>>> a) If I use data=ordered,barrier=1 and `hdparm -W 1' on the drive,
>>>> I observe write performance over NFS of 1MB/sec (unpacking a
>>>> big source tarball)
>>>>
>>>> b) If I use data=writeback,barrier=0 and `hdparm -W 1' on the drive,
>>>> I observe write performance over NFS of 10MB/sec
>>>>
>>>> c) If I just use the async option on NFS, I observe up to 30MB/sec
>>>>
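(For clarity, the three configurations above correspond roughly to the
commands below; the device names and export path are only placeholders
for whatever the real setup uses:)

  # (a) ordered data journalling with barriers, drive write cache on
  mount -o data=ordered,barrier=1 /dev/vg0/export /srv/export
  hdparm -W 1 /dev/sda /dev/sdb

  # (b) writeback journalling with barriers disabled, write cache still on
  mount -o data=writeback,barrier=0 /dev/vg0/export /srv/export

  # (c) as (a) on the server, but the NFS export itself uses async,
  # e.g. a line like this in /etc/exports:
  #   /srv/export  *(rw,async,no_subtree_check)
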
> The only proper way to isolate the cause of performance problems is to test each layer separately.
>
> What is the performance running this workload against the same ext4
> filesystem locally (i.e. without NFS)? How big are the files? If
> you run some kind of low-level benchmark against the underlying MD
> RAID array, with synchronous IOPS of the average file size, what is
> the performance?
>
>
- the test file is a 5MB compressed tarball (over 100MB uncompressed),
containing many C++ files of varying sizes
- testing locally is definitely faster, but local disk writes can be
cached more aggressively than writes arriving from an NFS client, so
the numbers are not strictly comparable
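(A layer-by-layer comparison along the lines Andreas suggests could be
done with something like the dd runs below - the sizes, paths and device
names are only illustrative, and writing to the raw MD device is
destructive:)

  # synchronous 64k writes straight to the MD array (overwrites data!)
  dd if=/dev/zero of=/dev/md0 bs=65536 count=1024 oflag=dsync
  # the same workload on the local ext4 filesystem
  dd if=/dev/zero of=/srv/export/ddtest bs=65536 count=1024 oflag=dsync
  # and again from the NFS client against the mounted export
  dd if=/dev/zero of=/mnt/export/ddtest bs=65536 count=1024 oflag=dsync
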
> Do you have something like the MD RAID resync bitmaps enabled? That
> can kill performance, though it improves the rebuild time after a
> crash. Putting these bitmaps onto a small SSD, or e.g. a separate
> boot disk (if you have one) can improve performance significantly.
>
>
I've checked /proc/mdstat and it doesn't report any bitmap at all.
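(That was a simple `grep bitmap /proc/mdstat', which prints nothing
here. If a write-intent bitmap were ever wanted, it could be added or
removed with mdadm - the array name below is only a placeholder:)

  mdadm --grow /dev/md0 --bitmap=internal   # add an internal write-intent bitmap
  mdadm --grow /dev/md0 --bitmap=none       # remove it again
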
>>> c) won't harm local filesystem consistency, but should the NFS server
>>> break down, any data that the NFS clients sent to the server for
>>> writing which has not yet been committed to disk is gone.
>>>
>> Most of the access is from NFS, so (c) is not a good solution either.
>>
> Well, this behaviour is not significantly worse than applications
> writing to a local filesystem, and the node crashing and losing the
> dirty data in memory that has not been written to disk.
>
>
A lot of the documents I've seen about NFS performance suggest it is
slightly worse, though, because the applications on the client have
already received positive responses from fsync() for data that is then
lost.
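(Whether an export is really running sync or async can be confirmed on
the server with `exportfs -v', which lists the effective options for
each export; async is the flag that lets the server acknowledge writes
before they reach disk:)

  exportfs -v
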
>>>> - or must I just use option (b) but make it safer with battery-backed
>>>> write cache?
>>>>
>>> If you want performance and safety that is the best option from the
>>> ones you mentioned, if the workload is really I/O bound on the local filesystem.
>>>
>>> Of course you can try the usual tricks like noatime, remove the rsize
>>> and wsize options on the NFS client if it has a new enough kernel (they
>>> autotune to much higher than the often recommended 8192 or 32768 bytes,
>>> look at /proc/mounts), put the ext4 journal onto an extra disk to reduce
>>> head seeks, check whether enough NFS server threads are running, try a
>>> different filesystem and so on.
>>>
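(The client- and server-side values Martin mentions can be inspected
with something like the commands below; the nfsd procfs path assumes
the usual location:)

  grep nfs /proc/mounts        # on the client: the negotiated rsize/wsize
  cat /proc/fs/nfsd/threads    # on the server: number of nfsd threads
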
>> One further discovery I made: I decided to eliminate md and LVM. I had
>> enough space to create a 256MB partition on one of the disks, and
>> formatted it directly with ext4.
>>
>> Writing to that partition from the NFSv3 client:
>> - less than 500kBytes/sec (for unpacking a tarball of source code)
>> - around 50MB/sec (dd if=/dev/zero conv=fsync bs=65536)
>>
>> and I then connected an old 5400rpm USB disk to the machine, ran the
>> same test from the NFS client:
>> - 5MBytes/sec (for unpacking a tarball of source code) - 10x faster than
>> the 7200rpm SATA disk
>>
> Possibly the older disk is lying about doing cache flushes. The
> wonderful disk manufacturers do that with commodity drives to make
> their benchmark numbers look better. If you run some random IOPS
> test against this disk, and it has performance much over 100 IOPS
> then it is definitely not doing real cache flushes.
>
>
I would agree that is possible - I actually tried using hdparm and
sdparm to check the cache status, but they don't work with the USB drive.
I've tried the following directly against the raw device:
dd if=/dev/zero of=/dev/sdc1 bs=4096 count=65536 conv=fsync
29.2MB/s
and iostat reported avg 250 write/sec, avgrq-sz = 237, wkB/s = 30MB/sec
I tried a smaller write as well (just count=1024, i.e. 4MB of data) and
it also reported a low speed rather than completing at cache speed,
which suggests that the drive really is writing the data out to disk and
not just caching it.
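(A random-write IOPS test of the kind Andreas describes could be run
with something like the fio job below - the device name and runtime are
only illustrative, and the run overwrites the device. Much more than
roughly 100 IOPS from a 5400rpm disk doing synchronous 4k writes would
suggest the cache flushes are not really reaching the platters.)

  fio --name=rand-iops --filename=/dev/sdc --rw=randwrite --bs=4k \
      --direct=1 --sync=1 --ioengine=psync --runtime=30 --time_based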