Date:	Mon, 7 May 2012 10:54:45 -0600
From:	Andreas Dilger <adilger@...ger.ca>
To:	Daniel Pocock <daniel@...ock.com.au>
Cc:	Martin Steigerwald <Martin@...htvoll.de>,
	linux-ext4@...r.kernel.org
Subject: Re: ext4, barrier, md/RAID1 and write cache

On 2012-05-07, at 10:44 AM, Daniel Pocock wrote:
> On 07/05/12 18:25, Martin Steigerwald wrote:
>> Am Montag, 7. Mai 2012 schrieb Daniel Pocock:
>>> 2x SATA drive (NCQ, 32MB cache, no hardware RAID)
>>> md RAID1
>>> LVM
>>> ext4
>>> 
>>> a) If I use data=ordered,barrier=1 and `hdparm -W 1' on the drive,
>>>    I observe write performance over NFS of 1MB/sec (unpacking a
>>>    big source tarball)
>>> 
>>> b) If I use data=writeback,barrier=0 and `hdparm -W 1' on the drive,
>>>    I observe write performance over NFS of 10MB/sec
>>> 
>>> c) If I just use the async option on NFS, I observe up to 30MB/sec

The only proper way to isolate the cause of performance problems is
to test each layer separately.

What is the performance running this workload against the same ext4
filesystem locally (i.e. without NFS)?  How big are the files?  If
you run some kind of low-level benchmark against the underlying MD
RAID array, using synchronous writes of the average file size, what
is the performance?
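
For example, something along these lines would exercise synchronous
small writes at each layer with NFS out of the path (the paths, block
size and count are only placeholders; pick a block size close to your
average file size, and note that writing to a raw device or scratch
LV destroys whatever is on it):

    # synchronous small writes through ext4 locally, no NFS involved
    dd if=/dev/zero of=/mnt/export/ddtest bs=8k count=1000 oflag=dsync

    # the same against a scratch partition or LV on top of the MD array
    # (WARNING: overwrites the target device)
    dd if=/dev/zero of=/dev/vg0/scratch bs=8k count=1000 oflag=direct,dsync

If the local ext4 numbers are already down at the 1MB/sec you see over
NFS, the problem is below NFS; if they are much higher, look at the
NFS export and mount options instead.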

Do you have an MD RAID write-intent (resync) bitmap enabled?  That
can kill write performance, though it shortens the rebuild time after
a crash.  Putting the bitmap onto a small SSD, or e.g. a separate
boot disk (if you have one), can improve performance significantly.
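
To check whether a write-intent bitmap is active, and to drop it or
move it to an external file on another disk, something like the
following should work (array name and file path are only examples):

    cat /proc/mdstat                      # a "bitmap:" line means one is enabled
    mdadm --grow --bitmap=none /dev/md0   # remove the current bitmap
    mdadm --grow --bitmap=/boot/md0-bitmap /dev/md0   # re-add it on another disk

The external bitmap file must not live on the array itself, and you
can switch back to --bitmap=internal later if you prefer.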

>> c) won't harm local filesystem consistency, but if the NFS server
>> crashes, any data that the NFS clients have sent to the server for
>> writing but that has not yet been written to disk is gone.
> 
> Most of the access is from NFS, so (c) is not a good solution either.

Well, this behaviour is not significantly worse than applications
writing to a local filesystem and then the node crashing, losing the
dirty data in memory that had not yet been written to disk.

>>> - or must I just use option (b) but make it safer with battery-backed
>>> write cache?
>> 
>> If you want both performance and safety, that is the best of the
>> options you mentioned, provided the workload really is I/O bound on
>> the local filesystem.
>> 
>> Of course you can try the usual tricks: mount with noatime, drop the
>> rsize and wsize options on the NFS clients if they have a new enough
>> kernel (they autotune to much higher values than the often recommended
>> 8192 or 32768 bytes; look at /proc/mounts), put the ext4 journal onto
>> a separate disk to reduce head seeks, check whether enough NFS server
>> threads are running, try a different filesystem, and so on.
> 
> One further discovery: I decided to eliminate md and LVM.  I had
> enough space to create a 256MB partition on one of the disks and
> formatted it directly with ext4.
> 
> Writing to that partition from the NFS3 client:
> - less than 500kBytes/sec (for unpacking a tarball of source code)
> - around 50MB/sec (dd if=/dev/zero conv=fsync bs=65536)
> 
> and I then connected an old 5400rpm USB disk to the machine, ran the
> same test from the NFS client:
> - 5MBytes/sec (for unpacking a tarball of source code), i.e. 10x
> faster than the 7200rpm SATA disk
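
A rough local stand-in for that tarball-over-NFS workload, which takes
NFS out of the picture entirely, is a loop that creates many small
files and syncs each one (directory and sizes are placeholders, and
this only approximates the per-file commit behaviour of a sync NFS
export):

    mkdir -p /mnt/export/synctest
    for i in $(seq 1 500); do
        dd if=/dev/zero of=/mnt/export/synctest/f$i bs=8k count=1 conv=fsync 2>/dev/null
    done

On a drive that honours cache flushes this is limited by the flush
rate, so a few hundred files per second is already a good result; a
drive that ignores flushes will look dramatically faster.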

Possibly the older disk is lying about doing cache flushes.  The
wonderful disk manufacturers do that with commodity drives to make
their benchmark numbers look better.  If you run a random IOPS test
against this disk and it delivers much more than 100 IOPS, it is
definitely not doing real cache flushes.
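
One way to get such a number is a short fio run that issues a flush
after every random write (the file path is just a placeholder, and
writing to a file keeps the test non-destructive):

    fio --name=flushtest --filename=/mnt/usbdisk/fio.tmp --size=256M \
        --rw=randwrite --bs=4k --ioengine=psync --direct=1 --fsync=1 \
        --runtime=30 --time_based

A rotating disk that really flushes its write cache should report well
under a few hundred IOPS here; numbers in the thousands suggest the
flushes are being ignored.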

> This last test (comparing my AHCI SATA disk to the USB disk, with no
> md or LVM) makes me think it is not an NFS problem; it looks like some
> issue with barriers on this AHCI/SATA disk.


Cheers, Andreas





