Message-Id: <201205090930.02731.ms@teamix.de>
Date: Wed, 9 May 2012 09:30:02 +0200
From: Martin Steigerwald <ms@...mix.de>
To: Daniel Pocock <daniel@...ock.com.au>
Cc: Martin Steigerwald <Martin@...htvoll.de>,
Andreas Dilger <adilger@...ger.ca>, linux-ext4@...r.kernel.org
Subject: Re: ext4, barrier, md/RAID1 and write cache
On Tuesday, 8 May 2012, Daniel Pocock wrote:
> On 08/05/12 14:55, Martin Steigerwald wrote:
> > On Tuesday, 8 May 2012, Daniel Pocock wrote:
> >> On 08/05/12 00:24, Martin Steigerwald wrote:
> >>> On Monday, 7 May 2012, Daniel Pocock wrote:
> >>>> On 07/05/12 20:59, Martin Steigerwald wrote:
> >>>>> On Monday, 7 May 2012, Daniel Pocock wrote:
> >>>>>>> Possibly the older disk is lying about doing cache flushes. The
> >>>>>>> wonderful disk manufacturers do that with commodity drives to make
> >>>>>>> their benchmark numbers look better. If you run some random IOPS
> >>>>>>> test against this disk, and it has performance much over 100 IOPS
> >>>>>>> then it is definitely not doing real cache flushes.
> >>>>>
> >>>>> […]
> >>>>>
> >>>>> I think an IOPS benchmark would be better. I.e. something like:
> >>>>>
> >>>>> /usr/share/doc/fio/examples/ssd-test
> >>>>>
> >>>>> (from flexible I/O tester debian package, also included in upstream
> >>>>> tarball of course)
> >>>>>
> >>>>> adapted to your needs.
> >>>>>
> >>>>> Maybe with different iodepth or numjobs (to simulate several threads
> >>>>> generating higher iodepths). With iodepth=1 I have seen 54 IOPS on a
> >>>>> Hitachi 5400 rpm harddisk connected via eSATA.
> >>>>>
> >>>>> Important: use direct=1 to bypass the page cache.
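For reference, a minimal job file along those lines could look like this
(just a sketch; the filename and size are placeholders to adapt):

# minimal random-write IOPS job (sketch)
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=1
runtime=60
time_based

[rand-write]
rw=randwrite
# placeholder path on the filesystem under test
filename=/mnt/test/fio-testfile
size=1g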
> >>>>
> >>>> Thanks for suggesting this tool, I've run it against the USB disk and
> >>>> an LV on my AHCI/SATA/md array
> >>>>
> >>>> Incidentally, I upgraded the Seagate firmware (model 7200.12 from CC34
> >>>> to CC49) and one of the disks went offline shortly after I brought the
> >>>> system back up. To avoid the risk that a bad drive might interfere
> >>>> with the SATA performance, I completely removed it before running any
> >>>> tests. Tomorrow I'm out to buy some enterprise grade drives, I'm
> >>>> thinking about Seagate Constellation SATA or even SAS.
> >>>>
> >>>> Anyway, onto the test results:
> >>>>
> >>>> USB disk (Seagate 9SD2A3-500 320GB):
> >>>>
> >>>> rand-write: (groupid=3, jobs=1): err= 0: pid=22519
> >>>>
> >>>> write: io=46680KB, bw=796512B/s, iops=194, runt= 60012msec
[…]
> >>> Please repeat the test with iodepth=1.
> >>
> >> For the USB device:
> >>
> >> rand-write: (groupid=3, jobs=1): err= 0: pid=11855
> >>
> >> write: io=49320KB, bw=841713B/s, iops=205, runt= 60001msec
[…]
> >> and for the SATA disk:
> >>
> >> rand-write: (groupid=3, jobs=1): err= 0: pid=12256
> >>
> >> write: io=28020KB, bw=478168B/s, iops=116, runt= 60005msec
[…]
> > […]
> >
> >> issued r/w: total=0/7005, short=0/0
> >>
> >> lat (msec): 4=6.31%, 10=69.54%, 20=22.68%, 50=0.63%, 100=0.76%
> >> lat (msec): 250=0.09%
> >>>
> >>> 194 IOPS appears to be highly unrealistic unless NCQ or something like
> >>> that is in use. At least if that's a 5400/7200 RPM SATA drive (didn't
> >>> check vendor information).
> >>
> >> The SATA disk does have NCQ
> >>
> >> USB disk is supposed to be 5400RPM, USB2, but reporting iops=205
> >>
> >> SATA disk is 7200 RPM, 3 Gigabit SATA, but reporting iops=116
> >>
> >> Does this suggest that the USB disk is caching data but telling Linux
> >> the data is on disk?
> >
> > Looks like it.
> >
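One way to check is to look at (and temporarily disable) the drive's write
cache and re-run the random write test - a sketch, assuming hdparm/sdparm
work for your devices (device names are placeholders):

# query the write cache setting of the SATA disk
hdparm -W /dev/sdb
# USB bridges often don't pass ATA commands through; sdparm may work instead
sdparm --get=WCE /dev/sdc
# disable the write cache, then repeat the fio random write run;
# if the USB disk's IOPS collapse, it was caching despite the flushes
hdparm -W0 /dev/sdb
sdparm --clear=WCE /dev/sdc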
> > Some older values for a 1.5 TB WD Green Disk:
> >
> > mango:~# fio -readonly -name iops -rw=randread -bs=512 -runtime=100
> > -iodepth 1 -filename /dev/sda -ioengine libaio -direct=1
> > [...] iops: (groupid=0, jobs=1): err= 0: pid=9939
> >
> > read : io=1,859KB, bw=19,031B/s, iops=37, runt=100024msec [...]
> >
> > mango:~# fio -readonly -name iops -rw=randread -bs=512 -runtime=100
> > -iodepth 32 -filename /dev/sda -ioengine libaio -direct=1
> > iops: (groupid=0, jobs=1): err= 0: pid=10304
> >
> > read : io=2,726KB, bw=27,842B/s, iops=54, runt=100257msec
> >
> > mango:~# hdparm -I /dev/sda | grep -i queue
> >
> > Queue depth: 32
> >
> > * Native Command Queueing (NCQ)
> >
> > - 1.5 TB Western Digital, WDC WD15EADS-00P8B0
> > - Pentium 4 with 2.80 GHz
> > - 4 GB RAM, 32-bit Linux
> > - Linux Kernel 2.6.36
> > - fio 1.38-1
[…]
> >> It is a gigabit network and I think that the performance of the dd
> >> command proves it is not something silly like a cable fault (I have come
> >> across such faults elsewhere though)
> >
> > What is the latency?
>
> $ ping -s 1000 192.168.1.2
> PING 192.168.1.2 (192.168.1.2) 1000(1028) bytes of data.
> 1008 bytes from 192.168.1.2: icmp_req=1 ttl=64 time=0.307 ms
> 1008 bytes from 192.168.1.2: icmp_req=2 ttl=64 time=0.341 ms
> 1008 bytes from 192.168.1.2: icmp_req=3 ttl=64 time=0.336 ms
Seems to be fine.
> >>> Anyway, 15000 RPM SAS drives should give you more IOPS than 7200 RPM
> >>> SATA drives, but SATA drives are cheaper and thus you could -
> >>> depending on RAID level - increase IOPS by just using more drives.
> >>
> >> I was thinking about the large (2TB or 3TB) 7200 RPM SAS or SATA drives
> >> in the Seagate `Constellation' enterprise drive range. I need more
> >> space anyway, and I need to replace the drive that failed, so I have to
> >> spend some money anyway - I just want to throw it in the right direction
> >> (e.g. buying a drive, or if the cheap on-board SATA controller is a
> >> bottleneck or just extremely unsophisticated, I don't mind getting a
> >> dedicated controller)
> >>
> >> For example, if I knew that the controller is simply not suitable with
> >> barriers, NFS, etc and that a $200 RAID card or even a $500 RAID card
> >> will guarantee better performance with my current kernel, I would buy
> >> that. (However, I do want to use md RAID rather than a proprietary
> >> format, so any RAID card would be in JBOD mode)
> >
> > The point is: how much of the performance will arrive at NFS? I can't
> > say yet.
>
> My impression is that the faster performance of the USB disk was a red
> herring, and the problem really is just the nature of the NFS protocol
> and the way it is stricter about server-side caching (when sync is
> enabled) and consequently it needs more iops.
Yes, that seems to be the case here. It seems to be a small block size random
I/O workload with heavy fsync() usage.
You could adapt /usr/share/doc/fio/examples/iometer-file-access-server to
benchmark such a scenario. Also, fsmark simulates such a heavy fsync() based
workload quite well. I have packaged it for Debian, but it's still in the NEW
queue. You can grab it from
http://people.teamix.net/~ms/debian/sid/
(32-bit build, but easily buildable for amd64 as well)
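For example something like this (a sketch; directory, file count and sizes
are placeholders, and I assume the binary is installed as fs_mark):

# fio: copy the example job and point it at the NFS-exported filesystem
cp /usr/share/doc/fio/examples/iometer-file-access-server iometer-test.fio
# add or adjust a "directory=/srv/nfs/test" line in iometer-test.fio, then:
fio iometer-test.fio

# fsmark: create 1000 files of 64k each; as far as I remember it fsync()s
# each file with the default settings
fs_mark -d /srv/nfs/test -n 1000 -s 65536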
> I've turned two more machines (an HP Z800 with SATA disk and a Lenovo
> X220 with SSD disk) into NFSv3 servers, repeated the same tests, and
> found similar performance on the Z800, but 20x faster on the SSD (which
> can support more IOPS)
Okay, then you want more IOPS.
> > And wait I/O is quite high.
> >
> > Thus it seems this workload can be faster with faster / more disks or a
> > RAID controller with battery (and disabling barriers / cache flushes).
>
> You mean barrier=0,data=writeback? Or just barrier=0,data=ordered?
I meant data=ordered. As mentioned by Andreas, data=journal could yield an
improvement. In that case I'd suggest putting the journal onto a different
disk, in order to avoid head seeks during writeout of journal data to its
final location.
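Roughly like this, with the filesystem unmounted - a sketch, assuming
/dev/sdc1 is a partition on the other disk and /dev/md0 holds the
filesystem:

# create an external journal device on the other disk
# (it should use the same block size as the filesystem)
mke2fs -O journal_dev /dev/sdc1
# drop the internal journal, then attach the external one
tune2fs -O ^has_journal /dev/md0
tune2fs -J device=/dev/sdc1 /dev/md0
# mount with full data journalling
mount -o data=journal /dev/md0 /srv/nfs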
> In theory that sounds good, but in practice I understand it creates some
> different problems, e.g.:
>
> - monitoring the battery, replacing it periodically
>
> - batteries only hold the charge for a few hours, so if there is a power
> outage on a Sunday and someone tries to turn on the server on Monday
> morning after the battery has died, the cache is empty and the disk is
> corrupt
Hmmm, from what I know there are NVRAM-based controllers that can hold the
cached data for several days.
> - some RAID controllers (e.g. HP SmartArray) insist on writing their
> metadata to all volumes - so you become locked in to the RAID vendor. I
> prefer to just use RAID1 or RAID10 with Linux md onto the raw disks. On
> some Adaptec controllers, `JBOD' mode allows md to access the disks
> directly, although I haven't verified that yet.
I see no reason why SoftRAID cannot be used with an NVRAM-based controller.
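For example, with the controller exporting the disks as JBOD, a plain md
RAID1 over the raw devices would be something like (device names are
placeholders):

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb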
> I'm tempted to just put a UPS on the server and enable NFS `async' mode,
> and avoid running anything on the server that may cause a crash.
A UPS on the server won't make "async" safe. If the server crashes, you can
still lose data.
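Just to illustrate the difference, in /etc/exports it is only the sync/async
option (path and network are placeholders):

# safe: the server commits data before replying, hence the IOPS requirement
/srv/nfs 192.168.1.0/24(rw,sync,no_subtree_check)
# fast: replies before data reaches stable storage; acknowledged writes can
# be lost if the server crashes, UPS or not
/srv/nfs 192.168.1.0/24(rw,async,no_subtree_check)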
Ciao,
--
Martin Steigerwald - teamix GmbH - http://www.teamix.de
gpg: 19E3 8D42 896F D004 08AC A0CA 1E10 C593 0399 AE90