Message-Id: <201205090930.02731.ms@teamix.de>
Date: Wed, 9 May 2012 09:30:02 +0200
From: Martin Steigerwald <ms@...mix.de>
To: Daniel Pocock <daniel@...ock.com.au>
Cc: Martin Steigerwald <Martin@...htvoll.de>,
Andreas Dilger <adilger@...ger.ca>, linux-ext4@...r.kernel.org
Subject: Re: ext4, barrier, md/RAID1 and write cache
On Tuesday, 8 May 2012, Daniel Pocock wrote:
> On 08/05/12 14:55, Martin Steigerwald wrote:
> > On Tuesday, 8 May 2012, Daniel Pocock wrote:
> >> On 08/05/12 00:24, Martin Steigerwald wrote:
> >>> On Monday, 7 May 2012, Daniel Pocock wrote:
> >>>> On 07/05/12 20:59, Martin Steigerwald wrote:
> >>>>> On Monday, 7 May 2012, Daniel Pocock wrote:
> >>>>>>> Possibly the older disk is lying about doing cache flushes. The
> >>>>>>> wonderful disk manufacturers do that with commodity drives to make
> >>>>>>> their benchmark numbers look better. If you run some random IOPS
> >>>>>>> test against this disk, and it has performance much over 100 IOPS
> >>>>>>> then it is definitely not doing real cache flushes.
> >>>>>
> >>>>> […]
> >>>>>
> >>>>> I think an IOPS benchmark would be better. I.e. something like:
> >>>>>
> >>>>> /usr/share/doc/fio/examples/ssd-test
> >>>>>
> >>>>> (from flexible I/O tester debian package, also included in upstream
> >>>>> tarball of course)
> >>>>>
> >>>>> adapted to your needs.
> >>>>>
> >>>>> Maybe with different iodepth or numjobs (to simulate several threads
> >>>>> generating higher iodepths). With iodepth=1 I have seen 54 IOPS on a
> >>>>> Hitachi 5400 rpm harddisk connected via eSATA.
> >>>>>
> >>>>> Important: use direct=1 to bypass the page cache.
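For reference, a minimal job file along those lines could look like this
(just a sketch; the filename and size are placeholders to adapt):

# minimal random-write IOPS job (sketch)
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=1
runtime=60
time_based

[rand-write]
rw=randwrite
# placeholder path on the filesystem under test
filename=/mnt/test/fio-testfile
size=1g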
> >>>>
> >>>> Thanks for suggesting this tool, I've run it against the USB disk and
> >>>> an LV on my AHCI/SATA/md array
> >>>>
> >>>> Incidentally, I upgraded the Seagate firmware (model 7200.12 from CC34
> >>>> to CC49) and one of the disks went offline shortly after I brought the
> >>>> system back up. To avoid the risk that a bad drive might interfere
> >>>> with the SATA performance, I completely removed it before running any
> >>>> tests. Tomorrow I'm out to buy some enterprise grade drives, I'm
> >>>> thinking about Seagate Constellation SATA or even SAS.
> >>>>
> >>>> Anyway, onto the test results:
> >>>>
> >>>> USB disk (Seagate 9SD2A3-500 320GB):
> >>>>
> >>>> rand-write: (groupid=3, jobs=1): err= 0: pid=22519
> >>>>
> >>>> write: io=46680KB, bw=796512B/s, iops=194, runt= 60012msec
[…]
> >>> Please repeat the test with iodepth=1.
> >>
> >> For the USB device:
> >>
> >> rand-write: (groupid=3, jobs=1): err= 0: pid=11855
> >>
> >> write: io=49320KB, bw=841713B/s, iops=205, runt= 60001msec
[…]
> >> and for the SATA disk:
> >>
> >> rand-write: (groupid=3, jobs=1): err= 0: pid=12256
> >>
> >> write: io=28020KB, bw=478168B/s, iops=116, runt= 60005msec
[…]
> > […]
> >
> >> issued r/w: total=0/7005, short=0/0
> >>
> >> lat (msec): 4=6.31%, 10=69.54%, 20=22.68%, 50=0.63%, 100=0.76%
> >> lat (msec): 250=0.09%
> >>>
> >>> 194 IOPS appears to be highly unrealistic unless NCQ or something like
> >>> that is in use. At least if that's a 5400/7200 RPM SATA drive (didn't
> >>> check vendor information).
> >>
> >> The SATA disk does have NCQ
> >>
> >> USB disk is supposed to be 5400RPM, USB2, but reporting iops=205
> >>
> >> SATA disk is 7200 RPM, 3 Gigabit SATA, but reporting iops=116
> >>
> >> Does this suggest that the USB disk is caching data but telling Linux
> >> the data is on disk?
> >
> > Looks like it.
> >
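One way to check is to look at (and temporarily disable) the drive's write
cache and re-run the random write test - a sketch, assuming hdparm/sdparm
work for your devices (device names are placeholders):

# query the write cache setting of the SATA disk
hdparm -W /dev/sdb
# USB bridges often don't pass ATA commands through; sdparm may work instead
sdparm --get=WCE /dev/sdc
# disable the write cache, then repeat the fio random write run;
# if the USB disk's IOPS collapse, it was caching despite the flushes
hdparm -W0 /dev/sdb
sdparm --clear=WCE /dev/sdc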
> > Some older values for a 1.5 TB WD Green Disk:
> >
> > mango:~# fio -readonly -name iops -rw=randread -bs=512 -runtime=100
> > -iodepth 1 -filename /dev/sda -ioengine libaio -direct=1
> > [...] iops: (groupid=0, jobs=1): err= 0: pid=9939
> >
> > read : io=1,859KB, bw=19,031B/s, iops=37, runt=100024msec [...]
> >
> > mango:~# fio -readonly -name iops -rw=randread -bs=512 -runtime=100
> > -iodepth 32 -filename /dev/sda -ioengine libaio -direct=1
> > iops: (groupid=0, jobs=1): err= 0: pid=10304
> >
> > read : io=2,726KB, bw=27,842B/s, iops=54, runt=100257msec
> >
> > mango:~# hdparm -I /dev/sda | grep -i queue
> >
> > Queue depth: 32
> >
> > * Native Command Queueing (NCQ)
> >
> > - 1.5 TB Western Digital, WDC WD15EADS-00P8B0
> > - Pentium 4 with 2.80 GHz
> > - 4 GB RAM, 32-bit Linux
> > - Linux Kernel 2.6.36
> > - fio 1.38-1
[…]
> >> It is a gigabit network and I think that the performance of the dd
> >> command proves it is not something silly like a cable fault (I have come
> >> across such faults elsewhere though)
> >
> > What is the latency?
>
> $ ping -s 1000 192.168.1.2
> PING 192.168.1.2 (192.168.1.2) 1000(1028) bytes of data.
> 1008 bytes from 192.168.1.2: icmp_req=1 ttl=64 time=0.307 ms
> 1008 bytes from 192.168.1.2: icmp_req=2 ttl=64 time=0.341 ms
> 1008 bytes from 192.168.1.2: icmp_req=3 ttl=64 time=0.336 ms
Seems to be fine.
> >>> Anyway, 15000 RPM SAS drives should give you more IOPS than 7200 RPM
> >>> SATA drives, but SATA drives are cheaper and thus you could -
> >>> depending on RAID level - increase IOPS by just using more drives.
> >>
> >> I was thinking about the large (2TB or 3TB) 7200 RPM SAS or SATA drives
> >> in the Seagate `Constellation' enterprise drive range. I need more
> >> space anyway, and I need to replace the drive that failed, so I have to
> >> spend some money anyway - I just want to throw it in the right direction
> >> (e.g. buying a drive, or if the cheap on-board SATA controller is a
> >> bottleneck or just extremely unsophisticated, I don't mind getting a
> >> dedicated controller)
> >>
> >> For example, if I knew that the controller is simply not suitable with
> >> barriers, NFS, etc and that a $200 RAID card or even a $500 RAID card
> >> will guarantee better performance with my current kernel, I would buy
> >> that. (However, I do want to use md RAID rather than a proprietary
> >> format, so any RAID card would be in JBOD mode)
> >
> > The point is: how much of the performance will arrive at NFS? I can't
> > say yet.
>
> My impression is that the faster performance of the USB disk was a red
> herring, and the problem really is just the nature of the NFS protocol
> and the way it is stricter about server-side caching (when sync is
> enabled) and consequently it needs more iops.
Yes, that seems to be the case here. It seems to be a small block size random
I/O workload with heavy fsync() usage.
You could adapt /usr/share/doc/fio/examples/iometer-file-access-server to
benchmark such a scenario. Also, fsmark simulates such a heavy fsync() based
workload quite well. I have packaged it for Debian, but it's still in the NEW
queue. You can grab it from
http://people.teamix.net/~ms/debian/sid/
(32-bit build, but easily buildable for amd64 as well)
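For example something like this (a sketch; directory, file count and sizes
are placeholders, and I assume the binary is installed as fs_mark):

# fio: copy the example job and point it at the NFS-exported filesystem
cp /usr/share/doc/fio/examples/iometer-file-access-server iometer-test.fio
# add or adjust a "directory=/srv/nfs/test" line in iometer-test.fio, then:
fio iometer-test.fio

# fsmark: create 1000 files of 64k each; as far as I remember it fsync()s
# each file with the default settings
fs_mark -d /srv/nfs/test -n 1000 -s 65536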
> I've turned two more machines (an HP Z800 with SATA disk and a Lenovo
> X220 with SSD disk) into NFSv3 servers, repeated the same tests, and
> found similar performance on the Z800, but 20x faster on the SSD (which
> can support more IOPS)
Okay, then you want more IOPS.
> > And wait I/O is quite high.
> >
> > Thus it seems this workload can be faster with faster / more disks or a
> > RAID controller with battery (and disabling barriers / cache flushes).
>
> You mean barrier=0,data=writeback? Or just barrier=0,data=ordered?
I meant data=ordered. As mentioned by Andreas, data=journal could yield an
improvement. In that case I'd suggest putting the journal onto a different
disk, in order to avoid head seeks during writeout of journal data to its
final location.
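Roughly like this, with the filesystem unmounted - a sketch, assuming
/dev/sdc1 is a partition on the other disk and /dev/md0 holds the
filesystem:

# create an external journal device on the other disk
# (it should use the same block size as the filesystem)
mke2fs -O journal_dev /dev/sdc1
# drop the internal journal, then attach the external one
tune2fs -O ^has_journal /dev/md0
tune2fs -J device=/dev/sdc1 /dev/md0
# mount with full data journalling
mount -o data=journal /dev/md0 /srv/nfs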
> In theory that sounds good, but in practice I understand it creates some
> different problems, e.g.:
>
> - monitoring the battery, replacing it periodically
>
> - batteries only hold the charge for a few hours, so if there is a power
> outage on a Sunday and someone tries to turn on the server on Monday
> morning after the battery has died, the cache is empty and the disk is
> corrupt
Hmmm, from what I know there are NVRAM-based controllers that can hold the
cached data for several days.
> - some RAID controllers (e.g. HP SmartArray) insist on writing their
> metadata to all volumes - so you become locked in to the RAID vendor. I
> prefer to just use RAID1 or RAID10 with Linux md onto the raw disks. On
> some Adaptec controllers, `JBOD' mode allows md to access the disks
> directly, although I haven't verified that yet.
I see no reason why SoftRAID cannot be used with an NVRAM-based controller.
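For example, with the controller exporting the disks as JBOD, a plain md
RAID1 over the raw devices would be something like (device names are
placeholders):

mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb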
> I'm tempted to just put a UPS on the server and enable NFS `async' mode,
> and avoid running anything on the server that may cause a crash.
A UPS on the server won't make "async" safe. If the server crashes, you can
still lose data.
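Just to illustrate the difference, in /etc/exports it is only the sync/async
option (path and network are placeholders):

# safe: the server commits data before replying, hence the IOPS requirement
/srv/nfs 192.168.1.0/24(rw,sync,no_subtree_check)
# fast: replies before data reaches stable storage; acknowledged writes can
# be lost if the server crashes, UPS or not
/srv/nfs 192.168.1.0/24(rw,async,no_subtree_check)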
Ciao,
--
Martin Steigerwald - teamix GmbH - http://www.teamix.de
gpg: 19E3 8D42 896F D004 08AC A0CA 1E10 C593 0399 AE90