Message-ID: <4FA85960.6040703@pocock.com.au>
Date: Tue, 08 May 2012 01:23:12 +0200
From: Daniel Pocock <daniel@...ock.com.au>
To: Martin Steigerwald <Martin@...htvoll.de>
CC: Andreas Dilger <adilger@...ger.ca>, linux-ext4@...r.kernel.org
Subject: Re: ext4, barrier, md/RAID1 and write cache
On 08/05/12 00:24, Martin Steigerwald wrote:
> Am Montag, 7. Mai 2012 schrieb Daniel Pocock:
>
>> On 07/05/12 20:59, Martin Steigerwald wrote:
>>
>>> Am Montag, 7. Mai 2012 schrieb Daniel Pocock:
>>>
>>>>> Possibly the older disk is lying about doing cache flushes. The
>>>>> wonderful disk manufacturers do that with commodity drives to make
>>>>> their benchmark numbers look better. If you run some random IOPS
>>>>> test against this disk, and it has performance much over 100 IOPS
>>>>> then it is definitely not doing real cache flushes.
>>>>>
>>> […]
>>>
>>> I think an IOPS benchmark would be better. I.e. something like:
>>>
>>> /usr/share/doc/fio/examples/ssd-test
>>>
>>> (from flexible I/O tester debian package, also included in upstream
>>> tarball of course)
>>>
>>> adapted to your needs.
>>>
>>> Maybe with different iodepth or numjobs (to simulate several threads
>>> generating higher iodepths). With iodepth=1 I have seen 54 IOPS on a
>>> Hitachi 5400 rpm harddisk connected via eSATA.
>>>
>>> The important thing is direct=1, to bypass the page cache.
>>>
>> Thanks for suggesting this tool. I've run it against the USB disk and
>> an LV on my AHCI/SATA/md array.
>>
>> Incidentally, I upgraded the Seagate firmware (model 7200.12 from CC34
>> to CC49) and one of the disks went offline shortly after I brought the
>> system back up. To avoid the risk that a bad drive might interfere
>> with the SATA performance, I completely removed it before running any
>> tests. Tomorrow I'm going out to buy some enterprise-grade drives; I'm
>> thinking about Seagate Constellation SATA or even SAS.
>>
>> Anyway, onto the test results:
>>
>> USB disk (Seagate 9SD2A3-500 320GB):
>>
>> rand-write: (groupid=3, jobs=1): err= 0: pid=22519
>> write: io=46680KB, bw=796512B/s, iops=194, runt= 60012msec
>> slat (usec): min=13, max=25264, avg=106.02, stdev=525.18
>> clat (usec): min=993, max=103568, avg=20444.19, stdev=11622.11
>> bw (KB/s) : min= 521, max= 1224, per=100.06%, avg=777.48, stdev=97.07
>> cpu : usr=0.73%, sys=2.33%, ctx=12024, majf=0, minf=20
>> IO depths : 1=0.1%, 2=0.1%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>
> Please repeat the test with iodepth=1.
>
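Done - here is roughly the rand-write job I used, just as a sketch (the
4k block size, the size and the filename are my own placeholders; the
60 second runtime matches the output below, and direct=1 / iodepth=1
are as you suggested):

  fio --name=rand-write --rw=randwrite --bs=4k --direct=1 --iodepth=1 \
      --runtime=60 --time_based --size=1g \
      --filename=/path/to/testfile   # placeholder on the filesystem under test
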
For the USB device:
rand-write: (groupid=3, jobs=1): err= 0: pid=11855
write: io=49320KB, bw=841713B/s, iops=205, runt= 60001msec
slat (usec): min=67, max=6234, avg=112.62, stdev=136.92
clat (usec): min=684, max=97358, avg=4737.20, stdev=4824.08
bw (KB/s) : min= 588, max= 1029, per=100.46%, avg=824.74, stdev=84.47
cpu : usr=0.64%, sys=2.89%, ctx=12751, majf=0, minf=21
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w: total=0/12330, short=0/0
lat (usec): 750=0.02%, 1000=0.48%
lat (msec): 2=1.05%, 4=66.65%, 10=26.32%, 20=1.46%, 50=3.99%
lat (msec): 100=0.03%
and for the SATA disk:
rand-write: (groupid=3, jobs=1): err= 0: pid=12256
write: io=28020KB, bw=478168B/s, iops=116, runt= 60005msec
slat (usec): min=58, max=132637, avg=110.51, stdev=1623.80
clat (msec): min=2, max=206, avg= 8.44, stdev= 7.10
bw (KB/s) : min= 95, max= 566, per=100.24%, avg=467.11, stdev=97.64
cpu : usr=0.36%, sys=1.17%, ctx=7196, majf=0, minf=21
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued r/w: total=0/7005, short=0/0
lat (msec): 4=6.31%, 10=69.54%, 20=22.68%, 50=0.63%, 100=0.76%
lat (msec): 250=0.09%
> 194 IOPS appears to be highly unrealistic unless NCQ or something like
> that is in use. At least if that's a 5400/7200 RPM SATA drive (didn't
> check vendor information).
>
>
The SATA disk does have NCQ.
The USB disk is supposed to be a 5400 RPM drive on USB 2.0, yet it
reports iops=205.
The SATA disk is 7200 RPM on 3 Gb/s SATA, yet it reports only iops=116.
Does this suggest that the USB disk is caching data while telling Linux
the data is on disk?
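One way I could double-check that (just a sketch - /mnt/usb and /dev/sdX
are placeholders, and it assumes the USB bridge passes the relevant ATA
commands through) would be to force an fsync after every write and
compare runs with the write cache enabled and disabled:

  hdparm -W /dev/sdX                  # report the current write-cache setting
  fio --name=fsync-test --rw=randwrite --bs=4k --direct=1 --iodepth=1 \
      --fsync=1 --runtime=30 --time_based --size=500m \
      --filename=/mnt/usb/fio.test    # placeholder path on the USB filesystem
  hdparm -W0 /dev/sdX                 # disable the write cache, re-run the fio job
  hdparm -W1 /dev/sdX                 # re-enable the cache afterwards

If the drive really commits every flush to the platter, I'd expect a
5400 RPM disk to stay well under ~100 IOPS in a test like that.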
>> The IOPS scores look similar, but I checked carefully and I'm fairly
>> certain the disks were mounted correctly when the tests ran.
>>
>> Should I run this tool over NFS - will the results be meaningful?
>>
>> Given the need to replace a drive anyway, I'm really thinking about one
>> of the following approaches:
>> - same controller, upgrade to enterprise SATA drives
>> - buy a dedicated SAS/SATA controller, upgrade to enterprise SATA
>> drives
>> - buy a dedicated SAS/SATA controller, upgrade to SAS drives
>>
>> My HP N36L is quite small, one PCIe x16 slot, the internal drive cage
>> has an SFF-8087 (mini SAS) plug, so I'm thinking I can grab something
>> small like the Adaptec 1405 - will any of these solutions offer a
>> definite win with my NFS issues though?
>>
> First I would like to understand more closely what your NFS issues are.
> Before throwing money at the problem it's important to understand what
> the problem actually is.
>
>
When I unpack a large source tarball over NFS onto the SATA array,
iostat reports throughput to the drive of between 500 and 1000
kBytes/second.
When I do the same operation onto the USB drive over NFS, I see over
5000 kBytes/second - but it appears from the IOPS test figures that the
USB drive is cheating, so we'll ignore that.
- if I just dd to the SATA drive over NFS (with conv=fsync), I see much
faster speeds
- if I'm logged in to the server and unpack the same tarball onto the
same LV locally, the operation completes at 30 MBytes/sec
It is a gigabit network, and I think the performance of the dd command
proves it is not something silly like a cable fault (although I have
come across such faults elsewhere).
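To be concrete, the two operations I'm comparing are roughly the
following (tarball name, paths and sizes are placeholders) - many small
files plus their metadata in the first case, one streaming write in the
second:

  tar xzf some-source.tar.gz -C /mnt/nfs/scratch
  dd if=/dev/zero of=/mnt/nfs/scratch/ddtest bs=1M count=1000 conv=fsync
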
> Anyway, 15000 RPM SAS drives should give you more IOPS than 7200 RPM SATA
> drives, but SATA drives are cheaper and thus you could - depending on RAID
> level - increase IOPS by just using more drives.
>
>
I was thinking about the large (2TB or 3TB) 7200 RPM SAS or SATA drives
in the Seagate `Constellation' enterprise range. I need more space, and
I need to replace the drive that failed, so I have to spend some money
either way - I just want to throw it in the right direction (e.g. only
buying drives, or, if the cheap on-board SATA controller is a bottleneck
or just extremely unsophisticated, getting a dedicated controller as
well).
For example, if I knew that the controller simply doesn't cope well with
barriers, NFS, etc., and that a $200 or even a $500 RAID card would
guarantee better performance with my current kernel, I would buy one.
(However, I do want to use md RAID rather than a proprietary on-disk
format, so any RAID card would run in JBOD mode.)
> But still, first I'd like to understand *why* it's slow.
>
> What does
>
> iostat -x -d -m 5
> vmstat 5
>
> say when exercising the slow (and probably a faster) setup? See [1].
>
>
All the iostat output is typically like this:
Device:  rrqm/s  wrqm/s    r/s     w/s  rMB/s  wMB/s avgrq-sz avgqu-sz  await  svctm  %util
dm-23      0.00    0.00   0.20  187.60   0.00   0.81     8.89     2.02  10.79   5.07  95.20
dm-23      0.00    0.00   0.20  189.80   0.00   0.91     9.84     1.95  10.29   4.97  94.48
dm-23      0.00    0.00   0.20  228.60   0.00   1.00     8.92     1.97   8.58   4.10  93.92
dm-23      0.00    0.00   0.20  231.80   0.00   0.98     8.70     1.96   8.49   4.06  94.16
dm-23      0.00    0.00   0.20  229.20   0.00   0.94     8.40     1.92   8.39   4.10  94.08
and vmstat:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd    free   buff  cache   si   so    bi    bo   in   cs us sy id wa
...
 0  1      0 6881772 118660 576712    0    0     1  1033  720 1553  0  2 60 38
 0  1      0 6879068 120220 577892    0    0     1   918  793 1595  0  2 56 41
 0  1      0 6876208 122200 578684    0    0     1  1055  767 1731  0  2 67 31
 1  1      0 6873356 124176 579392    0    0     1  1014  742 1688  0  2 66 32
 0  1      0 6870628 126132 579904    0    0     1  1007  753 1683  0  2 66 32
and nfsstat -s -o all -l -Z5:
nfs v3 server total: 319
------------- ------------- --------
nfs v3 server getattr: 1
nfs v3 server setattr: 126
nfs v3 server access: 6
nfs v3 server write: 61
nfs v3 server create: 61
nfs v3 server mkdir: 3
nfs v3 server commit: 61
> [1]
> http://xfs.org/index.php/XFS_FAQ#Q:_What_information_should_I_include_when_reporting_a_problem.3F
>
>
I've also tested on btrfs and the performance was equally bad, so it
may not be an ext4-specific issue.
The environment is:
- Linux srv1 3.2.0-0.bpo.2-amd64 #1 SMP Mon Apr 23 08:38:01 UTC 2012
  x86_64 GNU/Linux (Debian squeeze)
- kernel NFS server, v3
- HP N36L server, onboard AHCI
- md RAID1 as a 1TB device (/dev/md2)
- /dev/md2 is a PV for LVM - no other devices attached
As mentioned before, I've tried with and without write cache.
dmesg reports that ext4 (and btrfs) seem to be happy to accept the
barrier=1 or barrier=0 setting with the drives.
dmesg and hdparm also appear to report accurate information about write
cache status.
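For reference, the barrier setting was toggled roughly like this (a
sketch - the mount point is a placeholder, and barrier=0/1 is the ext4
spelling of the option):

  mount -o remount,barrier=1 /srv/export   # barriers on (the ext4 default)
  mount -o remount,barrier=0 /srv/export   # barriers off, for comparison only
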
> (quite a lot of this should be relevant when reporting with ext4 as well)
>
> As for testing with NFS: I expect the values to drop. NFS has quite a lot
> of protocol overhead due to network roundtrips. In my basic tests NFSv4
> even more so than NFSv3. As for NFS I suggest trying the nfsiostat python
> script from newer nfs-utils. It also shows latencies.
>
I agree - but 500 kBytes/sec is just so much slower than anything I've
seen from any IO device in recent years. I don't expect to get 90% of
the performance of a local disk over NFS, but is 30-50% a reasonable
expectation?
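I'll also give nfsiostat a try as you suggest - presumably something
along these lines on the client (the mount point is a placeholder):

  nfsiostat 5 /mnt/nfs

so I can see the per-operation latencies alongside the throughput.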
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html