Date:	Fri, 24 Dec 2010 12:40:08 +0100
From:	Rogier Wolff <R.E.Wolff@...Wizard.nl>
To:	Greg Freemyer <greg.freemyer@...il.com>
Cc:	Jaap Crezee <jaap@....nl>, Jeff Moyer <jmoyer@...hat.com>,
	Rogier Wolff <R.E.Wolff@...wizard.nl>,
	Bruno Prémont <bonbons@...ux-vserver.org>,
	linux-kernel@...r.kernel.org, linux-ide@...r.kernel.org
Subject: Re: Slow disks.

On Thu, Dec 23, 2010 at 05:09:43PM -0500, Greg Freemyer wrote:
> On Thu, Dec 23, 2010 at 2:10 PM, Jaap Crezee <jaap@....nl> wrote:
> > On 12/23/10 19:51, Greg Freemyer wrote:
> >> On Thu, Dec 23, 2010 at 12:47 PM, Jeff Moyer<jmoyer@...hat.com>  wrote:
> >> I suspect a mailserver on a raid 5 with large chunksize could be a lot
> >> worse than 2x slower.  But most of the blame is just raid 5.
> >
> > Hmmm, well if this really is so.. I use raid 5 to not "spoil" the storage
> > space of one disk. I am using some other servers with raid 5 md's which
> > seem to be running just fine, even under higher load than the machine we
> > are talking about.
> >
> > Looking at the vmstat block io the typical load (both write and read) seems
> > to be less than 20 blocks per second. Will this drop the performance of the
> > array (measured by dd if=/dev/md<x> of=/dev/null bs=1M) below 3MB/sec?
> >
> 
> You clearly have problems more significant than your raid choice, but
> hopefully you will find the below informative anyway.
> 
> ====
> 
> The above is a meaningless performance tuning test for an email server,
> but assuming it was a useful test for you:
> 
> With bs=1MB you should have optimum performance with a 3-disk raid5
> and 512KB chunks.
> 
> The reason is that a full raid stripe for that holds 1MB of data
> (512K data + 512K data = 1024K data, plus a 512K parity chunk).
> 
> So the raid software should see that as a full stripe update and not
> have to read in any of the old data.
> 
> Thus at the kernel level it is just:
> 
> write data1 chunk
> write data2 chunk
> write parity chunk
> 
> All those should happen in parallel, so a raid 5 setup for 1MB writes
> is actually just about optimal!
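
(For reference, the full-stripe arithmetic being assumed above, as a rough
Python sketch -- the geometry is just the example from the quote, and real
md also needs the write to be stripe-aligned:)

  # Rough sketch of the full-stripe arithmetic (3-disk raid5, 512K chunks).
  def is_full_stripe_write(write_bytes, n_disks, chunk_bytes):
      # One stripe carries (n_disks - 1) chunks of data plus one parity chunk.
      stripe_data = (n_disks - 1) * chunk_bytes
      return write_bytes % stripe_data == 0

  # A stripe-aligned 1M write on a 3-disk array with 512K chunks covers a
  # whole stripe, so no read-modify-write would be needed.
  print(is_full_stripe_write(1024 * 1024, 3, 512 * 1024))   # True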

You are assuming that the kernel is blind and doesn't do any
readahead. I've done some tests, and even when I run dd with a
blocksize of 32k, the average request sizes hitting the disk are
about 1000k (or 1000 sectors; I don't know what units that column is
in when I run with the -k option).
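
(If you want to reproduce that measurement, here is a rough sketch that
computes the average read request size from /proc/diskstats; "sda" is just
an example device name, and the counters are cumulative since boot:)

  # Rough sketch: average read request size for one disk, from /proc/diskstats.
  # Fields after (major, minor, name): reads, reads merged, sectors read, ...
  def avg_read_request_kb(dev="sda"):
      with open("/proc/diskstats") as f:
          for line in f:
              fields = line.split()
              if fields[2] == dev:
                  reads = int(fields[3])
                  sectors = int(fields[5])
                  return sectors * 512.0 / reads / 1024 if reads else 0.0
      return 0.0

  print(avg_read_request_kb("sda"))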

So your argument that "it fits exactly when your blocksize is 1M, so
it is obvious that 512k chunks are optimal" doesn't hold water.

When the chunk size is too large, the system will be busy reading and
waiting for one disk while leaving the second (and third and ...)
disk idle, simply because readahead is finite. You want the readahead
to hit many disks at the same time, so that when you get around to
reading the data from the drives they can run at close to bus speed.

When the chunk size is too small, you'll spend too much time
splitting, say, a 1M readahead on the MD device into 16 64k chunks
for the individual drives, and then (if that works) merging them back
together again per drive to avoid the overhead of too many commands
to each drive (for a 4-drive raid5, the first and fourth block are
likely to be consecutive on the same drive...). Hmm, but those would
have to go into different spots in a buffer, so it might simply have
to incur that extra overhead....
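
(To illustrate the splitting, a much simplified model -- it ignores the
rotating parity layout that md's raid5 actually uses and just round-robins
data chunks over the data drives:)

  # Simplified model of how a readahead on the MD device splits across members.
  # Ignores parity rotation; returns (data_drive, offset_on_drive, length) pieces.
  def split_readahead(offset, length, n_data_disks, chunk):
      pieces = []
      while length > 0:
          chunk_no = offset // chunk
          drive = chunk_no % n_data_disks
          in_chunk = offset % chunk
          n = min(chunk - in_chunk, length)
          pieces.append((drive, (chunk_no // n_data_disks) * chunk + in_chunk, n))
          offset += n
          length -= n
      return pieces

  # A 1M readahead with 64k chunks over 3 data drives (a 4-drive raid5):
  # 16 pieces, and chunks 0 and 3 do land back-to-back on the same drive.
  for piece in split_readahead(0, 1024 * 1024, 3, 64 * 1024):
      print(piece)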

> Anything smaller than a 1 stripe write is where the issues occur,
> because then you have the read-modify-write cycles.

Yes. But still they shouldn't be as heavy as we are seeing. Besides
doing the "big searches" on my 8T array, I also sometimes write "lots
of small files". I'll see how many I can manage on that server....

	Roger. 

> 
> (And yes, the linux mdraid layer recognizes full stripe writes and
> thus skips the read-modify portion of the process.)
> 
> >> ie.
> >> write 4K from userspace
> >>
> >> Kernel
> >> Read old primary data, wait for data to actually arrive
> >> Read old parity data, wait again
> >> modify both for new data
> >> write primary data to drive queue
> >> write parity data to drive queue
> >
> > What if I (theoretically) change the chunksize to 4kb? (I can try that in
> > the new server...).
> 
> 4KB random writes is really just too small for an efficient raid 5
> setup.  Since that's your real workload, I'd get away from raid 5.
> 
> If you really want to optimize a 3-disk raid-5 for random 4K writes,
> you need to drop down to 2K chunks which gives you a 4K stripe.  I've
> never seen chunks that small used, so I have no idea how it would
> work.
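
(Back-of-the-envelope I/O counts for a single 4K random write, as a rough
sketch -- it ignores the stripe cache and alignment details:)

  # Rough per-write I/O counts on raid5 (ignores caching and alignment).
  def raid5_write_ios(write_kb, n_disks, chunk_kb):
      stripe_data_kb = (n_disks - 1) * chunk_kb
      if write_kb >= stripe_data_kb:
          # Full-stripe write: just write all data chunks plus parity.
          return {"reads": 0, "writes": n_disks}
      # Read-modify-write: read old data and parity, write new data and parity.
      return {"reads": 2, "writes": 2}

  print(raid5_write_ios(4, 3, 512))   # 512K chunks: {'reads': 2, 'writes': 2}
  print(raid5_write_ios(4, 3, 2))     # 2K chunks:   {'reads': 0, 'writes': 3}
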
> 
> ===> fyi: If reliability is one of the things pushing you away from raid-1
> 
> A 2 disk raid-1 is more reliable than a 3-disk raid-5.
> 
> The math is, assume each of your drives has a one in 1000 chance of
> dying on a specific day.
> 
> So a raid-1 has a 1 in a million chance of a dual failure on that same
> specific day.
> 
> And a raid-5 would have 3 in a million chances of a dual failure on
> that same specific day.  ie. drive 1 and 2 can fail that day, or 1 and
> 3, or 2 and 3.
> 
> So a 2 drive raid-1 is 3 times as reliable as a 3-drive raid-5.
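
(Spelling that arithmetic out as a quick sketch -- the one-in-1000 daily
failure probability is just the illustrative number used above, and the
drives are assumed to fail independently:)

  from itertools import combinations

  # Chance that some pair of drives fails on the same day, assuming
  # independent failures with daily probability p (small-p approximation).
  def dual_failure_prob(n_drives, p=1e-3):
      return sum(p * p for _ in combinations(range(n_drives), 2))

  print(dual_failure_prob(2))   # 2-disk raid-1: 1e-06
  print(dual_failure_prob(3))   # 3-disk raid-5: 3e-06, three times as likely
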
> 
> If raid-1 still makes you uncomfortable, then go with a 3-disk mirror
> (raid 1 or raid 10 depending on what you need.)
> 
> You can get 2TB sata drives now for about $100 on sale, so you could
> do a 2 TB 3-disk raid-1 for $300.  Not a bad price at all in my
> opinion.
> 
> fyi: I don't know if "enterprise" drives cost more or not.  But it is

They do. They cost about twice as much. 

> important you use those in a raid setup.  The reason being normal
> desktop drives have retry logic built into the drive that can take
> from 30 to 120 seconds.  Enterprise drives have fast fail logic that
> allows a media error to rapidly be reported back to the kernel so that
> it can read that data from the alternate drives available in a raid.

You're repeating what WD says about their enterprise drives versus
desktop drives. I'm pretty sure that they believe what they are saying
to be true, and they have probably done tests that support their
theory. But for Linux it simply isn't true.

WD apparently tested their drives with a certain unnamed operating
system. That operating system may wait for up to two minutes for a
drive to report "bad block" or "successfully remapped this block, and
here is your data".

In my experience, it is unlikely that a desktop user will sit behind
his/her workstation for two minutes waiting for the screen to unfreeze
while the drive goes into deep recovery. The reset button will have
been pressed by that time, on Linux /and/ that other OS.


Moreover, Linux uses a 30 second timeout. If a drive doesn't respond
within 30 seconds, it will be reset and the request tried again. I
don't think the drive will resume the "deep recovery" procedure where
it left off after a reset-identify-reread cycle; it will start all
over.
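
(That timeout is per-device and visible in sysfs; a rough sketch, where
"sda" is just an example and writing a new value needs root:)

  # Rough sketch: read (and optionally set) the SCSI command timeout in seconds.
  def scsi_timeout(dev="sda", new_seconds=None):
      path = "/sys/block/%s/device/timeout" % dev
      if new_seconds is not None:          # needs root
          with open(path, "w") as f:
              f.write(str(new_seconds))
      with open(path) as f:
          return int(f.read().strip())

  print(scsi_timeout("sda"))   # typically 30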

SCSI disks have this all figured out: there you can use standard
commands to set the maximum recovery time. If you set it to "20ms" the
drive can calculate that it has ONE retry opportunity on the next
revolution (or two if it spins at more than xxx RPM) and nothing else.

WD claims a RAID array might quickly switch to a different drive if it
knows the block cannot be read from one drive. This is true. But at
least for Linux software raid, the drive will immediately be bumped
from the array, and never be used/read/written again until it is
replaced.

Now they might have a point there. For a drive with a limited number
of bad blocks, it might be MUCH better to mark the drive as "in
desperate need of replacement" instead of "failed". One thing you can
do to help the drive is to rewrite the bad sectors with the
recalculated data; the drive can then remap those sectors.
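
(With Linux md you can trigger exactly that for a still-working member by
starting a scrub; during a check/repair pass md rewrites sectors it cannot
read using the redundant data. A rough sketch -- "md0" is just an example
and this needs root:)

  # Rough sketch: kick off an md scrub; progress shows up in /proc/mdstat.
  def start_md_scrub(md="md0", action="check"):
      # "check" reads everything and rewrites unreadable sectors from
      # redundancy; "repair" additionally rewrites parity mismatches.
      with open("/sys/block/%s/md/sync_action" % md, "w") as f:
          f.write(action)

  start_md_scrub("md0", "repair")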

Much too often we see RAID arrays where a drive fails, gets evicted
from the RAID, and everything keeps on working, so nobody wakes up.
Only after a second drive fails do things stop working and the data
recovery company gets called into action. Often we then have a drive
with a few bad blocks and months-old data, and a totally failed drive
which is necessary for a full recovery. It's much better to keep the
failing drive in the array and up to date during the time that you're
pushing the operator to get it replaced.

	Roger. 

-- 
** R.E.Wolff@...Wizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
**    Delftechpark 26 2628 XH  Delft, The Netherlands. KVK: 27239233    **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement. 
Does it sit on the couch all day? Is it unemployed? Please be specific! 
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ