Message-ID: <467BFF12.60200@dgreaves.com>
Date: Fri, 22 Jun 2007 17:55:46 +0100
From: David Greaves <david@...eaves.com>
To: Bill Davidsen <davidsen@....com>
Cc: david@...g.hm, Neil Brown <neilb@...e.de>,
linux-kernel@...r.kernel.org, linux-raid@...r.kernel.org
Subject: Re: limits on raid
Bill Davidsen wrote:
> David Greaves wrote:
>> david@...g.hm wrote:
>>> On Fri, 22 Jun 2007, David Greaves wrote:
>> If you end up 'fiddling' in md because someone specified
>> --assume-clean on a raid5 [in this case just to save a few minutes
>> *testing time* on system with a heavily choked bus!] then that adds
>> *even more* complexity and exception cases into all the stuff you
>> described.
>
> A "few minutes?" Are you reading the times people are seeing with
> multi-TB arrays? Let's see, 5TB at a rebuild rate of 20MB/s... three days.
Yes. But we are talking initial creation here.
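To spell that sum out (rough numbers, assuming a flat 20MB/s across 5TB):

  # back-of-the-envelope resync time
  size = 5 * 10**12             # 5TB in bytes
  rate = 20 * 10**6             # 20MB/s in bytes per second
  print(size / rate / 86400.0)  # ~2.9 days
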
> And as soon as you believe that the array is actually "usable" you cut
> that rebuild rate, perhaps in half, and get dog-slow performance from
> the array. It's usable in the sense that reads and writes work, but for
> useful work it's pretty painful. You either fail to understand the
> magnitude of the problem or wish to trivialize it for some reason.
I do understand the problem and I'm not trying to trivialise it :)
I _suggested_ that it's worth thinking things through rather than jumping straight
to "oh, we can code up a clever algorithm that keeps track of which stripes have
valid parity and which don't, optimise the read/copy/write for valid stripes, use
the raid6-type read-all/write-all for invalid stripes, and then write a bit extra
in the check code to set the bitmaps..."
Phew - and that lets us run the array at semi-degraded performance (raid6-like)
for 3 days rather than either waiting before we put it into production or
running it very slowly.
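To make it concrete, the bookkeeping being hand-waved there looks roughly like
this - a toy sketch with invented names, nothing to do with the actual md code:

  # Toy model of the proposed per-stripe "parity valid" bitmap.
  class Raid5Model:
      def __init__(self, nr_stripes):
          self.parity_valid = [False] * nr_stripes   # the bitmap

      def write(self, stripe):
          if self.parity_valid[stripe]:
              # normal raid5 path: read old data + old parity,
              # xor out the old, xor in the new, write both back
              self.read_modify_write(stripe)
          else:
              # parity never initialised: write the whole stripe
              # (the raid6-style read-all/write-all), which leaves
              # the parity correct as a side effect
              self.reconstruct_write(stripe)
              self.parity_valid[stripe] = True

      def check(self, stripe):
          # the "bit extra in the check code": skip stripes whose
          # parity was never written instead of reporting mismatches
          return "checked" if self.parity_valid[stripe] else "skipped"

      def read_modify_write(self, stripe):
          pass   # placeholder

      def reconstruct_write(self, stripe):
          pass   # placeholder
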
Now we run this system for 3 years and we've saved 3 days - hmmm, IS IT WORTH IT?
And what happens during those 3 years when a disk fails? The scheme doesn't
apply then - it's 3 days to rebuild, like it or not.
> By delaying parity computation until the first write to a stripe only
> the growth of a filesystem is slowed, and all data are protected without
> waiting for the lengthy check. The rebuild speed can be set very low,
> because on-demand rebuild will do most of the work.
I am not saying you are wrong.
I'm merely asking whether the benefit outweighs the added complexity.
If the benefit applied 24x7 then sure - e.g. using hardware assist in the raid
calcs - that would be very useful indeed.
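For what it's worth, the "raid calcs" here are essentially the xor of the data
chunks in each stripe - the bit dedicated hardware can offload. A toy version,
purely illustrative:

  # Toy raid5 parity: xor the data chunks of a stripe, byte by byte.
  def parity(chunks):
      p = bytearray(len(chunks[0]))
      for chunk in chunks:
          for i, byte in enumerate(chunk):
              p[i] ^= byte
      return bytes(p)

  # three 4-byte data chunks -> one 4-byte parity chunk
  print(parity([b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xff\x00\xff\x00"]))
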
>> I'm very much for the fs layer reading the lower block structure so I
>> don't have to fiddle with arcane tuning parameters - yes, *please*
>> help make xfs self-tuning!
>>
>> Keeping life as straightforward as possible low down makes the upwards
>> interface more manageable and that goal more realistic...
>
> Those two paragraphs are mutually exclusive. The fs can be simple
> because it rests on a simple device, even if the "simple device" is
> provided by LVM or md. And LVM and md can stay simple because they rest
> on simple devices, even if they are provided by PATA, SATA, nbd, etc.
> Independent layers make each layer more robust. If you want to
> compromise the layer separation, some approach like ZFS with full
> integration would seem to be promising. Note that layers allow
> specialized features at each point, trading integration for flexibility.
That's a simplistic summary.
You *can* loosely couple the layers. But you can also enrich the interface and
couple them more tightly - XFS is capable (I guess) of understanding md more
fully than, say, ext2 does.
XFS would still work on a less 'talkative' block device where performance wasn't
as important (USB flash maybe, dunno).
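As a concrete example of the sort of 'talking' I mean: XFS wants the md chunk
size and data-disk count so it can align its allocation, which today means
working out mkfs.xfs's su/sw values by hand - something like this (the helper
is purely illustrative; the -d su=/sw= options are real):

  # Illustrative only: the stripe-unit/stripe-width sum you feed to
  # mkfs.xfs (-d su=...,sw=...) for an md raid5.
  def xfs_stripe_opts(chunk_kb, nr_disks):
      data_disks = nr_disks - 1          # raid5 loses one disk to parity
      return "su=%dk,sw=%d" % (chunk_kb, data_disks)

  # e.g. a 4-disk raid5 with the default 64k chunk:
  print("mkfs.xfs -d " + xfs_stripe_opts(64, 4))   # -> su=64k,sw=3
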
> My feeling is that full integration and independent layers each have
> benefits, as you connect the layers to expose operational details you
> need to handle changes in those details, which would seem to make layers
> more complex.
Agreed.
> What I'm looking for here is better performance in one
> particular layer, the md RAID5 layer. I like to avoid unnecessary
> complexity, but I feel that the current performance suggests room for
> improvement.
I agree there is room for improvement.
I suggest it may be more fruitful to write a tool called "raid5prepare" that
writes zeroes (or ones, as appropriate) to all component devices, after which
--assume-clean can be used without concern. It could check whether the devices
are scsi or whatever and take advantage of the very fast block writes such
hardware can do.
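Something like the following, say - only a sketch, with made-up device names
and no safety checks; the point is that all-zero data makes the all-zero parity
correct, so --assume-clean is then genuinely safe:

  #!/usr/bin/env python
  # "raid5prepare" sketch: zero every component device, then create the
  # array with --assume-clean and skip the initial resync entirely.
  import subprocess

  devices = ["/dev/sdb1", "/dev/sdc1", "/dev/sdd1"]   # made-up names

  for dev in devices:
      # dumb version: stream zeroes with dd (it exits non-zero when it
      # hits the end of the device, which is fine here); a smarter tool
      # would use whatever fast block-zeroing the hardware offers
      subprocess.call(["dd", "if=/dev/zero", "of=" + dev, "bs=1M"])

  subprocess.check_call(
      ["mdadm", "--create", "/dev/md0", "--level=5",
       "--raid-devices=%d" % len(devices), "--assume-clean"] + devices)
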
David