Date:	Fri, 15 Jul 2011 18:23:48 +0100
From:	Ric Wheeler <rwheeler@...hat.com>
To:	david@...g.hm
CC:	Chris Mason <chris.mason@...cle.com>, NeilBrown <neilb@...e.de>,
	Nico Schottelius <nico-lkml-20110623@...ottelius.org>,
	LKML <linux-kernel@...r.kernel.org>,
	linux-btrfs <linux-btrfs@...r.kernel.org>,
	Alasdair G Kergon <agk@...hat.com>
Subject: Re: Mis-Design of Btrfs?

On 07/15/2011 06:01 PM, david@...g.hm wrote:
> On Fri, 15 Jul 2011, Ric Wheeler wrote:
>
>> On 07/15/2011 05:23 PM, david@...g.hm wrote:
>>> On Fri, 15 Jul 2011, Chris Mason wrote:
>>>
>>>> Excerpts from Ric Wheeler's message of 2011-07-15 08:58:04 -0400:
>>>>> On 07/15/2011 12:34 PM, Chris Mason wrote:
>>>>
>>>> By bubble up I mean that if you have multiple layers capable of doing
>>>> retries, the lowest levels would retry first.  Basically by the time we
>>>> get an -EIO_ALREADY_RETRIED we know there's nothing the lower levels can
>>>> do to help.
>>>
>>> the problem with doing this is that it can end up stalling the box for 
>>> significant amounts of time while all the retries happen.
>>>
>>> we already see this happening today where a disk read failure is retried 
>>> multiple times by the disk, multiple times by the raid controller, and then 
>>> multiple times by Linux, resulting in multi-minute stalls when you hit a 
>>> disk error in some cases.
>>>
>>> having the lower layers do the retries automatically runs the risk of making 
>>> this even worse.
>>>
>>> This needs to be throttled by some layer that can see the entire picture 
>>> (either by cutting off the retries after a set number or after some time, 
>>> or by spacing out the retries to allow other requests to get in and let 
>>> the box do useful work in the meantime).
>>>
>>> David Lang
>>>
>>
>> That should not be an issue - we have a "fast fail" path for IO that should 
>> avoid retrying just for those reasons (i.e., for multi-path or when 
>> recovering a flaky drive).
>>
>> This is not a scheme for unbounded retries. If you have a 3 disk mirror in 
>> RAID1, you would read the data no more than 2 extra times and almost never 
>> more than once.  That should be *much* faster than the multi-second stall 
>> you see while waiting for a SCSI timeout to fire, etc.
>
> the issue is what happens when you stack things.
>
> what if you have a 3-way raid1 where each leg is a 4-disk raid6?
>
> so you have 3 raid1 retries * N raid6 retries. depending on the order you do 
> the retries in, and how long it takes each try to fail, this could start to 
> take significant amounts of time.
>
> if you do a retry at the lower level first, the raid6 could try several 
> different ways to combine the drives to get the valid data (disks 1,2; 2,3; 
> 3,4; 1,3; 1,4; 2,4; 1,2,3; 1,2,4; 1,3,4; 2,3,4); add more disks and it gets 
> worse fast.
>
> add more layers and you multiply the number of possible retries.
>
> my guess is that changing to a different method at the upper level is going to 
> avoid the problem area faster than doing so at a lower level (because there is 
> less hardware in common with the method that just gave the wrong answer).
>
> David Lang

At some point, the question is why you would do that.  Two parity drives for 
each 4-drive RAID-6 set, mirrored 3 times (12 drives in total for only 2 drives' 
worth of data)? Better off doing a 4-way mirror :)
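
(To put numbers on it: each 4-drive RAID-6 leg stores 2 drives of data behind 2 
parity drives, so mirroring three such legs spends 12 drives for 2 drives of 
usable capacity, roughly 17% efficiency, while a plain 4-way mirror gets 25% 
efficiency and still survives any 3 drive failures.)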

I think that you are still missing the point.

If the lowest layer can repair the data, it would return the first validated 
answer. The major time sink would be the IO to read the sectors from each of the 
4 drives in your example, not computing the various combinations of parity or 
validating (all done in memory).
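
Purely to illustrate the in-memory part, a rough user-space sketch with made-up 
names (not actual md/dm code): once the surviving blocks and the P parity are 
sitting in buffers, rebuilding a missing block is just an XOR pass over memory; 
the Q/Galois-field combinations cost more arithmetic, but no extra IO either.

        /* Illustrative sketch: rebuild one missing data block from the P (XOR)
         * parity plus the surviving data blocks, entirely in memory. */
        #include <stddef.h>
        #include <stdint.h>

        void rebuild_from_p(uint8_t *missing, const uint8_t *parity,
                            const uint8_t *const *survivors,
                            size_t nsurvivors, size_t blocksize)
        {
                size_t i, d;

                for (i = 0; i < blocksize; i++) {
                        uint8_t x = parity[i];

                        for (d = 0; d < nsurvivors; d++)
                                x ^= survivors[d][i];
                        missing[i] = x;         /* no further disk IO needed */
                }
        }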

If the RAID-6 layer failed to repair it, the mirror layer would do the same with 
each of the other legs, each of which would issue the IO to its RAID-6 member 
drives exactly once as well (and then verify in memory).
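
A rough sketch of that read path, with hypothetical helper names standing in for 
the real layers, just to show why the IO stays bounded: each mirror leg issues 
one read per member disk, all the combination and validation work happens on 
buffers in memory, and the mirror layer only moves on to the next leg when that 
fails.

        #include <errno.h>
        #include <stdbool.h>

        struct stripe { unsigned char data[4096]; };

        /* Stubs standing in for the real RAID-6 and validation code. */
        static int raid6_read_stripe(int leg, struct stripe *s)
        {
                (void)leg; (void)s;
                return 0;       /* pretend one read per member disk succeeded */
        }

        static int raid6_try_combinations(const struct stripe *s, void *out)
        {
                (void)s; (void)out;
                return 0;       /* pretend an in-memory combination rebuilt the data */
        }

        static bool result_validates(const void *out)
        {
                (void)out;
                return true;    /* pretend the upper-layer checksum matched */
        }

        int mirror_read_validated(int nlegs, void *out)
        {
                struct stripe s;
                int leg;

                for (leg = 0; leg < nlegs; leg++) {
                        if (raid6_read_stripe(leg, &s))
                                continue;       /* leg unreadable, try the next one */
                        if (raid6_try_combinations(&s, out) == 0 &&
                            result_validates(out))
                                return 0;       /* first validated answer wins */
                }
                return -EIO;                    /* each leg read exactly once, all failed */
        }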

Ric

