Message-ID: <alpine.DEB.1.10.0905030904230.15782@asgard>
Date: Sun, 3 May 2009 09:13:58 -0700 (PDT)
From: david@...g.hm
To: James Bottomley <James.Bottomley@...senPartnership.com>
cc: Willy Tarreau <w@....eu>,
Bart Van Assche <bart.vanassche@...il.com>,
Andrew Morton <akpm@...ux-foundation.org>,
Philipp Reisner <philipp.reisner@...bit.com>,
linux-kernel@...r.kernel.org, Jens Axboe <jens.axboe@...cle.com>,
Greg KH <gregkh@...e.de>, Neil Brown <neilb@...e.de>,
Sam Ravnborg <sam@...nborg.org>, Dave Jones <davej@...hat.com>,
Nikanth Karthikesan <knikanth@...e.de>,
Lars Marowsky-Bree <lmb@...e.de>,
Kyle Moffett <kyle@...fetthome.net>,
Lars Ellenberg <lars.ellenberg@...bit.com>
Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters
On Sun, 3 May 2009, James Bottomley wrote:
> On Sun, 2009-05-03 at 08:48 -0700, david@...g.hm wrote:
>> On Sun, 3 May 2009, James Bottomley wrote:
>>
>>> On Sun, 2009-05-03 at 08:22 -0700, david@...g.hm wrote:
>>>> On Sun, 3 May 2009, James Bottomley wrote:
>>>>
>
>>>>> This corruption situation isn't unique to replication ... any time you
>>>>> may potentially have allowed both sides to write to a data store, you
>>>>> get it, that's why it's the job of the HA harness to sort out whether a
>>>>> split brain happened and what to do about it *first*.
>>>>
>>>> But you can have packets sitting in the network buffers waiting to get
>>>> to the remote machine, and once the connection is re-established those
>>>> packets will go out. No remounting needed, just connectivity restored.
>>>> (This isn't as bad as if the system re-syncs to the temporarily
>>>> unavailable drive by itself, but it can still corrupt things.)
>>>
>>> This is an interesting thought, but not what happens. As soon as the HA
>>> harness stops replication, which it does at the instant failure is
>>> detected, the closure of the socket kills all the in-flight network
>>> data.
>>>
>>> There is a variant of this problem that occurs with device mapper
>>> queue_if_no_path (on local disks) which does exactly what you say (keeps
>>> unsaved data around in the queue forever), but that's fixed by not using
>>> queue_if_no_path for HA. Maybe that's what you were thinking of?
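(For anyone following along: queue_if_no_path is a dm-multipath setting,
and turning it off makes I/O fail immediately once all paths are gone
instead of being queued indefinitely. A minimal sketch, with a purely
illustrative map name of mpath0:

    # switch an existing map to fail-fast behaviour at runtime
    dmsetup message mpath0 0 "fail_if_no_path"

    # or set it persistently in /etc/multipath.conf:
    # defaults {
    #     no_path_retry fail
    # }

With that, the queue of unsaved data described above never builds up.)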
>>
>> Is there a mechanism in nbd that prevents it from being mounted more than
>> once? If so, it could have the same protection that DRBD has; if not, it
>> is possible for it to be mounted in more than one place and therefore get
>> corrupted.
>
> That's not really relevant, is it? An ordinary disk doesn't have this
> property either. Mediating simultaneous access is the job of the HA
> harness. If the device does it for you, fine, the harness can make use
> of that (as long as the device gets it right), but all good HA harnesses
> sort out the usual case where the device doesn't do it.
With a local disk you can mount it multiple times, write to it from all of
the mounts, and not have any problems, because all access goes through a
common layer.
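
For example (filesystem and device names are only illustrative; this
assumes something like ext3 on /dev/sdb1):

    mount /dev/sdb1 /mnt/a
    mount /dev/sdb1 /mnt/b    # the kernel reuses the existing superblock
    echo test > /mnt/a/file
    cat /mnt/b/file           # prints "test"; both mounts share one page cache

Both mount points end up on the same in-kernel superblock, so writes from
either side stay coherent.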

You would have this sort of problem if you used one partition as part of
multiple md arrays, but the md layer itself would detect and prevent that
(because it would see both arrays). Again, in a multi-machine situation
you don't have a common layer to do the detection.
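
For instance, something like this (device names are hypothetical, assuming
/dev/sdb1 is already a member of a running /dev/md0):

    mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
    # mdadm refuses with something like:
    #   mdadm: Cannot open /dev/sdb1: Device or resource busy

md claims its component devices exclusively, so a second array can't grab
the partition out from under the first.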

You can rely on the HA layer to detect and prevent all of this (and
apparently there are people doing that; I wasn't aware of it), but I've
seen enough problems with every HA implementation I've dealt with over
the years (both open source and commercial) that I would be very
uncomfortable depending on it exclusively. Having the disk replication
layer detect this adds a significant amount of safety in my eyes.
>>> There are commercial HA products based on md/nbd, so I'd say it's also
>>> hardened for harsher environments
>>
>> which ones?
>
> SteelEye LifeKeeper. It actually supports both drbd and md/nbd.
Thanks for the info.
David Lang