linux-kernel - Re: [PATCH 00/16] DRBD: a block device for HA clusters


Date:	Tue, 05 May 2009 14:09:45 +0000
From:	James Bottomley <James.Bottomley@...senPartnership.com>
To:	Philipp Reisner <philipp.reisner@...bit.com>
Cc:	david@...g.hm, Willy Tarreau <w@....eu>,
	Bart Van Assche <bart.vanassche@...il.com>,
	Andrew Morton <akpm@...ux-foundation.org>,
	linux-kernel@...r.kernel.org, Jens Axboe <jens.axboe@...cle.com>,
	Greg KH <gregkh@...e.de>, Neil Brown <neilb@...e.de>,
	Sam Ravnborg <sam@...nborg.org>, Dave Jones <davej@...hat.com>,
	Nikanth Karthikesan <knikanth@...e.de>,
	Lars Marowsky-Bree <lmb@...e.de>,
	Kyle Moffett <kyle@...fetthome.net>,
	Lars Ellenberg <lars.ellenberg@...bit.com>
Subject: Re: [PATCH 00/16] DRBD: a block device for HA clusters

On Tue, 2009-05-05 at 10:21 +0200, Philipp Reisner wrote:
> > > When you do asynchronous replication, how do you ensure that implicit
> > > write-after-write dependencies in the stream of writes you get from
> > > the file system above, are not violated on the secondary ?
> >
> > Are you telling me drbd doesn't currently do this?
> >
> 
> No I am not. DRBD does exactly this!
> But I am wondering how that is achieved in the MD/NBD stack when running 
> in async mode.

The explanation is below.

> The issue is covered since the early days in DRBD, (back in 2000).
> The issue, and the solution we have in DRBD is described in this paper:
> 
> http://www.drbd.org/fileadmin/drbd/publications/drbd_paper_for_NLUUG_2001.pdf
> 
> > The way nbd does it (in the updated tools) is to use DIRECT_IO and
> > fsync.
> 
> Is that available in the existing tools ? -- Are the updated tools
> something that will be available in the future ?

It's in the existing tools.

> Are you telling me md/nbd (async) doesn't currently do this ?

I just described how it does this ... I don't quite see how that
translates into telling you it doesn't do this.

> > > There might be a disk scheduler on the secondary.
> >
> > There usually is a disk scheduler ... you just have to take the required
> > action to persuade it to preserve ordering ... a simplistic way of doing
> > this is to switch to the noop scheduler.
> 
> The issue actually goes further down the stack. Not only might the
> in-kernel disk scheduler reorder something; the driver and finally the
> drive might do so as well.
> 
> What we have in DRBD boils down to:
> 
> * We obey all possible write after write dependencies in the stream of
>   writes we get from the upper layers. And generate DRBD internal
>   reorder barriers for the packet stream.
> * On the secondary node we impose these barriers onto the stream of writes
>   submitted to the stack below us by either:
> 
>    - Let previously submitted write-IO drain before we submit write-IO after
>      such a DRBD barrier. (We have had that since 2000 or so.)
> 
>    - Additionally issue a blkdev_issue_flush()
> 
>    - Use write requests with BIO_RW_BARRIER. This method has two advantages:
>      We can continue to submit writes after the DRBD internal barrier
>      immediately, and the number of requests with BIO_RW_BARRIER can be
>      further reduced. 
>      See section 6 of
>      http://www.drbd.org/fileadmin/drbd/publications/drbd8.pdf
>      for more details, and nice illustrations.

There's a slight error in there ... we don't use ordered tags for
barriers (yet).  I don't think it will really matter because the main
domain of ordering problems is the scheduler, which REQ_BARRIER does
cope with; it just means the queue drains for a barrier.

>      Unfortunately only high-end SAN devices seem to benefit from this
>      method. For most in-machine disk controllers this method does not
>      achieve the highest throughput.
> 
> Expressed in other words: 
> We allow reordering on the secondary node to an extent so that we can
> guarantee that no implicit write-after-write dependencies are violated.
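
[Illustration] The barrier scheme in the quoted list above can be pictured
with a small user-space simulation. This is only a sketch and not DRBD code:
the tuple format, the function names, and the use of a shuffle to stand in
for scheduler/driver/drive reordering are all assumptions made for the
example. It also assumes that writes between two barriers touch distinct
blocks, i.e. that they carry no write-after-write dependency on each other.

```python
import random


def split_into_epochs(stream):
    """Group a write stream into epochs separated by barriers.

    `stream` holds ("write", block, data) and ("barrier",) tuples
    (a hypothetical format chosen for this sketch).
    """
    epochs, current = [], []
    for item in stream:
        if item[0] == "barrier":
            if current:
                epochs.append(current)
            current = []
        else:
            current.append(item)
    if current:
        epochs.append(current)
    return epochs


def apply_on_secondary(stream, rng):
    """Apply writes epoch by epoch, draining each epoch before the next."""
    disk = {}
    for epoch in split_into_epochs(stream):
        reordered = epoch[:]
        # Stand-in for whatever reordering the lower layers may perform.
        rng.shuffle(reordered)
        for _, block, data in reordered:
            disk[block] = data
        # Drain point: every write of this epoch has completed before
        # any write of the following epoch is submitted.
    return disk
```

Reordering inside an epoch cannot change the final contents under the
stated assumption, while the drain between epochs keeps a later write to
the same block from being overtaken by an earlier one.
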
> 
> Coming back to the idea of disabling the Linux IO scheduler. It might
> solve the issue for some devices, but it is not guaranteed to solve it.

I think you'll find the dio/fsync method above actually does solve all
of these issues (mainly because it enforces the semantics from top to
bottom in the stack).  I agree one could use more elaborate semantics
like you do for drbd, but since the simple ones worked efficiently for
md/nbd, there didn't seem to be much point.
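
[Illustration] The dio/fsync idea can be sketched from user space as
follows. This is a minimal sketch and not taken from the nbd tools: the
function name, the (offset, data) pair format, and the policy of flushing
after every single write are assumptions for the example. The `direct`
flag adds O_DIRECT, which is Linux-specific and requires the caller to
supply suitably aligned buffers, offsets, and lengths, so it is off by
default here.

```python
import os


def replay_in_order(path, writes, direct=False):
    """Write (offset, data) pairs one at a time, flushing after each.

    The next write is issued only after fsync() has returned for the
    previous one, so the two are never in flight at the same time.
    """
    flags = os.O_WRONLY | os.O_CREAT
    if direct:
        # Bypass the page cache; alignment is the caller's responsibility.
        flags |= os.O_DIRECT
    fd = os.open(path, flags, 0o600)
    try:
        for offset, data in writes:
            written = os.pwrite(fd, data, offset)
            if written != len(data):
                raise OSError("short write at offset %d" % offset)
            # Ask the kernel to push the data to stable storage
            # before the next dependent write is submitted.
            os.fsync(fd)
    finally:
        os.close(fd)
```

Flushing after every write is the simplest policy and trades throughput
for ordering; batching writes that have no dependency on each other
between flushes would be the obvious refinement.
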

James

