Message-ID: <20100817131915.GB2963@lst.de>
Date: Tue, 17 Aug 2010 15:19:15 +0200
From: Christoph Hellwig <hch@....de>
To: Tejun Heo <tj@...nel.org>
Cc: Christoph Hellwig <hch@....de>, jaxboe@...ionio.com,
linux-fsdevel@...r.kernel.org, linux-scsi@...r.kernel.org,
linux-ide@...r.kernel.org, linux-kernel@...r.kernel.org,
linux-raid@...r.kernel.org, James.Bottomley@...e.de, tytso@....edu,
chris.mason@...cle.com, swhiteho@...hat.com,
konishi.ryusuke@....ntt.co.jp, dm-devel@...hat.com, vst@...b.net,
jack@...e.cz, rwheeler@...hat.com, hare@...e.de
Subject: Re: [PATCHSET block#for-2.6.36-post] block: replace barrier with sequenced flush
On Tue, Aug 17, 2010 at 11:59:38AM +0200, Tejun Heo wrote:
> I'm not really convinced that would help much. If bisecting can point
> to the conversion as the culprit for whatever kind of failure,
> wouldn't that be enough? No matter what we do the conversion will be
> a single step thing. If we make the filesystems enforce the ordering
> first and then relax ordering in the block layer, bisection would
> still just point at the later patch. The same goes for md/dm, the
> best we can find out would be whether the conversion is correct or not
> anyway.
The filesystems already enforce the ordering, except reiserfs, which
opts out if the barrier option is set.
> I'm not against restructuring the patchset if it makes more sense but
> it just feels like it would be a bit pointless effort (and one which
> would require much tighter coordination among different trees) at this
> point. Am I missing something?
What other trees do you mean? The conversions of the 8 filesystems
that actually support barriers need to go through this tree anyway
if we want to be able to test them. Also, the changes in the filesystems
are absolutely minimal - it's basically just
s/WRITE_BARRIER/WRITE_FUA_FLUSH/ after my initial patch to kill BH_Ordered,
and removing about 10 lines of code in reiserfs.
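To illustrate, in a filesystem's commit path the conversion boils down to
something like the following (a minimal sketch using the flag name from that
patch; the submit_bh() call site and the commit_bh name are just made up for
the example):

        /* before: rely on the barrier primitive for ordering */
        submit_bh(WRITE_BARRIER, commit_bh);

        /* after: explicit preflush + FUA write, no ordering implied */
        submit_bh(WRITE_FUA_FLUSH, commit_bh);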
> > We only need the special md_flush_request handling for
> > empty REQ_FLUSH requests. REQ_WRITE | REQ_FLUSH just need the
> > flag propagated to the underlying devices.
>
> Hmm, not really, the WRITE should happen after all the data in cache
> are committed to NV media, meaning that empty FLUSH should already
> have finished by the time the WRITE starts.
You're right.
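To spell the ordering out for a REQ_WRITE | REQ_FLUSH bio in a stacking
driver, something along these lines is needed (an illustrative sketch only,
not md_flush_request(); everything prefixed sketch_ is made up):

static void sketch_flush_endio(struct bio *bio, int error)
{
        complete(bio->bi_private);
}

static void sketch_preflush_then_write(struct block_device *bdev,
                                       struct bio *data_bio)
{
        DECLARE_COMPLETION_ONSTACK(done);
        struct bio *flush = bio_alloc(GFP_NOIO, 0);

        flush->bi_bdev = bdev;
        flush->bi_end_io = sketch_flush_endio;
        flush->bi_private = &done;

        /* 1. empty REQ_FLUSH: commit everything already in the cache */
        submit_bio(WRITE_FLUSH, flush);
        wait_for_completion(&done);
        bio_put(flush);

        /* 2. only now may the data write go out; the preflush part of
         *    the original request has already been taken care of */
        data_bio->bi_rw &= ~REQ_FLUSH;
        generic_make_request(data_bio);
}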
> >> while ((bio = bio_list_pop(writes))) {
> >> - if (unlikely(bio_empty_barrier(bio))) {
> >> + if ((bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio)) {
> >
> > I kept bio_empty_barrier as bio_empty_flush, which actually is a quite
> > useful macro for the bio based drivers.
>
> Hmm... maybe. The reason why I removed bio_empty_flush() was that
> except for the front-most sequencer (block layer for all the request
> based ones and the front-most make_request for bio based ones), it
> doesn't make sense to see REQ_FLUSH + data bios. They should be
> sequenced at the front-most stage anyway, so I didn't have much use
> for them. Those code paths couldn't deal with REQ_FLUSH + data bios
> anyway.
The current bio_empty_barrier is only used in dm, and indeed only makes
sense for make_request-based drivers. But I think it's a rather useful
helper for them. That said, it's not a big issue, and either way is
fine with me.
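For reference, the helper under discussion is tiny; written out just for
illustration (the name comes from this thread, not from a merged tree), it
would be:

static inline bool bio_empty_flush(struct bio *bio)
{
        return (bio->bi_rw & REQ_FLUSH) && !bio_has_data(bio);
}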
> >> + if (bio->bi_rw & REQ_FLUSH) {
> >> /*
> >> - * There can be just one barrier request so we use
> >> + * There can be just one flush request so we use
> >> * a per-device variable for error reporting.
> >> * Note that you can't touch the bio after end_io_acct
> >> */
> >> - if (!md->barrier_error && io_error != -EOPNOTSUPP)
> >> - md->barrier_error = io_error;
> >> + if (!md->flush_error)
> >> + md->flush_error = io_error;
> >
> > And we certainly do not need any special casing here. See my patch.
>
> I wasn't sure about that part. You removed store_flush_error(), but
> DM_ENDIO_REQUEUE should still have higher priority than other
> failures, no?
Which priority?
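If you mean that a DM_ENDIO_REQUEUE result should win over a plain I/O
error when recording md->flush_error, so that the flush gets retried rather
than failed outright, that would be something like the sketch below (just an
illustration of the precedence, not the actual dm code):

static void sketch_store_flush_error(struct mapped_device *md, int error)
{
        /* let a requeue override an earlier plain error; otherwise
         * only record the first error we see */
        if (error == DM_ENDIO_REQUEUE || !md->flush_error)
                md->flush_error = error;
}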
> >> +static void process_flush(struct mapped_device *md, struct bio *bio)
> >> {
> >> + md->flush_error = 0;
> >> +
> >> + /* handle REQ_FLUSH */
> >> dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);
> >>
> >> - bio_init(&md->barrier_bio);
> >> - md->barrier_bio.bi_bdev = md->bdev;
> >> - md->barrier_bio.bi_rw = WRITE_BARRIER;
> >> - __split_and_process_bio(md, &md->barrier_bio);
> >> + bio_init(&md->flush_bio);
> >> + md->flush_bio.bi_bdev = md->bdev;
> >> + md->flush_bio.bi_rw = WRITE_FLUSH;
> >> + __split_and_process_bio(md, &md->flush_bio);
> >
> > There's not need to use a separate flush_bio here.
> > __split_and_process_bio does the right thing for empty REQ_FLUSH
> > requests. See my patch for how to do this differenty. And yeah,
> > my version has been tested.
>
> But how do you make sure REQ_FLUSHes for preflush finish before
> starting the write?
Hmm, okay. I see how the special flush_bio makes the waiting easier;
let's see if Mike or others on the DM team have a better idea.
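To make the point concrete, the sequencing the separate flush_bio buys looks
roughly like this, extrapolated from the quoted hunk (error handling and the
empty-bio case are omitted, so treat it as a sketch rather than the actual
patch):

static void sketch_process_flush(struct mapped_device *md, struct bio *bio)
{
        md->flush_error = 0;

        /* drain everything that is already in flight */
        dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);

        /* issue the preflush on its own bio so there is something
         * concrete to wait on before the data goes out */
        bio_init(&md->flush_bio);
        md->flush_bio.bi_bdev = md->bdev;
        md->flush_bio.bi_rw = WRITE_FLUSH;
        __split_and_process_bio(md, &md->flush_bio);
        dm_wait_for_completion(md, TASK_UNINTERRUPTIBLE);

        /* only now hand on the data part of the original bio */
        if (bio_has_data(bio))
                __split_and_process_bio(md, bio);
}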