linux-kernel - Re: [PATCH 3/3] block: reimplement FLUSH/FUA to support merge

Message-ID: <20110124203155.GA32261@tux1.beaverton.ibm.com>
Date:	Mon, 24 Jan 2011 12:31:55 -0800
From:	"Darrick J. Wong" <djwong@...ibm.com>
To:	Tejun Heo <tj@...nel.org>
Cc:	Vivek Goyal <vgoyal@...hat.com>, axboe@...nel.dk, tytso@....edu,
	shli@...nel.org, neilb@...e.de, adilger.kernel@...ger.ca,
	jack@...e.cz, snitzer@...hat.com, linux-kernel@...r.kernel.org,
	kmannth@...ibm.com, cmm@...ibm.com, linux-ext4@...r.kernel.org,
	rwheeler@...hat.com, hch@....de, josef@...hat.com
Subject: Re: [PATCH 3/3] block: reimplement FLUSH/FUA to support merge

On Sun, Jan 23, 2011 at 11:25:26AM +0100, Tejun Heo wrote:
> Hello,
> 
> On Fri, Jan 21, 2011 at 01:56:17PM -0500, Vivek Goyal wrote:
> > > + * Currently, the following conditions are used to determine when to issue
> > > + * flush.
> > > + *
> > > + * C1. At any given time, only one flush shall be in progress.  This makes
> > > + *     double buffering sufficient.
> > > + *
> > > + * C2. Flush is not deferred if any request is executing DATA of its
> > > + *     sequence.  This avoids issuing separate POSTFLUSHes for requests
> > > + *     which shared PREFLUSH.
> > 
> > Tejun, did you mean "Flush is deferred" instead of "Flush is not deferred"
> > above?
> 
> Oh yeah, I did.  :-)
> 
> > IIUC, C2 might help only if requests which contain data are also going to 
> > issue postflush. Couple of cases come to mind.
> 
> That's true.  I didn't want to go too advanced on it.  I wanted
> something which is fairly mechanical (without intricate parameters)
> and effective enough for common cases.
> 
> > - If queue supports FUA, I think we will not issue POSTFLUSH. In that
> >   case issuing next PREFLUSH while data is in flight might make sense.
> >
> > - Even if queue does not support FUA and we are only getting requests
> >   with REQ_FLUSH then also waiting for data requests to finish before
> >   issuing next FLUSH might not help.
> > 
> > - Even if queue does not support FUA and say we have a mix of REQ_FUA
> >   and REQ_FLUSH, then this will help only if in a batch we have more
> >   than 1 request which is going to issue POSTFLUSH and those postflush
> >   will be merged.
> 
> Sure, not applying C2 and 3 if the underlying device supports REQ_FUA
> would probably be the most compelling change of the bunch; however,
> please keep in mind that issuing flush as soon as possible doesn't
> necessarily result in better performance.  It's inherently a balancing
> act between latency and throughput.  Even inducing artificial issue
> latencies is likely to help if done right (as the ioscheds do).
> 
> So, I think it's better to start with something simple and improve it
> with actual testing.  If the current simple implementation can match
> Darrick's previous numbers, let's first settle the mechanisms.  We can

Yep, the fsync-happy numbers more or less match... at least for 2.6.37:
http://tinyurl.com/4q2xeao

I'll give 2.6.38-rc2 a try later, though -rc1 didn't boot for me, so these
numbers are based on a backport to .37. :(

In general, the effect of this patchset is to change a 100% drop in fsync-happy
performance into a 20% drop.  As always, the higher the average flush time, the
more the storage system benefits from having flush coordination.  The only
exception to that is elm3b231_ipr, which is an md array of disks that are
attached to a controller that is now throwing errors, so I'm not sure I
entirely trust that machine's numbers.

As for elm3c44_sas, I'm not sure why enabling flushes always increases
performance, other than to say that I suspect it has something to do with
md-raid'ing disk trays together, because elm3a4_sas and elm3c71_extsas consist
of the same configuration of disk trays, only without the md.  I've also been
told by our storage folks that md atop raid trays is not really a recommended
setup anyway.

The long and short of it is that this latest patchset looks good and delivers
the behavior that I was aiming for. :)

> tune the latency/throughput balance all we want later.  Other than the
> double buffering constraint (which can be relaxed too but I don't think
> that would be necessary or a good idea) things can be easily adjusted
> in blk_kick_flush().  It's intentionally designed that way.
> 
> > - Ric Wheeler was once mentioning that there are boxes which advertise
> >   writeback cache but are battery backed so they ignore flush internally and
> >   signal completion immediately. I am not sure how prevalent those
> >   cases are but I think waiting for data to finish will delay processing
> >   of new REQ_FLUSH requests in pending queue for such array. There
> >   we will not anyway benefit from merging of FLUSH.
> 
> I don't really think we should design the whole thing around broken
> devices which incorrectly report writeback cache when it need not.
> The correct place to work around that is during device identification
> not in the flush logic.

elm3a4_sas and elm3c71_extsas advertise writeback cache yet the flush completion
times are suspiciously low.  I suppose it could be useful to disable flushes to
squeeze out that last bit of performance, though I don't know how one goes
about querying the disk array to learn if there's a battery behind the cache.
I guess the current mechanism (admin knob that picks a safe default) is good
enough.

> > Given that C2 is going to benefit primarily only if queue does not support
> > FUA and we have many requests with REQ_FUA set, will it make sense to
> > put additional checks for C2. At least a simple "queue supports FUA"
> > check might help.
> > 
> > In practice does C2 really help or we can get rid of it entirely?
> 
> Again, issuing flushes as fast as possible isn't necessarily better.
> It might feel counter-intuitive but it generally makes sense to delay
> flush if there are a lot of concurrent flush activities going on.
> Another related interesting point is that with flush merging,
> depending on workload, there's a likelihood that FUA, even if the
> device supports it, might result in worse performance than merged DATA
> + single POSTFLUSH sequence.
> 
> Thanks.
> 
> -- 
> tejun
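
For readers following the C1/C2 discussion above, here is a rough toy model of
the issue decision, written as a minimal sketch rather than the patch's actual
code. It uses C2 in its corrected wording ("Flush is deferred if any request
is executing DATA of its sequence"). The struct, its fields, and the function
name are all hypothetical and are not taken from the kernel source. The
relax_c2_on_fua flag models Vivek's suggestion of skipping C2 when the queue
supports FUA; it is not something the patch is stated to do. C3 is mentioned
in the thread but never spelled out, so it is left out here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy state; the fields are illustrative, not the kernel's. */
struct toy_flush_queue {
    bool flush_in_progress;      /* a flush is currently executing */
    unsigned int pending;        /* requests waiting for a flush to be issued */
    unsigned int data_in_flight; /* requests executing DATA of their sequence */
    bool supports_fua;           /* device handles REQ_FUA natively */
};

/*
 * Return true if a flush may be issued right now.
 *
 * relax_c2_on_fua is an assumption modelling the suggestion made in the
 * thread: ignore C2 when the queue supports FUA, since no POSTFLUSH would
 * be issued in that case.
 */
static bool toy_may_issue_flush(const struct toy_flush_queue *q,
                                bool relax_c2_on_fua)
{
    assert(q != NULL);

    /* Nothing is waiting for a flush, so there is nothing to issue. */
    if (q->pending == 0)
        return false;

    /* C1: at any given time, only one flush shall be in progress. */
    if (q->flush_in_progress)
        return false;

    /* C2: defer the flush while any request is executing DATA. */
    if (q->data_in_flight > 0 && !(relax_c2_on_fua && q->supports_fua))
        return false;

    return true;
}
```

With pending requests, no flush in progress, and no data in flight, the
function returns true. Setting data_in_flight to a nonzero value makes it
return false unless both supports_fua and relax_c2_on_fua are set.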