Message-Id: <1234521410.3795.117.camel@sebastian.kern.oss.ntt.co.jp>
Date: Fri, 13 Feb 2009 19:36:50 +0900
From: Fernando Luis Vázquez Cao
<fernando@....ntt.co.jp>
To: Eric Sandeen <sandeen@...hat.com>
Cc: Jan Kara <jack@...e.cz>, Theodore Tso <tytso@....EDU>,
Alan Cox <alan@...rguk.ukuu.org.uk>,
Pavel Machek <pavel@...e.cz>,
kernel list <linux-kernel@...r.kernel.org>,
Jens Axboe <jens.axboe@...cle.com>, fernando@....ac.jp,
Ric Wheeler <rwheeler@...hat.com>
Subject: Re: vfs: Add MS_FLUSHONFSYNC mount flag
On Fri, 2009-02-13 at 00:20 -0600, Eric Sandeen wrote:
> Fernando Luis Vázquez Cao wrote:
> > On Thu, 2009-02-12 at 11:13 -0600, Eric Sandeen wrote:
> >> Fernando Luis Vázquez Cao wrote:
> >>> This mount flag will be used to determine whether the block device's write
> >>> cache should be flushed or not on fsync()/fdatasync().
> >>>
> >>> Signed-off-by: Fernando Luis Vazquez Cao <fernando@....ntt.co.jp>
> >>> ---
> >> Again, apologies for chiming in late.
> >>
> >> But wouldn't it be better to make this a block device property rather
> >> than a new filesystem mount option?
> >>
> >> That way the filesystem can always do "the right thing" and call the
> >> blkdev flush on fsync.
> >>
> >> The block device *could* choose to ignore this in hardware if it knows
> >> it's built with a nonvolatile write cache or if it has no write cache.
> >>
> >> Somewhere in the middle, if an administrator knows they have a UPS they
> >> trust and hardware that stays connected to it, they could tune the bdev
> >> to ignore these flush requests.
> >>
> >> Also that way if you have 8 partitions on a battery-backed blockdev, you
> >> can tune it once, instead of needing to mount all 8 filesystems with the
> >> new option.
> >
> > The main reason I decided to go for the mount option approach is to be
> > consistent with what we do when it comes to write barriers. Treating one
> > as a mount option and the other as a (possibly) sysfs tunable property
> > seems a bit confusing to me.
>
> well... technically, I think barriers really *should* mean "don't
> reorder these writes, I need them this way for consistency" - and that
> is really specific to the fs implementation, isn't it? (we just happen
> to implement them as cache flushes) and so that is a per-fs setting, I
> think.
>
> Maybe there is no good argument for ignoring barriers on one fs, and
> implementing them on another, other than playing fast & loose &
> dangerous.... hrm.
>
> > Do you suggest using sysfs tunables instead?
>
> For a per-bdev flush setting, yes...
>
> I guess I'll have to try to convince myself one way or another whether
> barrier mount options are consistent with this view. :)
I went through the same process :), and finally concluded that in both
cases it all comes down to a trade-off between integrity (be it
filesystem integrity or data integrity) and speed. Since the desired
behavior could vary from filesystem to filesystem, and for the sake of
consistency with barriers, the mount option approach seemed to make more
sense.
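
To make that concrete, here is a rough sketch (illustrative only, not code
from the patches; example_fsync is a made-up name) of where such a
per-mount flag would be consulted in a filesystem's fsync path:

#include <linux/fs.h>
#include <linux/blkdev.h>

/*
 * Illustrative sketch only -- this is NOT code from the patches.
 * example_fsync() is a made-up ->fsync implementation showing where the
 * proposed MS_FLUSHONFSYNC per-mount flag would be consulted.
 */
static int example_fsync(struct file *file, struct dentry *dentry,
			 int datasync)
{
	struct inode *inode = dentry->d_inode;
	int ret;

	/* Synchronously write back the inode's dirty data and metadata
	 * (this simplified sketch ignores the datasync hint). */
	ret = write_inode_now(inode, 1);
	if (ret)
		return ret;

	/*
	 * Only ask the block device to flush its volatile write cache if
	 * the filesystem was mounted with the proposed flushonfsync
	 * option, i.e. MS_FLUSHONFSYNC is set on the superblock.
	 */
	if (inode->i_sb->s_flags & MS_FLUSHONFSYNC)
		ret = blkdev_issue_flush(inode->i_sb->s_bdev, NULL);

	return ret;
}

With your per-bdev alternative the same check would instead look at a
block device property, so the filesystem could issue the flush
unconditionally and let the block layer decide whether to honour it.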
> I guess sometimes you do have workloads where you simply want speed, and
> on a crash you start over. In this case you don't care about barriers
> (ordering constraints - you don't care about fs integrity if fsck or
> re-mkfs is ok) or flushing (caches - if you don't care about data
> integrity, you regenerate your results). That could vary from fs to fs....
>
> I'm just a little leery of the "dangerous" mount option proliferation, I
> guess.
If we can have distros make "barrier=1,flushonfsync" the default
setting and document these two mount options properly, explicitly
indicating the dangers of deviating from these defaults, I think we'll
be heading in the right direction.
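
For what it's worth, the option-string side could be as simple as
something like the following (again purely illustrative; the token table
and helper names are made up, only the MS_FLUSHONFSYNC name comes from
this series):

#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/parser.h>

/*
 * Purely illustrative -- not code from the patches.  One way a filesystem
 * could map a "flushonfsync" option string onto the proposed
 * MS_FLUSHONFSYNC superblock flag.
 */
enum { Opt_flushonfsync, Opt_err };

static const match_table_t example_tokens = {
	{ Opt_flushonfsync, "flushonfsync" },
	{ Opt_err, NULL }
};

static int example_parse_option(struct super_block *sb, char *option)
{
	substring_t args[MAX_OPT_ARGS];

	switch (match_token(option, example_tokens, args)) {
	case Opt_flushonfsync:
		sb->s_flags |= MS_FLUSHONFSYNC;	/* request flush on fsync */
		return 0;
	default:
		return -EINVAL;
	}
}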
By the way, apart from the issue of whether flushonfsync should be a
mount option or a sysfs tunable, is there any other issue with the patches?
Jan, have I addressed all your concerns?
Regards,
Fernando