[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20090119120349.GA10193@duck.suse.cz>
Date: Mon, 19 Jan 2009 13:03:49 +0100
From: Jan Kara <jack@...e.cz>
To: Fernando Luis Vázquez Cao
<fernando@....ntt.co.jp>
Cc: Theodore Tso <tytso@....EDU>, Alan Cox <alan@...rguk.ukuu.org.uk>,
Pavel Machek <pavel@...e.cz>,
kernel list <linux-kernel@...r.kernel.org>,
Jens Axboe <jens.axboe@...cle.com>, sandeen@...hat.com,
fernando@....ac.jp
Subject: Re: ext3: call blkdev_issue_flush on fsync
On Sat 17-01-09 19:00:49, Fernando Luis Vázquez Cao wrote:
> On Sat, 2009-01-17 at 18:47 +0900, Fernando Luis Vázquez Cao wrote:
> > On Fri, 2009-01-16 at 17:30 +0100, Jan Kara wrote:
> > > On Fri 16-01-09 22:55:01, Fernando Luis Vázquez Cao wrote:
> > > > To ensure that bits are truly on-disk after an fsync or fdatasync, we
> > > > should force a disk flush explicitly when there is dirty data/metadata
> > > > and the journal didn't emit a write barrier (either because metadata is
> > > > not being synched or barriers are disabled).
> > > >
> > > > Signed-off-by: Fernando Luis Vazquez Cao <fernando@....ntt.co.jp>
> > > > ---
> > > Only two minor nits:
> > >
> > > > --- linux-2.6.29-rc1-orig/fs/ext3/fsync.c 2008-12-25 08:26:37.000000000 +0900
> > > > +++ linux-2.6.29-rc1/fs/ext3/fsync.c 2009-01-16 22:18:53.000000000 +0900
> > > > @@ -27,6 +27,7 @@
> > > > #include <linux/sched.h>
> > > > #include <linux/writeback.h>
> > > > #include <linux/jbd.h>
> > > > +#include <linux/blkdev.h>
> > > > #include <linux/ext3_fs.h>
> > > > #include <linux/ext3_jbd.h>
> > > >
> > > > @@ -45,6 +46,8 @@
> > > > int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
> > > > {
> > > > struct inode *inode = dentry->d_inode;
> > > > + journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
> > > > + unsigned long i_state = inode->i_state;
> > > > int ret = 0;
> > > >
> > > > J_ASSERT(ext3_journal_current_handle() == NULL);
> > > > @@ -69,23 +72,33 @@ int ext3_sync_file(struct file * file, s
> > > > */
> > > > if (ext3_should_journal_data(inode)) {
> > > > ret = ext3_force_commit(inode->i_sb);
> > > > + if (!(journal->j_flags & JFS_BARRIER))
> > > > + goto no_journal_barrier;
> > > > goto out;
> > > > }
> > > >
> > > > - if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> > > > - goto out;
> > > > + if (datasync && !(i_state & I_DIRTY_DATASYNC))
> > > > + goto flush_blkdev;
> > > >
> > > > /*
> > > > * The VFS has written the file data. If the inode is unaltered
> > > > * then we need not start a commit.
> > > > */
> > > > - if (inode->i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > > + if (i_state & (I_DIRTY_SYNC|I_DIRTY_DATASYNC)) {
> > > > struct writeback_control wbc = {
> > > > .sync_mode = WB_SYNC_ALL,
> > > > .nr_to_write = 0, /* sys_fsync did this */
> > > > };
> > > > ret = sync_inode(inode, &wbc);
> > > > + if (journal && !(journal->j_flags & JFS_BARRIER))
> > > > + goto no_journal_barrier;
> > > I cannot imagine "journal" will be NULL here.
> >
> > I'll try to check whether that is always so just in case.
> >
> > > And we can also optimize here a bit and do "goto out" because here
> > > we know the barrier has been issued.
> >
> > Yep, I was considering the same optimization. By the way, I was
> > wondering if we should honor ext3 and ext4's "barrier" mount option for
> > sys_fsync()/sys_fdatasync() and do not force a flush when "barrier=1".
>
> The last phrase should read " do not force a flush when "barrier=0" ".
I was hesitating about this a bit. But I don't think so. The reason is
that POSIX (or any other reasonable specification) mandates that fsync()
should return only after the data is safely on storage. So if we don't
flush blockdevice caches, we effectively violate POSIX and we should never
do that. With barriers the matter is a bit different - that is just a
filesystem specific thing, no standard guarantees anything.
Honza
--
Jan Kara <jack@...e.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists