[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Message-ID: <20090420155010.GG14699@duck.suse.cz>
Date: Mon, 20 Apr 2009 17:50:11 +0200
From: Jan Kara <jack@...e.cz>
To: Chris Mason <chris.mason@...cle.com>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
Theodore Ts'o <tytso@....edu>,
Linux Kernel Developers List <linux-kernel@...r.kernel.org>,
Ext4 Developers List <linux-ext4@...r.kernel.org>
Subject: Re: [PATCH] Add ext3 data=guarded mode
On Mon 20-04-09 10:58:10, Chris Mason wrote:
> > > > 3) Currently truncate() does filemap_write_and_wait() - is it really
> > > > needed? Each guarded bh could carry with itself i_disksize it should update
> > > > to when IO is finished. Extending truncate will just update this i_disksize
> > > > at the last member of the list (or update i_disksize when the list is
> > > > empty).
> > > >
> > > > Shortening truncate will walk the list of guarded bh's, removing from
> > > > the list those beyond new i_size, then it will behave like the extending
> > > > truncate (it works even if current i_disksize is larger than new i_size).
> > > > Note, that before we get to ext3_truncate() mm walks all the pages beyond
> > > > i_size and waits for page writeback so by the time ext3_truncate() is
> > > > called, all the IO is finished and dirty pages are canceled.
> > >
> > > The problem here was the disk i_size being updated by ext3_setattr
> > > before the vmtruncate calls calls ext3_truncate(). So the guarded IO
> > > might wander in and change the i_disksize update done by setattr.
> > >
> > > It all made me a bit dizzy and I just tossed the write_and_wait in
> > > instead.
> > >
> > > At the end of the day, we're waiting for guarded writes only, and we
> > > probably would have ended up waiting on those exact same pages in
> > > vmtruncate anyway. So, I do agree we could avoid the write with more
> > > code, but is this really a performance critical section?
> > Well, not really critical but also not negligible - mainly because with
> > your approach we end up *submitting* new writes we could just be canceled
> > otherwise. Without fdatawrite(), data of short-lived files need not ever
> > reach the disk similarly as in writeback mode (OK, this is connected with
> > the fact that you actually don't have fdatawrite() before ext3_truncate()
> > in ext3_delete_inode() and that's what initially puzzled me).
>
> When we're going down to zero, we don't need it. The i_disksize gets
> updated again by ext3_truncate. I'll toss in a special case for that
> before the write_and_wait.
I'm sorry but why truncate to zero does not need it? If we assume that
IO completion can still happen while ext3_truncate() is running which is
what you're afraid of, then I don't see a big difference between truncate
to zero, truncate to i_disksize (which is from where you do fdatawrite) or
truncate to anything else.
Also two other comments:
...
@@ -915,14 +1042,19 @@ int ext3_get_blocks_handle(handle_t *handle, struct
inode *inode,
* i_disksize growing is protected by truncate_mutex. Don't forget
* to
* protect it if you're about to implement concurrent
* ext3_get_block() -bzzz
+ *
+ * FIXME, I think this only needs to extend the disk i_size when
+ * we're filling holes that came from using ftruncate to increase
+ * i_size. Need to verify.
*/
- if (!err && extend_disksize && inode->i_size > ei->i_disksize)
- ei->i_disksize = inode->i_size;
+ if (!ext3_should_guard_data(inode) && !err && extend_disksize)
+ maybe_update_disk_isize(inode, inode->i_size);
This is kind of confusing. extend_disksize is used only from ext3_getblk()
which is called only for directories so the first condition is always true
- and if it wasn't sometime in future, you'd have a hard time tracking why
i_disksize is not updated. So I'd rather add something like
WARN_ON(extend_disksize && ext3_should_guard_data(inode));
if you wish to keep the check.
Also I think you can have races between direct IO writing to the file
(updating i_disksize) and your completion handler updating i_disksize -
direct IO plays tricks with i_disksize to truncate allocated blocks in case
of failed write... It's all nasty ;( Probably we should somehow set clear
rules about i_disksize updates and clean up the code to obey them.
Otherwise we'll be hunting nasty races another two years...
Honza
--
Jan Kara <jack@...e.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists