Message-Id: <1241014093.20099.28.camel@think.oraclecorp.com>
Date: Wed, 29 Apr 2009 10:08:13 -0400
From: Chris Mason <chris.mason@...cle.com>
To: Jan Kara <jack@...e.cz>
Cc: Linus Torvalds <torvalds@...ux-foundation.org>,
"Theodore Ts'o" <tytso@....edu>,
Linux Kernel Developers List <linux-kernel@...r.kernel.org>,
Ext4 Developers List <linux-ext4@...r.kernel.org>,
Mike Galbraith <efault@....de>
Subject: Re: [PATCH RFC] ext3 data=guarded v5
On Wed, 2009-04-29 at 10:56 +0200, Jan Kara wrote:
> > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > index fcfa243..1e90107 100644
> > --- a/fs/ext3/inode.c
> > +++ b/fs/ext3/inode.c
> > @@ -38,6 +38,7 @@
> > #include <linux/bio.h>
> > #include <linux/fiemap.h>
> > #include <linux/namei.h>
> > +#include <linux/workqueue.h>
> > #include "xattr.h"
> > #include "acl.h"
> >
> > @@ -179,6 +180,105 @@ static int ext3_journal_test_restart(handle_t *handle, struct inode *inode)
> > }
> >
> > /*
> > + * after a data=guarded IO is done, we need to update the
> > + * disk i_size to reflect the data we've written. If there are
> > + * no more ordered data extents left in the tree, we need to
> ^^^^^^^^ the list
> > + * get rid of the orphan entry making sure the file's
> > + * block pointers match the i_size after a crash
> > + *
> > + * When we aren't in data=guarded mode, this just does an ext3_orphan_del.
> > + *
> > + * It returns the result of ext3_orphan_del.
> > + *
> > + * handle may be null if we are just cleaning up the orphan list in
> > + * memory.
> > + *
> > + * pass must_log == 1 when the inode must be logged in order to get
> > + * an i_size update on disk
> > + */
> > +static int ordered_orphan_del(handle_t *handle, struct inode *inode,
> > + int must_log)
> > +{
> I'm afraid this function is racy.
> 1) We probably need i_mutex to protect against unlink happening in parallel
> (after we check i_nlink but before we call ext3_orphan_del).
This would mean IO completion (clearing PG_writeback) would have to wait
on the inode mutex, which we can't quite do in O_SYNC and O_DIRECT.
But, what I can do is check i_nlink after the ext3_orphan_del call and
put the inode back on the orphan list if it has gone to zero.
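To make that recheck concrete, here's a minimal userspace sketch of the idea (the struct and helper names are made up for illustration, not the real ext3 code):

```c
#include <assert.h>

/*
 * Sketch of the recheck described above: drop the orphan entry first,
 * then look at i_nlink again and put the inode back on the orphan list
 * if an unlink raced in.  "fake_inode" and the helpers are hypothetical.
 */
struct fake_inode {
	int i_nlink;
	int on_orphan_list;
};

static void orphan_del(struct fake_inode *inode)
{
	inode->on_orphan_list = 0;
}

static void orphan_add(struct fake_inode *inode)
{
	inode->on_orphan_list = 1;
}

/* returns 1 if the inode had to be put back on the orphan list */
static int ordered_orphan_del_recheck(struct fake_inode *inode)
{
	orphan_del(inode);
	/*
	 * An unlink may have dropped i_nlink to zero between the caller's
	 * check and the orphan_del above; re-adding the orphan entry keeps
	 * the unlinked inode recoverable after a crash.
	 */
	if (inode->i_nlink == 0) {
		orphan_add(inode);
		return 1;
	}
	return 0;
}
```

The point is that the recheck happens after the removal, so the window Jan describes closes without taking i_mutex in the IO completion path.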
> 2) We need superblock lock for the check list_empty(&EXT3_I(inode)->i_orphan).
How about I take the guarded spinlock when doing the list_add instead?
I'm trying to avoid the superblock lock as much as I can.
> 3) The function should rather have name ext3_guarded_orphan_del()... At
> least "ordered" is really confusing (that's the case for a few other
> structs / variables as well).
My long-term plan was to replace ordered with guarded, but I can rename
this one to guarded if you think it'll make things clearer.
> > +/*
> > + * Wrapper around ordered_orphan_del that starts a transaction
> > + */
> > +static void ordered_orphan_del_trans(struct inode *inode, int must_log)
> > +{
> This function is going to be used only from one place, so consider
> opencoding it. I don't have a strong opinion though.
Yeah, I think it keeps the code a little more readable to have it
separate; gcc will inline the thing for us anyway.
> > + *
> > + * extend_disksize is only called for directories, and so
> > + * the are not using guarded buffer protection.
> ^^^ The sentence is strange...
Thanks
> > */
> > - if (!err && extend_disksize && inode->i_size > ei->i_disksize)
> > - ei->i_disksize = inode->i_size;
> > + if (!err && extend_disksize)
> > + maybe_update_disk_isize(inode, inode->i_size);
> So do we really need to take the ordered lock for directories? We could
> just leave above two lines as they were.
Good point
>
> > mutex_unlock(&ei->truncate_mutex);
> > if (err)
> > goto cleanup;
> >
> > set_buffer_new(bh_result);
> > + set_buffer_datanew(bh_result);
> > got_it:
> > map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
> > if (count > blocks_to_boundary)
> > @@ -1079,6 +1210,77 @@ struct buffer_head *ext3_bread(handle_t *handle, struct inode *inode,
> > return NULL;
> > }
> >
> > +/*
> > + * data=guarded updates are handled in a workqueue after the IO
> > + * is done. This runs through the list of buffer heads that are pending
> > + * processing.
> > + */
> > +void ext3_run_guarded_work(struct work_struct *work)
> > +{
> > + struct ext3_sb_info *sbi =
> > + container_of(work, struct ext3_sb_info, guarded_work);
> > + struct buffer_head *bh;
> > + struct ext3_ordered_extent *ordered;
> > + struct inode *inode;
> > + struct page *page;
> > + int must_log;
> > +
> > + spin_lock_irq(&sbi->guarded_lock);
> > + while (!list_empty(&sbi->guarded_buffers)) {
> > + ordered = list_entry(sbi->guarded_buffers.next,
> > + struct ext3_ordered_extent, work_list);
> > +
> > + list_del(&ordered->work_list);
> > +
> > + bh = ordered->end_io_bh;
> > + ordered->end_io_bh = NULL;
> > + must_log = 0;
> > +
> > + /* we don't need a reference on the buffer head because
> > + * it is locked until the end_io handler is called.
> > + *
> > + * This means the page can't go away, which means the
> > + * inode can't go away
> > + */
> > + spin_unlock_irq(&sbi->guarded_lock);
> > +
> > + page = bh->b_page;
> > + inode = page->mapping->host;
> > +
> > + ext3_ordered_lock(inode);
> > + if (ordered->bh) {
> > + /*
> > + * someone might have decided this buffer didn't
> > + * really need to be ordered and removed us from
> > + * the list. They set ordered->bh to null
> > + * when that happens.
> > + */
> > + ext3_remove_ordered_extent(inode, ordered);
> > + must_log = ext3_ordered_update_i_size(inode);
> > + }
> > + ext3_ordered_unlock(inode);
> > +
> > + /*
> > + * drop the reference taken when this ordered extent was
> > + * put onto the guarded_buffers list
> > + */
> > + ext3_put_ordered_extent(ordered);
> > +
> > + /*
> > + * maybe log the inode and/or cleanup the orphan entry
> > + */
> > + ordered_orphan_del_trans(inode, must_log > 0);
> > +
> > + /*
> > + * finally, call the real bh end_io function to do
> > + * all the hard work of maintaining page writeback.
> > + */
> > + end_buffer_async_write(bh, buffer_uptodate(bh));
> > + spin_lock_irq(&sbi->guarded_lock);
> > + }
> > + spin_unlock_irq(&sbi->guarded_lock);
> > +}
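The drain loop above follows a common pattern: pop one entry with the lock held, drop the lock for the heavy per-item work, retake it for the next entry. A userspace sketch with made-up names (pthread mutex standing in for the irq-safe spinlock):

```c
#include <pthread.h>
#include <stddef.h>

/*
 * Hypothetical illustration of the guarded_buffers drain loop: the
 * list lock is held only while manipulating the list, never while
 * processing an item.
 */
struct work_item {
	struct work_item *next;
	int value;
};

struct drain_ctx {
	pthread_mutex_t lock;
	struct work_item *head;
};

/* returns the sum of all processed item values */
static int drain_all(struct drain_ctx *ctx)
{
	int processed = 0;

	pthread_mutex_lock(&ctx->lock);
	while (ctx->head) {
		struct work_item *item = ctx->head;

		ctx->head = item->next;
		/* per-item work runs without the list lock held */
		pthread_mutex_unlock(&ctx->lock);
		processed += item->value;
		pthread_mutex_lock(&ctx->lock);
	}
	pthread_mutex_unlock(&ctx->lock);
	return processed;
}
```

Dropping the lock per item matters here because the real work (orphan handling, end_io) can sleep or take other locks, which you can't do under a spin_lock_irq.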
> > +
> > static int walk_page_buffers( handle_t *handle,
> > struct buffer_head *head,
> > unsigned from,
> > @@ -1185,6 +1387,7 @@ retry:
> > ret = walk_page_buffers(handle, page_buffers(page),
> > from, to, NULL, do_journal_get_write_access);
> > }
> > +
> > write_begin_failed:
> > if (ret) {
> > /*
> > @@ -1212,7 +1415,13 @@ out:
> >
> > int ext3_journal_dirty_data(handle_t *handle, struct buffer_head *bh)
> > {
> > - int err = journal_dirty_data(handle, bh);
> > + int err;
> > +
> > + /* don't take buffers from the data=guarded list */
> > + if (buffer_dataguarded(bh))
> > + return 0;
> > +
> > + err = journal_dirty_data(handle, bh);
> But this has a problem: if we do an extending write (like from pos 1024
> to pos 2048) and then do a write from 0 to 1024 and we hit the window while
> the buffer is on the work queue list, we won't order this write. Probably
> we don't care but I wanted to note this...
Yeah, in this case the guarded IO should protect i_size, and this write
won't really be ordered. The block could have zeros from 0-1024 if we
crash.
>
> > if (err)
> > ext3_journal_abort_handle(__func__, __func__,
> > bh, handle, err);
> > @@ -1231,6 +1440,89 @@ static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
> > return 0;
> > }
> >
> > +/*
> > + * Walk the buffers in a page for data=guarded mode. Buffers that
> > + * are not marked as datanew are ignored.
> > + *
> > + * New buffers outside i_size are sent to the data guarded code
> > + *
> > + * We must do the old data=ordered mode when filling holes in the
> > + * file, since i_size doesn't protect these at all.
> > + */
> > +static int journal_dirty_data_guarded_fn(handle_t *handle,
> > + struct buffer_head *bh)
> > +{
> > + u64 offset = page_offset(bh->b_page) + bh_offset(bh);
> > + struct inode *inode = bh->b_page->mapping->host;
> > + int ret = 0;
> > +
> > + /*
> > + * Write could have mapped the buffer but it didn't copy the data in
> > + * yet. So avoid filing such buffer into a transaction.
> > + */
> > + if (!buffer_mapped(bh) || !buffer_uptodate(bh))
> > + return 0;
> > +
> > + if (test_clear_buffer_datanew(bh)) {
> Hmm, if we just extend the file inside the block (e.g. from 100 bytes to
> 500 bytes), then we won't do the write guarded. But then if we crash before
> the block really gets written, user will see zeros at the end of file
> instead of data...
You see something like this:
create(file)
write(file, 100 bytes) # create guarded IO
fsync(file)
write(file, 400 more bytes) # buffer isn't guarded, i_size goes to 500
> I don't think we should let this happen so I'd think we
> have to guard all the extending writes regardless whether they allocate new
> block or not.
My main concern was avoiding stale data from the disk after a crash,
zeros from partially written blocks are not as big a problem. But,
you're right that we can easily avoid this, so I'll update the patch to
do all extending writes as guarded.
> Which probably makes the buffer_datanew() flag unnecessary
> because we just guard all the buffers from max(start of write, i_size) to
> end of write.
But, we still want buffer_datanew to decide when writes that fill holes
should go through data=ordered.
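Jan's "guard from max(start of write, i_size) to end of write" rule reduces to a small range computation. A sketch, with a hypothetical helper name that isn't in the patch:

```c
typedef unsigned long long u64;

static u64 max_u64(u64 a, u64 b)
{
	return a > b ? a : b;
}

/*
 * Illustrative only: how many bytes of a write at [pos, pos + len)
 * extend the file past the current i_size and therefore need to be
 * guarded before the on-disk i_size can move.
 */
static u64 guarded_len(u64 pos, u64 len, u64 i_size)
{
	u64 end = pos + len;
	u64 start = max_u64(pos, i_size);

	/* writes entirely below i_size never extend the file */
	if (end <= start)
		return 0;
	return end - start;
}
```

In the 100-to-500-byte example from the thread, pos 100 and len 400 against an i_size of 100 gives 400 guarded bytes, so the extension waits for IO completion even though no new block is allocated.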
> > +/*
> > + * Walk the buffers in a page for data=guarded mode for writepage.
> > + *
> > + * We must do the old data=ordered mode when filling holes in the
> > + * file, since i_size doesn't protect these at all.
> > + *
> > + * This is actually called after writepage is run and so we can't
> > + * trust anything other than the buffer head (which we have pinned).
> > + *
> > + * Any datanew buffer at writepage time is filling a hole, so we don't need
> > + * extra tests against the inode size.
> > + */
> > +static int journal_dirty_data_guarded_writepage_fn(handle_t *handle,
> > + struct buffer_head *bh)
> > +{
> > + int ret = 0;
> > +
> > + /*
> > + * Write could have mapped the buffer but it didn't copy the data in
> > + * yet. So avoid filing such buffer into a transaction.
> > + */
> > + if (!buffer_mapped(bh) || !buffer_uptodate(bh))
> > + return 0;
> > +
> > + if (test_clear_buffer_datanew(bh))
> > + ret = ext3_journal_dirty_data(handle, bh);
> > + return ret;
> > +}
> > +
> Hmm, here we use the datanew flag as well. But it's probably not worth
> keeping it just for this case. Ordering data in all cases when we get here
> should be fine since if the block is already allocated we should not get
> here (unless somebody managed to strip buffers from the page but kept the
> page but that should be rare enough).
>
I'd keep it for the common case of filling holes with write(), so then
the code in writepage is gravy.
> > @@ -1300,6 +1590,68 @@ static int ext3_ordered_write_end(struct file *file,
> > return ret ? ret : copied;
> > }
> >
> > +static int ext3_guarded_write_end(struct file *file,
> > + struct address_space *mapping,
> > + loff_t pos, unsigned len, unsigned copied,
> > + struct page *page, void *fsdata)
> > +{
> > + handle_t *handle = ext3_journal_current_handle();
> > + struct inode *inode = file->f_mapping->host;
> > + unsigned from, to;
> > + int ret = 0, ret2;
> > +
> > + copied = block_write_end(file, mapping, pos, len, copied,
> > + page, fsdata);
> > +
> > + from = pos & (PAGE_CACHE_SIZE - 1);
> > + to = from + copied;
> > + ret = walk_page_buffers(handle, page_buffers(page),
> > + from, to, NULL, journal_dirty_data_guarded_fn);
> > +
> > + /*
> > + * we only update the in-memory i_size. The disk i_size is done
> > + * by the end io handlers
> > + */
> > + if (ret == 0 && pos + copied > inode->i_size) {
> > + int must_log;
> > +
> > + /* updated i_size, but we may have raced with a
> > + * data=guarded end_io handler.
> > + *
> > + * All the guarded IO could have ended while i_size was still
> > + * small, and if we're just adding bytes into an existing block
> > + * in the file, we may not be adding a new guarded IO with this
> > + * write. So, do a check on the disk i_size and make sure it
> > + * is updated to the highest safe value.
> > + *
> > + * ext3_ordered_update_i_size tests inode->i_size, so we
> > + * make sure to update it with the ordered lock held.
> > + */
> This can go away if we guard all the extending writes...
Yes, good point.
>
> > + ext3_ordered_lock(inode);
> > + i_size_write(inode, pos + copied);
> > +
> > + must_log = ext3_ordered_update_i_size(inode);
> > + ext3_ordered_unlock(inode);
> > + ordered_orphan_del_trans(inode, must_log > 0);
> In case this needs to stay, here we have a transaction started so why not
> just directly call ordered_orphan_del()?
>
Thanks
> > @@ -1747,7 +2238,14 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
> > goto out;
> > }
> > orphan = 1;
> > - ei->i_disksize = inode->i_size;
> > + /* in guarded mode, other code is responsible
> > + * for updating i_disksize. Actually in
> > + * every mode, ei->i_disksize should be correct,
> > + * so I don't understand why it is getting updated
> > + * to i_size here.
> > + */
> > + if (!ext3_should_guard_data(inode))
> > + ei->i_disksize = inode->i_size;
> Hmm, true. When we acquire i_mutex, i_size should be equal to i_disksize
> so this seems rather pointless. Probably worth a separate patch to remove
> it...
Yeah, I didn't want to go around messing with O_DIRECT in this
patchset ;)
>
> > ext3_journal_stop(handle);
> > }
> > }
> > @@ -1768,11 +2266,20 @@ static ssize_t ext3_direct_IO(int rw, struct kiocb *iocb,
> > ret = PTR_ERR(handle);
> > goto out;
> > }
> > +
> > if (inode->i_nlink)
> > - ext3_orphan_del(handle, inode);
> > + ordered_orphan_del(handle, inode, 0);
> > +
> > if (ret > 0) {
> > loff_t end = offset + ret;
> > if (end > inode->i_size) {
> > + /* i_mutex keeps other file writes from
> > + * hopping in at this time, and we
> > + * know the O_DIRECT write just put all
> > + * those blocks on disk. So, we can
> > + * safely update i_disksize here even
> > + * in guarded mode
> > + */
> Not quite - there could be guarded blocks before the place where we did
> O_DIRECT write and we need to wait for them...
Hmmm, O_DIRECT only waits on the blocks it actually wrote, doesn't it?
Good point, will fix.
-chris