Message-ID: <20121004102210.GD4641@quack.suse.cz>
Date: Thu, 4 Oct 2012 12:22:10 +0200
From: Jan Kara <jack@...e.cz>
To: Dmitry Monakhov <dmonakhov@...nvz.org>
Cc: Jan Kara <jack@...e.cz>, linux-ext4@...r.kernel.org, tytso@....edu,
lczerner@...hat.com
Subject: Re: [PATCH 04/11] ext4: completed_io locking cleanup V4
On Wed 03-10-12 15:21:25, Dmitry Monakhov wrote:
> On Tue, 2 Oct 2012 15:30:19 +0200, Jan Kara <jack@...e.cz> wrote:
> > On Tue 02-10-12 16:42:39, Dmitry Monakhov wrote:
> > > On Tue, 2 Oct 2012 13:11:06 +0200, Jan Kara <jack@...e.cz> wrote:
> > > > On Tue 02-10-12 14:57:22, Dmitry Monakhov wrote:
> > > > > On Tue, 2 Oct 2012 12:31:41 +0200, Jan Kara <jack@...e.cz> wrote:
> > > > > > On Tue 02-10-12 11:16:38, Dmitry Monakhov wrote:
> > > > > > > > > > +	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > > > > > > > > > +	while (!list_empty(&complete)) {
> > > > > > > > > > +		io = list_entry(complete.next, ext4_io_end_t, list);
> > > > > > > > > > +		io->flag &= ~EXT4_IO_END_UNWRITTEN;
> > > > > > > > > > +		/* end_io context can not be destroyed now because it still
> > > > > > > > > > +		 * used by queued worker. Worker thread will destroy it later */
> > > > > > > > > > +		if (io->flag & EXT4_IO_END_QUEUED)
> > > > > > > > > > +			list_del_init(&io->list);
> > > > > > > > > > +		else
> > > > > > > > > > +			list_move(&io->list, &to_free);
> > > > > > > > > > +	}
> > > > > > > > > > +	/* If we are called from worker context, it is time to clear queued
> > > > > > > > > > +	 * flag, and destroy it's end_io if it was converted already */
> > > > > > > > > > +	if (work_io) {
> > > > > > > > > > +		work_io->flag &= ~EXT4_IO_END_QUEUED;
> > > > > > > > > > +		if (!(work_io->flag & EXT4_IO_END_UNWRITTEN))
> > > > > > > > > > +			list_add_tail(&work_io->list, &to_free);
> > > > > > > > > >  	}
> > > > > > > > > > -	list_del_init(&io->list);
> > > > > > > > > >  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > > > > > > > > > -	(void) ext4_end_io_nolock(io);
> > > > > > > > > > -	mutex_unlock(&inode->i_mutex);
> > > > > > > > > > -free:
> > > > > > > > > > -	ext4_free_io_end(io);
> > > > > > > > > > +
> > > > > > > > > > +	while (!list_empty(&to_free)) {
> > > > > > > > > > +		io = list_entry(to_free.next, ext4_io_end_t, list);
> > > > > > > > > > +		list_del_init(&io->list);
> > > > > > > > > > +		ext4_free_io_end(io);
> > > > > > > > > > +	}
> > > > > > > > > > +	return ret;
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +/*
> > > > > > > > > > + * work on completed aio dio IO, to convert unwritten extents to extents
> > > > > > > > > > + */
> > > > > > > > > > +static void ext4_end_io_work(struct work_struct *work)
> > > > > > > > > > +{
> > > > > > > > > > +	ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
> > > > > > > > > > +	ext4_do_flush_completed_IO(io->inode, io);
> > > > > > > > > > +}
> > > > > > > > > > +
> > > > > > > > > > +int ext4_flush_completed_IO(struct inode *inode)
> > > > > > > > > > +{
> > > > > > > > > > +	return ext4_do_flush_completed_IO(inode, NULL);
> > > > > > > > > >  }
> > > > > > > > Also it seems that when ext4_flush_completed_IO() is called, workqueue
> > > > > > > > can have several IO structures queued in its local lists thus we miss them
> > > > > > > > here and don't properly wait for all conversions?
> > > > > > > > No, that cannot happen, because the list is drained atomically and
> > > > > > > > add_complete_io will queue work only if the list is empty.
> > > > > > > >
> > > > > > > > A race between conversion and the dequeue process is not possible because
> > > > > > > > we hold the lock during the entire walk of complete_list, so from an
> > > > > > > > external point of view marking the list as converted (clearing the
> > > > > > > > unwritten flag) happens atomically. I've drawn all possible situations and
> > > > > > > > the race does not happen. If you know of any, please let me know.
> > > > > > I guess I'm missing something obvious. So let's go step by step:
> > > > > > 1) ext4_flush_completed_IO() must make sure there is no outstanding
> > > > > > conversion for the inode.
> > > > > > 2) Now assume we have non-empty i_completed_io_list - thus work is queued.
> > > > > > 3) The following situation seems to be possible:
> > > > > >
> > > > > > > CPU1                                    CPU2
> > > > > > > (worker thread)                         (truncate)
> > > > > > > ext4_end_io_work()
> > > > > > >   ext4_do_flush_completed_IO()
> > > > > > >     spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > > > > > >     dump_completed_IO(inode);
> > > > > > >     list_replace_init(&ei->i_completed_io_list, &unwritten);
> > > > > > >     spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > > > > > >
> > > > > > >                                         ext4_flush_completed_IO()
> > > > > > >                                           ext4_do_flush_completed_IO()
> > > > > > >                                             - sees empty i_completed_io_list
> > > > > > >                                               => exits
> > > > > >
> > > > > > But we still have some conversions pending in 'unwritten' list. What am
> > > > > > I missing?
> > > > > > Indeed, I simply missed that case, which silently breaks
> > > > > > integrity sync ;(
> > > > > > Thank you for spotting this. I'll be back with an updated version.
> > > > Umm, actually, I was thinking about it and ext4_flush_completed_IO()
> > > > seems to be unnecessary in fsync these days. We don't call aio_complete()
> > > > until we perform the conversion so what fsync does to such IO is undefined.
> > > > Such optimization is a separate matter though.
> > > Yes aio is ok, but integrity fsync after buffered write to unwritten
> > > extent is broken.
> > >
> > > > fsync()                                   blkdev_completion     kwork
> > > > ->filemap_write_and_wait_range
> > > >                                           ->ext4_end_bio
> > > >                                             ->end_page_writeback
> > > > <-- filemap_write_and_wait_range return
> > > >                                           ->ext4_add_complete_io
> > > >
> > > >                                                                 ->ext4_do_flush_completed_IO
> > > >                                                                   ->list_replace_init
> > > > ->ext4_flush_completed_IO
> > > >   sees empty i_completed_io_list but a pending
> > > >   conversion still exists
> > > >                                                                 ->ext4_end_io
> > >
> > Correct. Thanks for pointing that out.
> In my defense I should say that integrity fsync was broken before my patch in
> the case of buffered writes, because end_page_writeback is called before the
> end_io is added to ei->i_completed_io_list:
> fsync()                                   blkdev_completion
> ->filemap_write_and_wait_range
>                                           ->ext4_end_bio
>                                             ->end_page_writeback
> <-- filemap_write_and_wait_range return
> ->ext4_flush_completed_IO
>   sees empty i_completed_io_list but a pending
>   conversion still exists
>                                           ->ext4_add_complete_io
Right. Actually, calling end_page_writeback() before we are sure the page can
be correctly reloaded from disk (i.e. before all extent manipulations are
done) is asking for trouble - see e.g. mail
http://lists.openwall.net/linux-ext4/2011/06/08/12 and further discussion.
The discussion was somewhat open-ended, but at that time calling
end_page_writeback() after extent conversion was problematic because of
i_mutex. Now we don't need i_mutex for extent conversion, so it is safe to
call end_page_writeback() after we convert the extents. So moving
end_page_writeback() there would be good and, I believe, it would simplify
some logic as well - in particular fsync() would be simpler.
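
To make the ordering argument concrete, here is a toy userspace model in C.
It is only a sketch, not kernel code: struct toy_io, enum toy_step and the
toy_* helpers are names invented for illustration and do not exist in ext4.
It assumes an fsync() that waits only for the writeback state to clear, and
asks whether such an fsync() could return while the extent is still
unconverted, for each of the two orderings of the completion steps.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these names are illustrative, not ext4 symbols. */
struct toy_io {
	bool page_writeback;	/* what an fsync waiting on writeback observes */
	bool extent_converted;	/* has the unwritten extent been converted? */
};

enum toy_step { TOY_END_WRITEBACK, TOY_CONVERT_EXTENT };

/* Apply one completion step to the modelled IO. */
static void toy_apply(struct toy_io *io, enum toy_step step)
{
	if (step == TOY_END_WRITEBACK)
		io->page_writeback = false;
	else
		io->extent_converted = true;
}

/*
 * Replay a two-step completion sequence. After every step, check whether
 * an fsync that waits only for writeback to end could return although the
 * extent conversion has not happened yet. Returns true if such a window
 * exists for the given ordering.
 */
static bool toy_fsync_can_return_early(const enum toy_step order[2])
{
	struct toy_io io = { .page_writeback = true, .extent_converted = false };

	for (int i = 0; i < 2; i++) {
		toy_apply(&io, order[i]);
		if (!io.page_writeback && !io.extent_converted)
			return true;
	}
	return false;
}

/* Ordering discussed as problematic: writeback ends before conversion. */
static const enum toy_step toy_current_order[2] = {
	TOY_END_WRITEBACK, TOY_CONVERT_EXTENT
};

/* Ordering suggested above: convert first, then end writeback. */
static const enum toy_step toy_proposed_order[2] = {
	TOY_CONVERT_EXTENT, TOY_END_WRITEBACK
};
```

In this model toy_fsync_can_return_early(toy_current_order) reports a
window and toy_fsync_can_return_early(toy_proposed_order) does not. The
model ignores locking, the workqueue and the completed-IO list entirely; it
only shows why the order of the two steps matters to a waiter that looks at
the writeback state alone.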
Honza
--
Jan Kara <jack@...e.cz>
SUSE Labs, CR