[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20090305100000.GA29177@duck.suse.cz>
Date: Thu, 5 Mar 2009 11:00:01 +0100
From: Jan Kara <jack@...e.cz>
To: Nick Piggin <npiggin@...e.de>
Cc: linux-fsdevel@...r.kernel.org,
Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
Andrew Morton <akpm@...ux-foundation.org>,
"Jorge Boncompte [DTI2]" <jorge@...2.net>,
Adrian Hunter <ext-adrian.hunter@...ia.com>,
Jan Kara <jack@...e.cz>, stable@...nel.org
Subject: Re: [patch] fs: new inode i_state corruption fix
On Thu 05-03-09 07:45:54, Nick Piggin wrote:
>
> There was a report of a data corruption http://lkml.org/lkml/2008/11/14/121.
> There is a script included to reproduce the problem.
>
> During testing, I encountered a number of strange things with ext3, so I
> tried ext2 to attempt to reduce complexity of the problem. I found that
> fsstress would quickly hang in wait_on_inode, waiting for I_LOCK to be
> cleared, even though instrumentation showed that unlock_new_inode had
> already been called for that inode. This points to memory scribble, or
> synchronisation problme.
>
> i_state of I_NEW inodes is not protected by inode_lock because other
> processes are not supposed to touch them until I_LOCK (and I_NEW) is
> cleared. Adding WARN_ON(inode->i_state & I_NEW) to sites where we modify
> i_state revealed that generic_sync_sb_inodes is picking up new inodes
> from the inode lists and passing them to __writeback_single_inode without
> waiting for I_NEW. Subsequently modifying i_state causes corruption. In
> my case it would look like this:
Good catch.
> CPU0 CPU1
> unlock_new_inode() __sync_single_inode()
> reg <- inode->i_state
> reg -> reg & ~(I_LOCK|I_NEW) reg <- inode->i_state
> reg -> inode->i_state reg -> reg | I_SYNC
> reg -> inode->i_state
>
> Non-atomic RMW on CPU1 overwrites CPU0 store and sets I_LOCK|I_NEW again.
>
> Fix for this is rather than wait for I_NEW inodes, just skip over them:
> inodes concurrently being created are not subject to data integrity
> operations, and should not significantly contribute to dirty memory either.
>
> After this change, I'm unable to reproduce any of the added warnings or hangs
> after ~1hour of running. Previously, the new warnings would start immediately
> and hang would happen in under 5 minutes.
A quick grep seems to indicate that you've still missed a few cases,
haven't you? I still see the same problem in
drop_caches.c:drop_pagecache_sb() scanning, inode.c:invalidate_inodes()
scanning, and dquot.c:add_dquot_ref() scanning.
Otherwise the patch looks fine.
> I'm also testing on ext3 now, and so far no problems there either. I don't
> know whether this fixes the problem reported above, but it fixes a real
> problem for me.
>
> Cc: "Jorge Boncompte [DTI2]" <jorge@...2.net>
> Cc: Adrian Hunter <ext-adrian.hunter@...ia.com>
> Cc: Jan Kara <jack@...e.cz>
> Cc: stable@...nel.org
> Signed-off-by: Nick Piggin <npiggin@...e.de>
Honza
>
> Index: linux-2.6/fs/inode.c
> ===================================================================
> --- linux-2.6.orig/fs/inode.c 2009-03-05 14:08:11.000000000 +1100
> +++ linux-2.6/fs/inode.c 2009-03-05 17:20:35.000000000 +1100
> @@ -359,6 +359,7 @@ static int invalidate_list(struct list_h
> invalidate_inode_buffers(inode);
> if (!atomic_read(&inode->i_count)) {
> list_move(&inode->i_list, dispose);
> + WARN_ON(inode->i_state & I_NEW);
> inode->i_state |= I_FREEING;
> count++;
> continue;
> @@ -460,6 +461,7 @@ static void prune_icache(int nr_to_scan)
> continue;
> }
> list_move(&inode->i_list, &freeable);
> + WARN_ON(inode->i_state & I_NEW);
> inode->i_state |= I_FREEING;
> nr_pruned++;
> }
> @@ -656,6 +658,7 @@ void unlock_new_inode(struct inode *inod
> * just created it (so there can be no old holders
> * that haven't tested I_LOCK).
> */
> + WARN_ON((inode->i_state & (I_LOCK|I_NEW)) != (I_LOCK|I_NEW));
> inode->i_state &= ~(I_LOCK|I_NEW);
> wake_up_inode(inode);
> }
> @@ -1145,6 +1148,7 @@ void generic_delete_inode(struct inode *
>
> list_del_init(&inode->i_list);
> list_del_init(&inode->i_sb_list);
> + WARN_ON(inode->i_state & I_NEW);
> inode->i_state |= I_FREEING;
> inodes_stat.nr_inodes--;
> spin_unlock(&inode_lock);
> @@ -1186,16 +1190,19 @@ static void generic_forget_inode(struct
> spin_unlock(&inode_lock);
> return;
> }
> + WARN_ON(inode->i_state & I_NEW);
> inode->i_state |= I_WILL_FREE;
> spin_unlock(&inode_lock);
> write_inode_now(inode, 1);
> spin_lock(&inode_lock);
> + WARN_ON(inode->i_state & I_NEW);
> inode->i_state &= ~I_WILL_FREE;
> inodes_stat.nr_unused--;
> hlist_del_init(&inode->i_hash);
> }
> list_del_init(&inode->i_list);
> list_del_init(&inode->i_sb_list);
> + WARN_ON(inode->i_state & I_NEW);
> inode->i_state |= I_FREEING;
> inodes_stat.nr_inodes--;
> spin_unlock(&inode_lock);
> Index: linux-2.6/fs/fs-writeback.c
> ===================================================================
> --- linux-2.6.orig/fs/fs-writeback.c 2009-03-05 16:33:22.000000000 +1100
> +++ linux-2.6/fs/fs-writeback.c 2009-03-05 17:17:59.000000000 +1100
> @@ -274,6 +274,7 @@ __sync_single_inode(struct inode *inode,
> int ret;
>
> BUG_ON(inode->i_state & I_SYNC);
> + WARN_ON(inode->i_state & I_NEW);
>
> /* Set I_SYNC, reset I_DIRTY */
> dirty = inode->i_state & I_DIRTY;
> @@ -298,6 +299,7 @@ __sync_single_inode(struct inode *inode,
> }
>
> spin_lock(&inode_lock);
> + WARN_ON(inode->i_state & I_NEW);
> inode->i_state &= ~I_SYNC;
> if (!(inode->i_state & I_FREEING)) {
> if (!(inode->i_state & I_DIRTY) &&
> @@ -470,6 +472,11 @@ void generic_sync_sb_inodes(struct super
> break;
> }
>
> + if (inode->i_state & I_NEW) {
> + requeue_io(inode);
> + continue;
> + }
> +
> if (wbc->nonblocking && bdi_write_congested(bdi)) {
> wbc->encountered_congestion = 1;
> if (!sb_is_blkdev_sb(sb))
> @@ -531,7 +538,7 @@ void generic_sync_sb_inodes(struct super
> list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> struct address_space *mapping;
>
> - if (inode->i_state & (I_FREEING|I_WILL_FREE))
> + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW))
> continue;
> mapping = inode->i_mapping;
> if (mapping->nrpages == 0)
--
Jan Kara <jack@...e.cz>
SUSE Labs, CR
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists