[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20060717140751.GW2114946@melbourne.sgi.com>
Date: Tue, 18 Jul 2006 00:07:51 +1000
From: David Chinner <dgc@....com>
To: Nathan Scott <nathans@....com>
Cc: Masayuki Saito <m-saito@...s.nec.co.jp>,
David Chinner <dgc@....com>, xfs@....sgi.com,
linux-kernel@...r.kernel.org
Subject: Re: [PATCH] xfs: i_state of inode is changed after the inode is freed
On Mon, Jul 17, 2006 at 09:05:01PM +1000, Nathan Scott wrote:
> On Fri, Jul 14, 2006 at 07:25:20PM +0900, Masayuki Saito wrote:
> > Hi, Nathan.
> >
> > >I'll leave it to Dave to comment more later (he's travelling at the
> > >moment), since he's had his head deep in this area of the code most
> > >recently - but my first thoughts on your patch are that its solving
> > >the problem incorrectly. We should not be in the destroy_inode code
> > >if the inode reference counting is correct everywhere - I would have
> > >expected the fix to be a get/put style change, rather than adding an
> > >inode lock and new lock/unlock semantics around an individual field;
> > >... and if that cannot be done to fix this (eh?), then some comments
> > >as to why refcounting didn't solve the problem here.
Looking at this a bit deeper, I think the reference counting is
correct and the problem is that we are not breaking the link between
the linux inode and the xfs inode atomically w.r.t xfs_iunpin()
usage....
> > On the basis of the above, I consider the get/put style fix which use
> > i_count.
> >
> > This problem is that i_state of the inode is changed while the inode
> > is freed in xfs filesystem. And the cause is that the inode release
> > and xfs_iunpun() can run in parallel.
> >
> > To fix this problem, I added a pair of igrab()/iput() before and behind
> > mark_inode_dirty_sync() at xfs_iunpin(). I think this can change it as
> > follows.
> >
> > (1)The case that the inode release transaction runs after xfs_iunpin()
> > is called.
> > While mark_inode_dirty_sync() is running, igrab() promises that the
> > inode is alive.
> >
> > (2)The case that xfs_iunpin() is called after iput() in the inode
> > release transaction is called(i_count is 0).
> > mark_inode_dirty_sync() is not called because the igrab() can not get
> > the inode.
> >
> > I have made the following patch, but it is not yet tested.
> > I would like to hear your comment, first.
>
> If this fixes your test case, then I like the look of it. ;-)
I don't think it fixes the problem because igrab() fails to handle
the case we are hitting where I_CLEAR is set on the inode when we
mark it dirty. There's nothing in this patch preventing us from
sleeping after the !(I_NEW|I_FREEING|I_CLEAR) check is done and then
racing with generic_drop_inode() before the igrab() can take a
reference on the inode.
ATM, I think the real problem is the use of the XFS_IRECLAIMABLE
flag and the locking involved. Nathan, the inode hash lock
is used to sync that flag between xfs_iget_core() and
xfs_finish_reclaim(), but no lock is used when setting it or
(now) checking in xfs_iunpin().....
Worse, the i_flags field does not use atomic bitops and
there is no consistent locking protecting i_flags so updates
and reads of this filed can race or even get lost....
Also, I think that xfs_iunpin() must execute atomically w.r.t
xfs_reclaim(), otherwise we cannot ever safely do the checks
we are doing in xfs_iunpin().
> It does seem much simpler and less invasive than the earlier fix
> using a spinlock. I'll run with this in my testing for awhile,
> let me know how your own testing goes too, please (I especially
> would like to hear if it fixes that reproducible failure case).
I think a fix is going to be much more invasive than just adding
reference as my fixes appear to have only narrowed the race window
and not solved it. The addition of the lock in the original patch
solves the atomic xfs_iunpin()/xfs_reclaim() execution problem,
but it does not solve the problems with the i_flags field. Adding
a new lock may be our only option here.
This will have to wait until after OLS before I get a chance to
look at this further...
Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists