[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-Id: <20070929130011.c3a11139.akpm@linux-foundation.org>
Date: Sat, 29 Sep 2007 13:00:11 -0700
From: Andrew Morton <akpm@...ux-foundation.org>
To: Artem Bityutskiy <dedekind@...dex.ru>
Cc: Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: Write-back from inside FS - need suggestions
On Sat, 29 Sep 2007 22:10:42 +0300 Artem Bityutskiy <dedekind@...dex.ru> wrote:
> Andrew Morton wrote:
> > I'd have thought that a suitable wrapper around a suitably-modified
> > sync_sb_inodes() would be appropriate for both filesystems?
>
> Ok, I've modified sync_inodes_sb() so that I can pass it my own wbc,
> where I set wcb->nr_to_write = 20. It gives me _exactly_ what I want.
> It just flushes a bit more then 20 pages and returns. I use
> WB_SYNC_ALL. Great!
ok..
> Now I would like to understand why it works :-) To my surprise, it
> does not deadlock! I call it from ->prepare_write where I'm holding
> i_mutex, and it works just fine. It calls ->writepage() without trying
> to lock i_mutex! This looks like some witchcraft for me.
writepage under i_mutex is commonly done on the
sys_write->alloc_pages->direct-reclaim path. It absolutely has to work,
and you'll be fine relying upon that.
However ->prepare_write() is called with the page locked, so you are
vulnerable to deadlocks there. I suspect you got lucky because the page
which you're holding the lock on is not dirty in your testing. But in
other applications (eg: 1k blocksize ext2/3/4) the page _can_ be dirty
while we're trying to allocate more blocks for it, in which case the
lock_page() deadlock can happen.
One approach might be to add another flag to writeback_control telling
write_cache_pages() to skip locked pages. Or even put a page* into
wrietback_control and change it to skip *this* page.
> This means that if I'm in the middle of an operation or ino #X, I own
> its i_mutex, but not I_LOCK, I can be preempted and ->writepage can
> be called for a dirty page belonging to this inode #X?
yup. Or another CPU can do the same.
> I haven't seen
> this in practice and I do not believe this may happen. Why?
Perhaps a heavier workload is needed.
There is code in the VFS which tries to prevent lots of CPUs from getting
in and fighting with each other (see writeback_acquire()) which will have
the effect of serialising things for some extent. But writeback_acquire()
is causing scalability problems on monster IO systems and might be removed,
and it is only a partial thing - there are other ways in which concurrent
writeout can occur (fsync, sync, page reclaim, ...)
> Could you or someone please give me a hint what exactly
> inode->i_flags & I_LOCK protects?
err, it's basically an open-coded mutex via which one thread can get
exclusive access to some parts of an inode's internals. Perhaps it could
literally be replaced with a mutex. Exactly what I_LOCK protects has not
been documented afaik. That would need to be reverse engineered :(
> What is its relationship to i_mutex?
On a regular file i_mutex is used mainly for protection of the data part of
the file, although it gets borrowed for other things, like protecting f_pos
of all the inode's file*'s. I_LOCK is used to serialise access to a few
parts of the inode itself.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Powered by blists - more mailing lists