linux-kernel - Re: Write-back from inside FS

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-Id: <20070929130011.c3a11139.akpm@linux-foundation.org>
Date:	Sat, 29 Sep 2007 13:00:11 -0700
From:	Andrew Morton <akpm@...ux-foundation.org>
To:	Artem Bityutskiy <dedekind@...dex.ru>
Cc:	Linux Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: Write-back from inside FS - need suggestions

On Sat, 29 Sep 2007 22:10:42 +0300 Artem Bityutskiy <dedekind@...dex.ru> wrote:

> Andrew Morton wrote:
> > I'd have thought that a suitable wrapper around a suitably-modified
> > sync_sb_inodes() would be appropriate for both filesystems?
> 
> Ok, I've modified sync_inodes_sb() so that I can pass it my own wbc,
> where I set wcb->nr_to_write = 20. It gives me _exactly_ what I want.
> It just flushes a bit more then 20 pages and returns. I use
> WB_SYNC_ALL. Great!

ok..

> Now I would like to understand why it works :-) To my surprise, it
> does not deadlock! I call it from ->prepare_write where I'm holding
> i_mutex, and it works just fine. It calls ->writepage() without trying
> to lock i_mutex! This looks like some witchcraft for me.

writepage under i_mutex is commonly done on the
sys_write->alloc_pages->direct-reclaim path.  It absolutely has to work,
and you'll be fine relying upon that.

However ->prepare_write() is called with the page locked, so you are
vulnerable to deadlocks there.  I suspect you got lucky because the page
which you're holding the lock on is not dirty in your testing.  But in
other applications (eg: 1k blocksize ext2/3/4) the page _can_ be dirty
while we're trying to allocate more blocks for it, in which case the
lock_page() deadlock can happen.

One approach might be to add another flag to writeback_control telling
write_cache_pages() to skip locked pages.  Or even put a page* into
wrietback_control and change it to skip *this* page.

> This means that if I'm in the middle of an operation or ino #X, I own
> its i_mutex, but not I_LOCK, I can be preempted and ->writepage can
> be called for a dirty page belonging to this inode #X?

yup.  Or another CPU can do the same.

> I haven't seen
> this in practice and I do not believe this may happen. Why?

Perhaps a heavier workload is needed.

There is code in the VFS which tries to prevent lots of CPUs from getting
in and fighting with each other (see writeback_acquire()) which will have
the effect of serialising things for some extent.  But writeback_acquire()
is causing scalability problems on monster IO systems and might be removed,
and it is only a partial thing - there are other ways in which concurrent
writeout can occur (fsync, sync, page reclaim, ...)

> Could you or someone please give me a hint what exactly
> inode->i_flags & I_LOCK protects?

err, it's basically an open-coded mutex via which one thread can get
exclusive access to some parts of an inode's internals.  Perhaps it could
literally be replaced with a mutex.  Exactly what I_LOCK protects has not
been documented afaik.  That would need to be reverse engineered :(

> What is its relationship to i_mutex?

On a regular file i_mutex is used mainly for protection of the data part of
the file, although it gets borrowed for other things, like protecting f_pos
of all the inode's file*'s.  I_LOCK is used to serialise access to a few
parts of the inode itself.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/