Date:	Mon, 19 Dec 2011 15:22:40 +1100
From:	Dave Chinner <david@...morbit.com>
To:	Al Viro <viro@...IV.linux.org.uk>
Cc:	Andrew Morton <akpm@...ux-foundation.org>,
	Djalal Harouni <tixxdz@...ndz.org>,
	Hugh Dickins <hughd@...gle.com>,
	Minchan Kim <minchan.kim@...il.com>,
	KAMEZAWA Hiroyuki <kamezawa.hiroyu@...fujitsu.com>,
	Wu Fengguang <fengguang.wu@...el.com>, linux-mm@...ck.org,
	linux-kernel@...r.kernel.org,
	"J. Bruce Fields" <bfields@...ldses.org>,
	Neil Brown <neilb@...e.de>,
	Mikulas Patocka <mikulas@...ax.karlin.mff.cuni.cz>,
	Christoph Hellwig <hch@...radead.org>
Subject: Re: [PATCH] mm: add missing mutex lock arround notify_change

On Mon, Dec 19, 2011 at 02:03:40AM +0000, Al Viro wrote:
> On Mon, Dec 19, 2011 at 12:43:43PM +1100, Dave Chinner wrote:
> > > We have a shitload of deadlocks on very common paths with that patch.  What
> > > of the paths that do lead to file_remove_suid() without i_mutex?
> > > *	xfs_file_aio_write_checks(): we drop i_mutex (via xfs_rw_iunlock())
> > > just before calling file_remove_suid().  Racy, the fix is obvious - move
> > > file_remove_suid() call before unlocking.
> > 
> > Not exactly. xfs_rw_iunlock() is not doing what you think it's doing
> > there.....
> 
> Huh?  It is called as 
> 
> > > -	xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);
> 
> and thus in
> static inline void
> xfs_rw_iunlock(
>         struct xfs_inode        *ip,
>         int                     type)
> {
>         xfs_iunlock(ip, type);
>         if (type & XFS_IOLOCK_EXCL)
>                 mutex_unlock(&VFS_I(ip)->i_mutex);
> }
> we are guaranteed to hit i_mutex.  

XFS_ILOCK_EXCL != XFS_IOLOCK_EXCL.
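The lock flags are distinct bits, so the "type & XFS_IOLOCK_EXCL" check
in xfs_rw_iunlock() can never fire when unlocking XFS_ILOCK_EXCL. From
fs/xfs/xfs_inode.h (values from memory; the point is they don't overlap):

	#define XFS_IOLOCK_EXCL		(1<<0)
	#define XFS_IOLOCK_SHARED	(1<<1)
	#define XFS_ILOCK_EXCL		(1<<2)
	#define XFS_ILOCK_SHARED	(1<<3)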

There are 3 locks here, not 2 (outermost first):

	VFS_I(ip)->i_mutex (mutex)
	ip->i_iolock (rwsem)
	ip->i_ilock (rwsem)

xfs_ilock() maintains iolock vs ilock ordering for the whole of
XFS, while xfs_rw_ilock() maintains i_mutex vs iolock vs ilock
ordering for the IO path. It is all lockdep annotated, so if we make
a mistake, we know about it pretty quickly....
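
For reference, the lock side of the wrapper is the mirror image, taking
i_mutex first (shape of xfs_rw_ilock() from fs/xfs/xfs_file.c of this
era, lockdep subclass details elided):

	static inline void
	xfs_rw_ilock(
		struct xfs_inode	*ip,
		int			type)
	{
		if (type & XFS_IOLOCK_EXCL)
			mutex_lock(&VFS_I(ip)->i_mutex);
		xfs_ilock(ip, type);
	}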

> > Wrong lock.  That's dropping the internal XFS inode metadata lock,
> > but the VFS i_mutex is associated with the internal XFS inode IO
> > lock, which is accessed via XFS_IOLOCK_*. Only if we take the iolock
> > via XFS_IOLOCK_EXCL do we actually take the i_mutex.
> 
> > Now it gets complex. For buffered IO, we are guaranteed to already
> > be holding the i_mutex because we do:
> > 
> >         *iolock = XFS_IOLOCK_EXCL;
> >         xfs_rw_ilock(ip, *iolock);
> > 
> >         ret = xfs_file_aio_write_checks(file, &pos, &count, new_size, iolock);
> > 
> > So that is safe and non-racy right now.
> 
> No, it is not - we *drop* it before calling file_remove_suid().  Explicitly.

We drop the innermost ip->i_ilock before calling file_remove_suid().
We are still holding ip->i_iolock in EXCL mode, as well as
VFS_I(ip)->i_mutex, at this point.
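
i.e. the sequence inside xfs_file_aio_write_checks() is, annotated
(details elided):

	xfs_rw_ilock(ip, XFS_ILOCK_EXCL);	/* inner metadata lock only */
	.....
	xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);	/* drops i_ilock, not i_mutex */
	.....
	return file_remove_suid(file);		/* buffered: i_iolock and
						 * i_mutex still held */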

> Again, look at that xfs_rw_iunlock() call there - it does drop i_mutex
> (which is to say, you'd better have taken it prior to that, or you have
> far worse problems).

> 
> > For direct IO, however, we don't always take the IOLOCK exclusively.
> > Indeed, we try really, really hard not to do this so we can do
> > concurrent reads and writes to the inode, and that results
> > in a bunch of lock juggling when we actually need the IOLOCK
> > exclusive (like in xfs_file_aio_write_checks()). It sounds like we
> > need to know if we are going to have to remove the SUID bit ahead of
> > time so that we can  take the correct lock up front. I haven't
> > looked at what is needed to do that yet.
> 
> OK, I'm definitely missing something.  The very first thing
> xfs_file_aio_write_checks() does is
>         xfs_rw_ilock(ip, XFS_ILOCK_EXCL);
> which really makes me wonder how the hell does that manage to avoid an
> instant deadlock in case of call via xfs_file_buffered_aio_write()
> where we have:
>         struct address_space    *mapping = file->f_mapping;
>         struct inode            *inode = mapping->host;
>         struct xfs_inode        *ip = XFS_I(inode);
>         *iolock = XFS_IOLOCK_EXCL;
>         xfs_rw_ilock(ip, *iolock);
>         ret = xfs_file_aio_write_checks(file, &pos, &count, new_size, iolock);
> which leads to
>         struct inode            *inode = file->f_mapping->host;
>         struct xfs_inode        *ip = XFS_I(inode);
> (IOW, inode and ip are the same as in the caller) followed by
>         xfs_rw_ilock(ip, XFS_ILOCK_EXCL);
> and with both xfs_rw_ilock() calls turning into
> 	mutex_lock(&VFS_I(ip)->i_mutex);
>         xfs_ilock(ip, XFS_ILOCK_EXCL);

> we ought to deadlock on that i_mutex.  What am I missing and how do we manage
> to survive that?

Translation of the buffered write IO path locking without wrappers:

	mutex_lock(&VFS_I(ip)->i_mutex);
	down_write_nested(&ip->i_iolock);
	xfs_file_aio_write_checks() {
		down_write_nested(&ip->i_ilock);
		.....
		up_write(&ip->i_ilock);
		file_remove_suid();
	}
	generic_file_buffered_write()
	up_write(&ip->i_iolock);
	mutex_unlock(&VFS_I(ip)->i_mutex);

Locking is correctly ordered, and file_remove_suid() is called with
the i_mutex held.
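
For reference, the helper under discussion boils down to something like
this (simplified sketch of the mm/filemap.c code, with the killpriv and
error handling paths elided); the notify_change() call is exactly why
the caller needs to hold i_mutex:

	int file_remove_suid(struct file *file)
	{
		struct dentry *dentry = file->f_path.dentry;
		int kill = should_remove_suid(dentry);

		if (kill) {
			struct iattr newattrs;

			/* notify_change() expects i_mutex to be held */
			newattrs.ia_valid = ATTR_FORCE | kill;
			return notify_change(dentry, &newattrs);
		}
		return 0;
	}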

The direct IO write fast path does this:

	down_read_nested(&ip->i_iolock);
	xfs_file_aio_write_checks() {
		down_write_nested(&ip->i_ilock);
		.....
		up_write(&ip->i_ilock);
		file_remove_suid();
	}
	generic_file_direct_write()
	up_read(&ip->i_iolock);

So the direct IO fast path isn't holding the i_mutex when
file_remove_suid() is called. We do conditionally call
xfs_file_aio_write_checks() with the correct locks held if we've had
to flush the page cache, are doing unaligned IO, or are extending the
file; in those cases the locks taken and their ordering are the same
as for the buffered write IO path. We can do exactly the same relock
dance if we know up front that we have to remove the SUID bit, as
sketched below....
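
Something like the existing "flush the page cache" relock dance in the
direct IO path, i.e. (sketch only; need_remove_suid() is a hypothetical
check we'd have to add):

	if (*iolock == XFS_IOLOCK_SHARED && need_remove_suid(file)) {
		/* drop shared iolock, retake i_mutex + exclusive iolock */
		xfs_rw_iunlock(ip, *iolock);
		*iolock = XFS_IOLOCK_EXCL;
		xfs_rw_ilock(ip, *iolock);
	}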

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com