linux-ext4 - Re: Locking issue with directory renames

Message-ID: <20230302092144.yvj5rcxnbe57nqie@quack3>
Date:   Thu, 2 Mar 2023 10:21:44 +0100
From:   Jan Kara <jack@...e.cz>
To:     Dave Chinner <david@...morbit.com>
Cc:     Jan Kara <jack@...e.cz>, "Darrick J. Wong" <djwong@...nel.org>,
        Al Viro <viro@...iv.linux.org.uk>,
        linux-fsdevel@...r.kernel.org, linux-ext4@...r.kernel.org,
        Ted Tso <tytso@....edu>, linux-xfs@...r.kernel.org
Subject: Re: Locking issue with directory renames

On Thu 02-03-23 11:30:50, Dave Chinner wrote:
> On Wed, Mar 01, 2023 at 01:36:28PM +0100, Jan Kara wrote:
> > On Tue 28-02-23 12:58:07, Dave Chinner wrote:
> > > On Fri, Feb 24, 2023 at 07:46:57PM -0800, Darrick J. Wong wrote:
> > > > So xfs_dir2_sf_replace can rewrite the shortform structure (or even
> > > > convert it to block format!) while readdir is accessing it.  Or am I
> > > > missing something?
> > > 
> > > True, I missed that.
> > > 
> > > Hmmmm. ISTR that holding ILOCK over filldir callbacks causes
> > > problems with lock ordering[1], and that's why we removed the ILOCK
> > > from the getdents path in the first place and instead relied on the
> > > IOLOCK being held by the VFS across readdir for exclusion against
> > > concurrent modification from the VFS.
> > > 
> > > Yup, the current code only holds the ILOCK for the extent lookup and
> > > buffer read process, it drops it while it is walking the locked
> > > buffer and calling the filldir callback. Which is why we don't hold
> > > it for xfs_dir2_sf_getdents() - the VFS is supposed to be holding
> > > i_rwsem in exclusive mode for any operation that modifies a
> > > directory entry. We should only need the ILOCK for serialising the
> > > extent tree loading, not for serialising access vs modification to
> > > the directory.
> > > 
> > > So, yeah, I think you're right, Darrick. And the fix is that the VFS
> > > needs to hold the i_rwsem correctly for all inodes that may be
> > > modified during rename...
> > 
> > But Al Viro didn't want to lock the inode in the VFS (as some filesystems
> > don't need the lock)
> 
> Was any reason given?

Kind of what I wrote above. See:

https://lore.kernel.org/all/Y8bTk1CsH9AaAnLt@ZenIV
 
> We know we have to modify the ".." entry of the child directory
> being moved, so I'd really like to understand why the locking rule
> of "directory i_rwsem must be held exclusively over modifications"
> so that we can use shared access for read operations has been waived
> for this specific case.

Well, not every filesystem has a physical ".." directory entry, but I share
your sentiment that avoiding grabbing the directory lock in this particular
case is not worth the maintenance burden of trying to track down all the
filesystems that may need it. So I'm still all for grabbing it in VFS and
maybe Al is willing to reconsider given XFS was found to be prone to the
race as well. Al?

> Apart from exposing multiple filesystems to modifications racing
> with operations that hold the i_rwsem shared to *prevent concurrent
> directory modifications*, what performance or scalability benefit is
> seen as a result of eliding this inode lock from the VFS rename
> setup?
> 
> This looks like a straightforward VFS level directory
> locking violation, and now we are playing whack-a-mole to fix it in
> each filesystem we discover that requires the child directory inode
> to be locked...
> 
> > so in ext4 we ended up grabbing the lock in
> > ext4_rename() like:
> > 
> > +               /*
> > +                * We need to protect against old.inode directory getting
> > +                * converted from inline directory format into a normal one.
> > +                */
> > +               inode_lock_nested(old.inode, I_MUTEX_NONDIR2);
> 
> Why are you using the I_MUTEX_NONDIR2 annotation when locking a
> directory inode? That doesn't seem right.

Because that's the only locking subclass left unused during rename and it
happens to have the right ordering for ext4 purposes wrt other i_rwsem
subclasses. In other words it is a hack to fix the race and silence lockdep
;). If we are going to lift this to VFS, we should probably add an
I_MUTEX_MOVED_DIR subclass, possibly as an alias to I_MUTEX_NONDIR2.
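
A minimal userspace sketch of what such an alias could look like. The
first six enumerators mirror the i_rwsem lockdep subclasses in
include/linux/fs.h; I_MUTEX_MOVED_DIR is the hypothetical name floated
above and was not in mainline at the time of this mail. The subclass
limit of 8 is an assumption copied from lockdep's MAX_LOCKDEP_SUBCLASSES.

```c
#include <assert.h>

/* Assumed to match lockdep's MAX_LOCKDEP_SUBCLASSES; local copy only. */
#define SKETCH_MAX_LOCKDEP_SUBCLASSES 8

/* Userspace copy of the i_rwsem lockdep subclasses (not the kernel header). */
enum inode_i_mutex_lock_class {
	I_MUTEX_NORMAL,
	I_MUTEX_PARENT,
	I_MUTEX_CHILD,
	I_MUTEX_XATTR,
	I_MUTEX_NONDIR2,
	I_MUTEX_PARENT2,
	/* Hypothetical alias: names the intent without using a new subclass. */
	I_MUTEX_MOVED_DIR = I_MUTEX_NONDIR2,
};

/* Every subclass value must stay below the lockdep limit. */
_Static_assert(I_MUTEX_PARENT2 < SKETCH_MAX_LOCKDEP_SUBCLASSES,
	       "too many i_rwsem subclasses");
```

With an alias like this, a call such as
inode_lock_nested(old.inode, I_MUTEX_MOVED_DIR) would behave exactly like
the I_MUTEX_NONDIR2 variant quoted above as far as lockdep is concerned;
only the name at the call site changes.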

> Further, how do we guarantee correct i_rwsem lock ordering against
> all the other inodes that the VFS has already locked and/or
> other multi-inode i_rwsem locking primitives in the VFS?

Well, cross-directory renames are all serialized by sb->s_vfs_rename_mutex
so we don't have to be afraid of two renames racing against each other.
Also directories are locked in topological order by all operations so
grabbing the moved directory's lock last is the safe thing to do (because
rename is the only operation that can lock two topologically incomparable
directories).
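
The ordering argument above can be modelled in userspace. This is a
sketch under stated assumptions, not kernel code: struct dir,
is_ancestor() and rename_lock_order() are hypothetical names, the model
takes no real locks and does not represent s_vfs_rename_mutex, and it
assumes the moved directory only needs locking when its parent changes
(since only then is ".." rewritten).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a directory inode; parent is NULL for root. */
struct dir {
	const struct dir *parent;
	const char *name;
};

/* Return 1 if a is a proper ancestor of b, 0 otherwise. */
static int is_ancestor(const struct dir *a, const struct dir *b)
{
	for (b = b->parent; b != NULL; b = b->parent)
		if (b == a)
			return 1;
	return 0;
}

/*
 * Record in order[] the sequence in which directory locks would be taken
 * when moving `moved` from parent `src` to parent `dst`. Returns the number
 * of entries, or -1 if the rename would move a directory under itself.
 */
static int rename_lock_order(const struct dir *src, const struct dir *dst,
			     const struct dir *moved,
			     const struct dir *order[3])
{
	int n = 0;

	if (moved == dst || is_ancestor(moved, dst))
		return -1;		/* would create a loop */

	if (src == dst) {
		order[n++] = src;	/* same parent: ".." is unchanged */
		return n;
	}

	/* Parents first, an ancestor always before its descendant. */
	if (is_ancestor(dst, src)) {
		order[n++] = dst;
		order[n++] = src;
	} else {
		order[n++] = src;
		order[n++] = dst;
	}

	/* The moved directory is taken last. */
	order[n++] = moved;
	return n;
}
```

In this model the moved directory always sits below src, and dst can
never sit below the moved directory, so taking its lock after both
parents never locks a descendant before its ancestor.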

								Honza
-- 
Jan Kara <jack@...e.com>
SUSE Labs, CR