linux-kernel - Re: [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a directory

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20250206-pocken-entzwei-bd310c8d61b3@brauner>
Date: Thu, 6 Feb 2025 15:36:57 +0100
From: Christian Brauner <brauner@...nel.org>
To: NeilBrown <neilb@...e.de>
Cc: Alexander Viro <viro@...iv.linux.org.uk>, Jan Kara <jack@...e.cz>, 
	Linus Torvalds <torvalds@...ux-foundation.org>, Jeff Layton <jlayton@...nel.org>, 
	Dave Chinner <david@...morbit.com>, linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 00/19 v7?] RFC: Allow concurrent and async changes in a
 directory

On Thu, Feb 06, 2025 at 04:42:37PM +1100, NeilBrown wrote:
> This is my latest attempt at removing the requirement for an exclusive
> lock on a directory which performing updates in this.  This version,
> inspired by Dave Chinner, goes a step further and allow async updates.
> 
> The inode operation still requires the inode lock, at least a shared
> lock, but may return -EINPROGRES and then continue asynchronously
> without needing any ongoing lock on the directory.
> 
> An exclusive lock on the dentry is held across the entire operation.
> 
> This change requires various extra checks.  rmdir must ensure there is
> no async creation still happening.  rename between directories must
> ensure non of the relevant ancestors are undergoing async rename.  There
> may be or checks that I need to consider - mounting?

Mounting takes an exclusive lock on the target inode in do_lock_mount()
and finish_automount(). As long as dont_mount() can't happen
asynchronously in vfs_rmdir(), vfs_unlink() or vfs_rename() it should be
fine.

> 
> One other important change since my previous posting is that I've
> dropped the idea of taking a separate exclusive lock on the directory
> when the fs doesn't support shared locking.  This cannot work as it
> doeesn't prevent lookups and filesystems don't expect a lookup while
> they are changing a directory.  So instead we need to choose between
> exclusive or shared for the inode on a case-by-case basis.

Which is possibly fine if we do it similar to what I suggested in the
series. As it stands with the separate methods it's a no-go for me. But
that's a solvable problem, I think.

> To make this choice we divide all ops into four groups: create, remove,
> rename, open/create.  If an inode has no operations in the group that
> require an exclusive lock, then a flag is set on the inode so that
> various code knows that a shared lock is sufficient.  If the flag is not
> set, an exclusive lock is obtained.
> 
> I've also added rename handling and converted NFS to use all _async ops.
> 
> The motivation for this comes from the general increase in scale of
> systems.  We can support very large directories and many-core systems
> and applications that choose to use large directories can hit
> unnecessary contention.
> 
> NFS can easily hit this when used over a high-latency link.
> Lustre already has code to allow concurrent directory updates in the
> back-end filesystem (ldiskfs - a slightly modified ext4).
> Lustre developers believe this would also benefit the client-side
> filesystem with large core counts.
> 
> The idea behind the async support is to eventually connect this to
> io_uring so that one process can launch several concurrent directory
> operations.  I have not looked deeply into io_uring and cannot be
> certain that the interface I've provided will be able to be used.  I
> would welcome any advice on that matter, though I hope to find time to
> explore myself.  For now if any _async op returns -EINPROGRESS we simply
> wait for the callback to indicate completion.
> 
> Test status:  only light testing.  It doesn't easily blow up, but lockdep
> complains that repeated calls to d_update_wait() are bad, even though
> it has balanced acquire and release calls. Weird?
> 
> Thanks,
> NeilBrown
> 
>  [PATCH 01/19] VFS: introduce vfs_mkdir_return()
>  [PATCH 02/19] VFS: use global wait-queue table for d_alloc_parallel()
>  [PATCH 03/19] VFS: use d_alloc_parallel() in lookup_one_qstr_excl()
>  [PATCH 04/19] VFS: change kern_path_locked() and
>  [PATCH 05/19] VFS: add common error checks to lookup_one_qstr()
>  [PATCH 06/19] VFS: repack DENTRY_ flags.
>  [PATCH 07/19] VFS: repack LOOKUP_ bit flags.
>  [PATCH 08/19] VFS: introduce lookup_and_lock() and friends
>  [PATCH 09/19] VFS: add _async versions of the various directory
>  [PATCH 10/19] VFS: introduce inode flags to report locking needs for
>  [PATCH 11/19] VFS: Add ability to exclusively lock a dentry and use
>  [PATCH 12/19] VFS: enhance d_splice_alias to accommodate shared-lock
>  [PATCH 13/19] VFS: lock dentry for ->revalidate to avoid races with
>  [PATCH 14/19] VFS: Ensure no async updates happening in directory
>  [PATCH 15/19] VFS: Change lookup_and_lock() to use shared lock when
>  [PATCH 16/19] VFS: add lookup_and_lock_rename()
>  [PATCH 17/19] nfsd: use lookup_and_lock_one() and
>  [PATCH 18/19] nfs: change mkdir inode_operation to mkdir_async
>  [PATCH 19/19] nfs: switch to _async for all directory ops.