Message-ID: <Z6u5dumvZHf_BDHM@dread.disaster.area>
Date: Wed, 12 Feb 2025 07:56:22 +1100
From: Dave Chinner <david@...morbit.com>
To: Luis Henriques <luis@...lia.com>
Cc: Miklos Szeredi <miklos@...redi.hu>, linux-fsdevel@...r.kernel.org,
linux-kernel@...r.kernel.org, Matt Harvey <mharvey@...ptrading.com>,
Bernd Schubert <bschubert@....com>,
Joanne Koong <joannelkoong@...il.com>
Subject: Re: [PATCH v4] fuse: add new function to invalidate cache for all
inodes
[ FWIW: if the commit message directly references someone else's
related (and somewhat relevant) work, please directly CC those
people on the patch(set). I only noticed this by chance, not because
I read every FUSE related patch that goes by me. ]
On Tue, Feb 11, 2025 at 09:26:04AM +0000, Luis Henriques wrote:
> Currently userspace is able to notify the kernel to invalidate the cache
> for an inode. This means that, if all the inodes in a filesystem need to
> be invalidated, then userspace needs to iterate through all of them and do
> this kernel notification separately.
>
> This patch adds a new option that allows userspace to invalidate all the
> inodes with a single notification operation. In addition to invalidating
> all the inodes, it also shrinks the sb dcache.
That, IMO, seems a bit naive - we generally don't allow
user-controlled denial-of-service vectors to be added to the kernel.
i.e. this is the equivalent of allowing a per-FUSE-filesystem
'echo 1 > /proc/sys/vm/drop_caches' via some FUSE-specific UAPI. We
only allow root access to /proc/sys/vm/drop_caches because it can
otherwise be easily abused to cause system-wide performance issues.
It also strikes me as a somewhat dangerous precedent - invalidating
random VFS caches through user APIs hidden deep in random fs
implementations makes for poor visibility and difficult maintenance
of VFS level functionality...
> Signed-off-by: Luis Henriques <luis@...lia.com>
> ---
> * Changes since v3
> - Added comments to clarify semantic changes in fuse_reverse_inval_inode()
> when called with FUSE_INVAL_ALL_INODES (suggested by Bernd).
> - Added comments to inodes iteration loop to clarify __iget/iput usage
> (suggested by Joanne)
> - Dropped get_fuse_mount() call -- fuse_mount can be obtained from
> fuse_ilookup() directly (suggested by Joanne)
>
> (Also dropped the RFC from the subject.)
>
> * Changes since v2
> - Use the new helper from fuse_reverse_inval_inode(), as suggested by Bernd.
> - Also updated patch description as per checkpatch.pl suggestion.
>
> * Changes since v1
> As suggested by Bernd, this patch v2 simply adds a helper function that
> will make it easier to replace most of its code with a call to
> super_iter_inodes() when Dave Chinner's patch[1] eventually gets merged.
>
> [1] https://lore.kernel.org/r/20241002014017.3801899-3-david@fromorbit.com
That doesn't make the functionality any more palatable.
Those iterators are the first step in removing the VFS inode list
and only maintaining it in filesystems that actually need this
functionality. We want this list to go away because maintaining it
is a general VFS cache scalability limitation.
i.e. if a filesystem has internal functionality that requires
iterating all instantiated inodes, the filesystem itself should
maintain that list in the most efficient manner for the filesystem's
iteration requirements, not rely on the VFS to maintain this
information for it.
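A rough, untested sketch of what that could look like - note that
fc->inode_list, fc->inode_list_lock and fi->conn_entry are all
hypothetical fields that do not exist in FUSE today; fuse_iget()
would add the inode to the list and the evict path would remove it:

	static void fuse_inval_all_conn_inodes(struct fuse_conn *fc)
	{
		struct fuse_inode *fi;
		struct inode *old_inode = NULL;

		spin_lock(&fc->inode_list_lock);
		list_for_each_entry(fi, &fc->inode_list, conn_entry) {
			struct inode *inode = &fi->inode;

			spin_lock(&inode->i_lock);
			if (inode->i_state & (I_FREEING | I_WILL_FREE | I_NEW)) {
				spin_unlock(&inode->i_lock);
				continue;
			}
			/* Pin the inode so the list cursor stays valid. */
			__iget(inode);
			spin_unlock(&inode->i_lock);
			spin_unlock(&fc->inode_list_lock);

			iput(old_inode);
			inval_single_inode(inode, fc);
			old_inode = inode;

			cond_resched();
			spin_lock(&fc->inode_list_lock);
		}
		spin_unlock(&fc->inode_list_lock);
		iput(old_inode);
	}
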
That's the point of the iterator methods the above patchset adds -
they allow the filesystem to provide the VFS with a method for
iterating all inodes in the filesystem during the transition period
where we rework the inode cache structure (i.e. per-sb hash tables
for inode lookup, inode LRU caching goes away, etc). Once that
rework gets done, there won't be a VFS inode cache to iterate.....
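For illustration, the FUSE side of this against the iterator API
proposed in [1] would collapse to roughly the following - a sketch
only, as the exact super_iter_inodes() signature and callback
convention may differ from whatever eventually lands:

	/* Callback run for each inode; a zero return keeps iterating. */
	static int fuse_inval_one(struct inode *inode, void *data)
	{
		struct fuse_conn *fc = data;

		inval_single_inode(inode, fc);
		return 0;
	}

	static int fuse_reverse_inval_all(struct fuse_conn *fc)
	{
		struct fuse_mount *fm;
		struct inode *inode = fuse_ilookup(fc, FUSE_ROOT_ID, &fm);

		if (!inode || !fm)
			return -ENOENT;
		iput(inode);

		return super_iter_inodes(fm->sb, fuse_inval_one, fc, 0);
	}
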
> fs/fuse/inode.c | 83 +++++++++++++++++++++++++++++++++++----
> include/uapi/linux/fuse.h | 3 ++
> 2 files changed, 79 insertions(+), 7 deletions(-)
>
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index e9db2cb8c150..5aa49856731a 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -547,25 +547,94 @@ struct inode *fuse_ilookup(struct fuse_conn *fc, u64 nodeid,
> return NULL;
> }
>
> +static void inval_single_inode(struct inode *inode, struct fuse_conn *fc)
> +{
> + struct fuse_inode *fi;
> +
> + fi = get_fuse_inode(inode);
> + spin_lock(&fi->lock);
> + fi->attr_version = atomic64_inc_return(&fc->attr_version);
> + spin_unlock(&fi->lock);
> + fuse_invalidate_attr(inode);
> + forget_all_cached_acls(inode);
> +}
> +
> +static int fuse_reverse_inval_all(struct fuse_conn *fc)
> +{
> + struct fuse_mount *fm;
> + struct super_block *sb;
> + struct inode *inode, *old_inode = NULL;
> +
> + inode = fuse_ilookup(fc, FUSE_ROOT_ID, &fm);
> + if (!inode || !fm)
> + return -ENOENT;
> +
> + iput(inode);
> + sb = fm->sb;
> +
> + spin_lock(&sb->s_inode_list_lock);
> + list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> + spin_lock(&inode->i_lock);
> + if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> + !atomic_read(&inode->i_count)) {
> + spin_unlock(&inode->i_lock);
> + continue;
> + }
This skips every inode that is unreferenced and cached on the
LRU. i.e. it only invalidates inodes that have a current reference
(e.g. a dentry pins it, it has an open file, etc).
What's the point of only invalidating actively referenced inodes?
> + /*
> + * This __iget()/iput() dance is required so that we can release
> + * the sb lock and continue the iteration on the previous
> + * inode. If we don't keep a ref to the old inode it could have
> + * disappeared. This way we can safely call cond_resched() when
> + * there's a huge amount of inodes to iterate.
> + */
If there's a huge amount of inodes to iterate, then most of them are
going to be on the LRU and unreferenced, so this code won't even get
here to be able to run cond_resched().
> + __iget(inode);
> + spin_unlock(&inode->i_lock);
> + spin_unlock(&sb->s_inode_list_lock);
> + iput(old_inode);
> +
> + inval_single_inode(inode, fc);
> +
> + old_inode = inode;
> + cond_resched();
> + spin_lock(&sb->s_inode_list_lock);
> + }
> + spin_unlock(&sb->s_inode_list_lock);
> + iput(old_inode);
> +
> + shrink_dcache_sb(sb);
Why drop all the referenced inodes held by the dentry cache -after-
inode invalidation? Doesn't this mean that racing operations are
going to see valid dentries backed by an invalidated inode? Why
aren't the dentries pruned from the cache first, and new lookups
blocked until the invalidation completes?
I'm left to ponder why the invalidation isn't simply:
	/* Remove all possible active references to cached inodes */
	shrink_dcache_sb(sb);
	/* Remove all unreferenced inodes from cache */
	invalidate_inodes(sb);
Which will result in far more of the inode cache for the filesystem
being invalidated than the above code....
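In the context of this patch that would boil down to something like
the following untested sketch (invalidate_inodes() is currently
VFS-internal - declared in fs/internal.h and not exported - so it
would need to be made available to FUSE in some form):

	static int fuse_reverse_inval_all(struct fuse_conn *fc)
	{
		struct fuse_mount *fm;
		struct inode *inode = fuse_ilookup(fc, FUSE_ROOT_ID, &fm);

		if (!inode || !fm)
			return -ENOENT;
		iput(inode);

		/* Drop dentry references first so they stop pinning inodes. */
		shrink_dcache_sb(fm->sb);
		/* Then evict every unreferenced inode from the cache. */
		invalidate_inodes(fm->sb);

		return 0;
	}
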
-Dave.
--
Dave Chinner
david@...morbit.com