[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-id: <173888595790.22054.15435342804967126282@noble.neil.brown.name>
Date: Fri, 07 Feb 2025 10:52:37 +1100
From: "NeilBrown" <neilb@...e.de>
To: "Christian Brauner" <brauner@...nel.org>
Cc: "Alexander Viro" <viro@...iv.linux.org.uk>, "Jan Kara" <jack@...e.cz>,
"Linus Torvalds" <torvalds@...ux-foundation.org>,
"Jeff Layton" <jlayton@...nel.org>, "Dave Chinner" <david@...morbit.com>,
linux-fsdevel@...r.kernel.org, linux-kernel@...r.kernel.org
Subject: Re: [PATCH 01/19] VFS: introduce vfs_mkdir_return()
On Thu, 06 Feb 2025, Christian Brauner wrote:
> On Thu, Feb 06, 2025 at 04:42:38PM +1100, NeilBrown wrote:
> > vfs_mkdir() does not guarantee to make the child dentry positive on
> > success. It may leave it negative and then the caller needs to perform a
> > lookup to find the target dentry.
> >
> > This patch introduced vfs_mkdir_return() which performs the lookup if
> > needed so that this code is centralised.
> >
> > This prepares for a new inode operation which will perform mkdir and
> > returns the correct dentry.
> >
> > Signed-off-by: NeilBrown <neilb@...e.de>
> > ---
> > fs/cachefiles/namei.c | 7 +---
> > fs/namei.c | 69 ++++++++++++++++++++++++++++++++++++++++
> > fs/nfsd/vfs.c | 21 ++----------
> > fs/overlayfs/dir.c | 33 +------------------
> > fs/overlayfs/overlayfs.h | 10 +++---
> > fs/overlayfs/super.c | 2 +-
> > fs/smb/server/vfs.c | 24 +++-----------
> > include/linux/fs.h | 2 ++
> > 8 files changed, 86 insertions(+), 82 deletions(-)
> >
> > diff --git a/fs/cachefiles/namei.c b/fs/cachefiles/namei.c
> > index 7cf59713f0f7..3c866c3b9534 100644
> > --- a/fs/cachefiles/namei.c
> > +++ b/fs/cachefiles/namei.c
> > @@ -95,7 +95,6 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
> > /* search the current directory for the element name */
> > inode_lock_nested(d_inode(dir), I_MUTEX_PARENT);
> >
> > -retry:
> > ret = cachefiles_inject_read_error();
> > if (ret == 0)
> > subdir = lookup_one_len(dirname, dir, strlen(dirname));
> > @@ -130,7 +129,7 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
> > goto mkdir_error;
> > ret = cachefiles_inject_write_error();
> > if (ret == 0)
> > - ret = vfs_mkdir(&nop_mnt_idmap, d_inode(dir), subdir, 0700);
> > + ret = vfs_mkdir_return(&nop_mnt_idmap, d_inode(dir), &subdir, 0700);
> > if (ret < 0) {
> > trace_cachefiles_vfs_error(NULL, d_inode(dir), ret,
> > cachefiles_trace_mkdir_error);
> > @@ -138,10 +137,6 @@ struct dentry *cachefiles_get_directory(struct cachefiles_cache *cache,
> > }
> > trace_cachefiles_mkdir(dir, subdir);
> >
> > - if (unlikely(d_unhashed(subdir))) {
> > - cachefiles_put_directory(subdir);
> > - goto retry;
> > - }
> > ASSERT(d_backing_inode(subdir));
> >
> > _debug("mkdir -> %pd{ino=%lu}",
> > diff --git a/fs/namei.c b/fs/namei.c
> > index 3ab9440c5b93..d98caf36e867 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -4317,6 +4317,75 @@ int vfs_mkdir(struct mnt_idmap *idmap, struct inode *dir,
> > }
> > EXPORT_SYMBOL(vfs_mkdir);
> >
> > +/**
> > + * vfs_mkdir_return - create directory returning correct dentry
> > + * @idmap: idmap of the mount the inode was found from
> > + * @dir: inode of the parent directory
> > + * @dentryp: pointer to dentry of the child directory
> > + * @mode: mode of the child directory
> > + *
> > + * Create a directory.
> > + *
> > + * If the inode has been found through an idmapped mount the idmap of
> > + * the vfsmount must be passed through @idmap. This function will then take
> > + * care to map the inode according to @idmap before checking permissions.
> > + * On non-idmapped mounts or if permission checking is to be performed on the
> > + * raw inode simply pass @nop_mnt_idmap.
> > + *
> > + * The filesystem may not use the dentry that was passed in. In that case
> > + * the passed-in dentry is put and a new one is placed in *@...tryp;
> > + * So on successful return *@...tryp will always be positive.
> > + */
> > +int vfs_mkdir_return(struct mnt_idmap *idmap, struct inode *dir,
> > + struct dentry **dentryp, umode_t mode)
> > +{
>
> I think this is misnamed. Maybe vfs_mkdir_positive() is better here.
> It also be nice to have a comment on vfs_mkdir() as well pointing out
> that the returned dentry might be negative.
While I'm not particularly fond of vfs_mkdir_return(), I don't see that
vfs_mkdir_positive() is an improvement. I cannot find any relevant
precedent in the kernel to guide. Most _return and _positive functions
are for low-level counting primitives :-)
I'm tempted to add another arg to vfs_mkdir() instead of adding a new
function. That would solve one problem by introducing another: what
arg? Maybe pass both a 'struct dentry *' and a 'struct dentry **' and
if the latter is not NULL, it gets filled with the new dentry if there
is one.
>
> And is there a particular reason to not have it return the new dentry?
> That seems clearer than using the argument as a return value.
If I did that then every caller would need to check if the return value
was not IS_ERR_OR_NULL() and if so, dput() the original dentry and keep
the new one - just like current callers of ->lookup need to. It seems
cleaner to do that once in vfs_mkdir_return() rather than in all the
callers. I guess we could *always* return the dentry on success and
dput the old one if it was different or if there were an error. So
dentry = vfs_mkdir_return(idmap, inode, dentry, mode)
would be the common pattern. Would you be OK with that?
>
> > + struct dentry *dentry = *dentryp;
> > + int error;
> > + unsigned max_links = dir->i_sb->s_max_links;
> > +
> > + error = may_create(idmap, dir, dentry);
> > + if (error)
> > + return error;
> > +
> > + if (!dir->i_op->mkdir)
> > + return -EPERM;
> > +
> > + mode = vfs_prepare_mode(idmap, dir, mode, S_IRWXUGO | S_ISVTX, 0);
> > + error = security_inode_mkdir(dir, dentry, mode);
> > + if (error)
> > + return error;
> > +
> > + if (max_links && dir->i_nlink >= max_links)
> > + return -EMLINK;
> > +
> > + error = dir->i_op->mkdir(idmap, dir, dentry, mode);
>
> Why isn't this calling vfs_mkdir() and then only starts differing afterwards?
Because once we introduce the new ->mkdir_async which returns a dentry
the two functions start to diverge more.
I could have a __vfs_mkdir() which does both and has a bool arg to tell
it if we want the return value. That would avoid code duplication.
>
> > + if (!error) {
> > + fsnotify_mkdir(dir, dentry);
> > + if (unlikely(d_unhashed(dentry))) {
> > + struct dentry *d;
> > + /* Need a "const" pointer. We know d_name is const
> > + * because we hold an exclusive lock on i_rwsem
> > + * in d_parent.
> > + */
> > + const struct qstr *d_name = (void*)&dentry->d_name;
> > + d = lookup_dcache(d_name, dentry->d_parent, 0);
> > + if (!d)
> > + d = __lookup_slow(d_name, dentry->d_parent, 0);
>
> Quite a few caller's use lookup_one() here which calls
> inode_permission() on @dir again. Are we guaranteed that the permission
> check would always pass?
I think they use lookup_one() because that was the easiest, not because
they need all the functionality.
If the process had permission to create a directory with a given name
but now doesn't have permission to look up that same name, then
something is weird. Maybe a race with permission changing could do
that. But I think the process should have the right to hold the dentry
that it has just successfully created.
The lookup is hopefully just a work-around until the new improved
interface is used by all relevant filesystems.
Thanks,
NeilBrown
Powered by blists - more mailing lists