linux-kernel - Re: [PATCH RFC v3] vfs: make fstatat retry once on ESTALE errors from getattr call

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20120423145935.141cfe0b@tlielax.poochiereds.net>
Date:	Mon, 23 Apr 2012 14:59:35 -0400
From:	Jeff Layton <jlayton@...hat.com>
To:	Miklos Szeredi <miklos@...redi.hu>
Cc:	"J. Bruce Fields" <bfields@...ldses.org>,
	Malahal Naineni <malahal@...ibm.com>,
	Steve Dickson <SteveD@...hat.com>,
	linux-fsdevel@...r.kernel.org, linux-nfs@...r.kernel.org,
	linux-kernel@...r.kernel.org, viro@...iv.linux.org.uk,
	hch@...radead.org, michael.brantley@...haw.com,
	sven.breuner@...m.fraunhofer.de, chuck.lever@...cle.com,
	pstaubach@...grid.com, trond.myklebust@....uio.no, rees@...ch.edu
Subject: Re: [PATCH RFC v3] vfs: make fstatat retry once on ESTALE errors
 from getattr call

On Mon, 23 Apr 2012 17:28:19 +0200
Miklos Szeredi <miklos@...redi.hu> wrote:

> Jeff Layton <jlayton@...hat.com> writes:
> 
> >
> > Ok, but again, that only applies to the lookup. It has no bearing on
> > the subsequent operation. For instance, if we're doing:
> >
> >      rename("/foo", "/bar");
> >
> > ...and another client is simultaneously doing:
> >
> >      creat("/bar/baz", 0600);
> >
> > ...and we get back ESTALE from the server on the create because the
> > "old" /bar got replaced after the lookup of it. Then it seems like
> > returning -ENOENT would not be correct since there was never a time
> > where /bar didn't exist...
> 
> It may not be "correct" according to some standard.  But it's what Linux
> does since day one on *all* filesystems.  And probably other OS's that
> have remotely scalable lookup routines.
> 
> Thanks,
> Miklos

The race window with local filesystems is very small though. I'm not
sure I've ever heard of anyone hitting this on one in practice. For NFS
OTOH, it's quite large and it's easy to hit an ESTALE without even
trying very hard.

On a related note, here's a first pass at making the getattr codepaths
handle ERETRYLOOKUP. It assumes that we don't need to fix the devtmpfs
callers or block/loop.c.

The really ugly part is that we have to deal with other callers that
don't have the ability to retry. So we have to convert ERETRYLOOKUP
back to ESTALE in those callers.

Dealing with vfs_getattr is actually fairly trivial. A tougher one would
be the inode_permission codepaths. Those go all over the place and it
would be tough to ensure that we don't leak this new error code to
userspace.

Personally, I think this approach is just too ugly to live in these
other codepaths, but it seems like a reasonable approach to allow us to
retry indefinitely on an ESTALE from a lookup.

-------------------------------------------------------------------------------
vfs-add-an-retry_lookup-functi
-------------------------------------------------------------------------------
vfs: add an retry_lookup function to handle retries on ERETRYLOOKUP

From: Jeff Layton <jlayton@...hat.com>

Signed-off-by: Jeff Layton <jlayton@...hat.com>
---

 fs/namei.c         |   23 +++++++++++++++++++++++
 include/linux/fs.h |    2 ++
 2 files changed, 25 insertions(+), 0 deletions(-)


diff --git a/fs/namei.c b/fs/namei.c
index fef6a0d..a781b0e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -116,6 +116,29 @@
  * POSIX.1 2.4: an empty pathname is invalid (ENOENT).
  * PATH_MAX includes the nul terminator --RR.
  */
+
+/**
+ * retry_lookup - determine whether the caller should retry a lookup and op
+ *
+ * @error: the error we'll be returning
+ * @retry: what retry attempt we're on
+ *
+ */
+bool
+retry_lookup(const long error, const unsigned int retry)
+{
+	long timeout;
+
+	if (likely(error != -ERETRYLOOKUP))
+		return false;
+
+	/* arbitrarily cap backoff delay at 1s */
+	timeout = retry * 2;
+	timeout = timeout > HZ ? HZ : timeout;
+
+	return !schedule_timeout_killable(timeout);
+}
+
 static int do_getname(const char __user *filename, char *page)
 {
 	int retval;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8de6755..5cc03c3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2026,6 +2026,8 @@ extern struct file * dentry_open(struct dentry *, struct vfsmount *, int,
 extern int filp_close(struct file *, fl_owner_t id);
 extern char * getname(const char __user *);
 
+extern bool retry_lookup(const long error, const unsigned int retry);
+
 /* fs/ioctl.c */
 
 extern int ioctl_preallocate(struct file *filp, void __user *argp);
-------------------------------------------------------------------------------
vfs-make-fstatat-retry-on-esta
-------------------------------------------------------------------------------
vfs: make fstatat retry on ESTALE errors from getattr call

From: Jeff Layton <jlayton@...hat.com>

Have NFS return ERETRYLOOKUP to the vfs when it gets an ESTALE back on a
GETATTR call. The VFS will then retry the lookup and call when it sees
that error.

Unfortunately, we do need to deal with the potential for "leakage" of
this error when we can't retry, so we must check for it in *all* callers
of vfs_getattr and swap -ESTALE for it when it occurs.

Signed-off-by: Jeff Layton <jlayton@...hat.com>
---

 fs/ecryptfs/inode.c |    2 +-
 fs/nfs/inode.c      |    2 +-
 fs/nfsd/nfsproc.c   |    1 +
 fs/stat.c           |   20 ++++++++++++--------
 4 files changed, 15 insertions(+), 10 deletions(-)


diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index ab35b11..d4b5f15 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -1068,7 +1068,7 @@ int ecryptfs_getattr(struct vfsmount *mnt, struct dentry *dentry,
 		generic_fillattr(dentry->d_inode, stat);
 		stat->blocks = lower_stat.blocks;
 	}
-	return rc;
+	return rc == -ERETRYLOOKUP ? -ESTALE : rc;
 }
 
 int
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index e8bbfa5..2916e87 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -541,7 +541,7 @@ int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
 		stat->ino = nfs_compat_user_ino64(NFS_FILEID(inode));
 	}
 out:
-	return err;
+	return err == -ESTALE ? -ERETRYLOOKUP : err;
 }
 
 static void nfs_init_lock_context(struct nfs_lock_context *l_ctx)
diff --git a/fs/nfsd/nfsproc.c b/fs/nfsd/nfsproc.c
index e15dc45..5dee967 100644
--- a/fs/nfsd/nfsproc.c
+++ b/fs/nfsd/nfsproc.c
@@ -742,6 +742,7 @@ nfserrno (int errno)
 		{ nfserr_notsupp, -EOPNOTSUPP },
 		{ nfserr_toosmall, -ETOOSMALL },
 		{ nfserr_serverfault, -ESERVERFAULT },
+		{ nfserr_stale, -ERETRYLOOKUP },
 	};
 	int	i;
 
diff --git a/fs/stat.c b/fs/stat.c
index c733dc5..fb4cf29 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -64,7 +64,7 @@ int vfs_fstat(unsigned int fd, struct kstat *stat)
 		error = vfs_getattr(f->f_path.mnt, f->f_path.dentry, stat);
 		fput(f);
 	}
-	return error;
+	return error == -ERETRYLOOKUP ? -ESTALE : error;
 }
 EXPORT_SYMBOL(vfs_fstat);
 
@@ -73,7 +73,8 @@ int vfs_fstatat(int dfd, const char __user *filename, struct kstat *stat,
 {
 	struct path path;
 	int error = -EINVAL;
-	int lookup_flags = 0;
+	unsigned int try = 0;
+	unsigned int lookup_flags = 0;
 
 	if ((flag & ~(AT_SYMLINK_NOFOLLOW | AT_NO_AUTOMOUNT |
 		      AT_EMPTY_PATH)) != 0)
@@ -84,12 +85,15 @@ int vfs_fstatat(int dfd, const char __user *filename, struct kstat *stat,
 	if (flag & AT_EMPTY_PATH)
 		lookup_flags |= LOOKUP_EMPTY;
 
-	error = user_path_at(dfd, filename, lookup_flags, &path);
-	if (error)
-		goto out;
-
-	error = vfs_getattr(path.mnt, path.dentry, stat);
-	path_put(&path);
+	do {
+		error = user_path_at(dfd, filename, lookup_flags, &path);
+		if (!error) {
+			error = vfs_getattr(path.mnt, path.dentry, stat);
+			path_put(&path);
+			break;
+		}
+		lookup_flags |= LOOKUP_REVAL;
+	} while (retry_lookup(error, try++));
 out:
 	return error;
 }

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/