lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <Pine.LNX.4.64.1103041443290.29012@cobra.newdream.net>
Date:	Fri, 4 Mar 2011 15:18:57 -0800 (PST)
From:	Sage Weil <sage@...dream.net>
To:	linux-fsdevel@...r.kernel.org, ceph-devel@...r.kernel.org,
	linux-kernel@...r.kernel.org
cc:	Al Viro <viro@...IV.linux.org.uk>
Subject: [RFC] d_prune

The rcu path walk changes for 2.6.38 shone light in some dark corners 
where Ceph was using the dcache in racy ways.  The main problem is this:

 * The Ceph MDS server gives us lease information such that we know the
   contents for a directory won't change.
 * We want to do lookup on non-existant items without interacting with the
   server.  (We also want to do readdir, but that's a more complicated
   case.)
 * The existing hooks (d_release is what we were using) do not give us
   the opportunity to clear our "this directory is completely cached" flag
   prior to the dentry being unhashed.
 * d_lookup can't look at the "complete" flag and conclude a dentry 
   doesn't exist without worrying about a race with the pruner.

There are two cases where this matters:

 * The VFS does a lookup prior to any create, which means we do two server 
   requests instead of one.  Some VFS refactoring could probably fix that 
   (and Al has some ideas there).
 * A user looks up a non-existent file.  This should not require a server
   request at all.

The race we care about is with the pruner (shrink_dentry_list and 
shrink_dcache_for_umount_subtree).  Dropping dentries due to actual 
changes (rename, unlink, rmdir) all go through the usual d_ops where we 
have ample opportunity to the right thing (with the exception of one 
dentry_unhash in vfs_rename_dir() :/).

Is something like the patch below sane?  It notifies the fs prior to 
unhashing the dentry, giving Ceph the chance to clear its "complete" bit.  
Are there other reasons the VFS would drop dentries that I'm missing?

Some alternatives we've considered:

 * Double-caching.  We could keep a copy of directory contents in the fs
   layer and use that to do the lookup.  Yuck.
 * Put the "complete" bit in the dcache.  The problem is it's a flag on 
   the parent, d_flags is protected by d_lock, and we can't take the parent's
   d_lock while holding the child's d_lock (which we do while pruning).  
   Extra work that most fs's don't need.
 * Serializing lookups against the pruner in some other way.

Any thoughts or suggestions?

Thanks! 
sage


diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 4471a41..180e14b 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -28,6 +28,7 @@ d_revalidate: no              no              yes (ref-walk)  maybe
 d_hash         no              no              no              maybe
 d_compare:     yes             no              no              maybe
 d_delete:      no              yes             no              no
+d_prune:        no              yes             no              no
 d_release:     no              no              yes             no
 d_iput:                no              no              yes             no
 d_dname:       no              no              no              no
diff --git a/fs/dcache.c b/fs/dcache.c
index 2a6bd9a..cdb5d81 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -690,6 +690,8 @@ static void try_prune_one_dentry(struct dentry *dentry)
                        spin_unlock(&dentry->d_lock);
                        return;
                }
+               if (dentry->d_op->d_flags & DCACHE_OP_PRUNE)
+                       dentry->d_op->d_prune(dentry);
                dentry = dentry_kill(dentry, 1);
        }
 }
@@ -896,6 +898,8 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 
        /* detach this root from the system */
        spin_lock(&dentry->d_lock);
+       if (dentry->d_op->d_flags & DCACHE_OP_PRUNE)
+               dentry->d_op->d_prune(dentry);
        dentry_lru_del(dentry);
        __d_drop(dentry);
        spin_unlock(&dentry->d_lock);
@@ -912,6 +916,8 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
                                            d_u.d_child) {
                                spin_lock_nested(&loop->d_lock,
                                                DENTRY_D_LOCK_NESTED);
+                               if (dentry->d_op->d_flags & DCACHE_OP_PRUNE)
+                                       dentry->d_op->d_prune(dentry);
                                dentry_lru_del(loop);
                                __d_drop(loop);
                                spin_unlock(&loop->d_lock);
@@ -1375,6 +1381,8 @@ void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op)
                dentry->d_flags |= DCACHE_OP_REVALIDATE;
        if (op->d_delete)
                dentry->d_flags |= DCACHE_OP_DELETE;
+       if (op->d_prune)
+               dentry->d_flags |= DCACHE_OP_PRUNE;
 
 }
 EXPORT_SYMBOL(d_set_d_op);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index f958c19..1e83bd8 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -165,6 +165,7 @@ struct dentry_operations {
                        unsigned int, const char *, const struct qstr *);
        int (*d_delete)(const struct dentry *);
        void (*d_release)(struct dentry *);
+       void (*d_prune)(struct dentry *);
        void (*d_iput)(struct dentry *, struct inode *);
        char *(*d_dname)(struct dentry *, char *, int);
        struct vfsmount *(*d_automount)(struct path *);
@@ -219,6 +220,8 @@ struct dentry_operations {
 #define DCACHE_MANAGED_DENTRY \
        (DCACHE_MOUNTED|DCACHE_NEED_AUTOMOUNT|DCACHE_MANAGE_TRANSIT)
 
+#define DCACHE_OP_PRUNE         0x80000
+
 extern seqlock_t rename_lock;
 
 static inline int dname_external(struct dentry *dentry)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ