lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [day] [month] [year] [list]
Date:	Mon, 7 Mar 2011 09:53:05 -0800 (PST)
From:	Sage Weil <sage@...dream.net>
To:	Al Viro <viro@...IV.linux.org.uk>, linux-fsdevel@...r.kernel.org,
	ceph-devel@...r.kernel.org, linux-kernel@...r.kernel.org
cc:	mszeredi@...e.cz
Subject: Re: [RFC] d_prune

On Fri, 4 Mar 2011, Sage Weil wrote:
> The rcu path walk changes for 2.6.38 shone light in some dark corners 
> where Ceph was using the dcache in racy ways.  The main problem is this:
> 
>  * The Ceph MDS server gives us lease information such that we know the
>    contents for a directory won't change.
>  * We want to do lookup on non-existant items without interacting with the
>    server.  (We also want to do readdir, but that's a more complicated
>    case.)
>  * The existing hooks (d_release is what we were using) do not give us
>    the opportunity to clear our "this directory is completely cached" flag
>    prior to the dentry being unhashed.
>  * d_lookup can't look at the "complete" flag and conclude a dentry 
>    doesn't exist without worrying about a race with the pruner.
> 
> There are two cases where this matters:
> 
>  * The VFS does a lookup prior to any create, which means we do two server 
>    requests instead of one.  Some VFS refactoring could probably fix that 
>    (and Al has some ideas there).
>  * A user looks up a non-existent file.  This should not require a server
>    request at all.
> 
> The race we care about is with the pruner (shrink_dentry_list and 
> shrink_dcache_for_umount_subtree).  Dropping dentries due to actual 
> changes (rename, unlink, rmdir) all go through the usual d_ops where we 
> have ample opportunity to the right thing (with the exception of one 
> dentry_unhash in vfs_rename_dir() :/).

Unless there are other (non-pruner) cases where the VFS randomly unhashes 
things, I think vfs_rename_dir() is the main problem with the d_prune 
method approach.  Specifically, vfs_rename_dir() does

                if (target)
                        dentry_unhash(new_dentry);

prior to calling ->rename().  It's not clear to me why vfs_rename_dir() 
does this but vfs_rename_other() does not.  Could it be pushed down into 
the ->rename() method?

Also, the 

                if (d_unhashed(new_dentry))
                        d_rehash(new_dentry);

after ->rename() has always looked fishy/buggy to me.  See

        http://lkml.org/lkml/2008/7/30/165

Thanks-
sage



> Is something like the patch below sane?  It notifies the fs prior to 
> unhashing the dentry, giving Ceph the chance to clear its "complete" bit.  
> Are there other reasons the VFS would drop dentries that I'm missing?
> 
> Some alternatives we've considered:
> 
>  * Double-caching.  We could keep a copy of directory contents in the fs
>    layer and use that to do the lookup.  Yuck.
>  * Put the "complete" bit in the dcache.  The problem is it's a flag on 
>    the parent, d_flags is protected by d_lock, and we can't take the parent's
>    d_lock while holding the child's d_lock (which we do while pruning).  
>    Extra work that most fs's don't need.
>  * Serializing lookups against the pruner in some other way.
> 
> Any thoughts or suggestions?
> 
> Thanks! 
> sage
> 
> 
> diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> index 4471a41..180e14b 100644
> --- a/Documentation/filesystems/Locking
> +++ b/Documentation/filesystems/Locking
> @@ -28,6 +28,7 @@ d_revalidate: no              no              yes (ref-walk)  maybe
>  d_hash         no              no              no              maybe
>  d_compare:     yes             no              no              maybe
>  d_delete:      no              yes             no              no
> +d_prune:        no              yes             no              no
>  d_release:     no              no              yes             no
>  d_iput:                no              no              yes             no
>  d_dname:       no              no              no              no
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 2a6bd9a..cdb5d81 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -690,6 +690,8 @@ static void try_prune_one_dentry(struct dentry *dentry)
>                         spin_unlock(&dentry->d_lock);
>                         return;
>                 }
> +               if (dentry->d_op->d_flags & DCACHE_OP_PRUNE)
> +                       dentry->d_op->d_prune(dentry);
>                 dentry = dentry_kill(dentry, 1);
>         }
>  }
> @@ -896,6 +898,8 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
>  
>         /* detach this root from the system */
>         spin_lock(&dentry->d_lock);
> +       if (dentry->d_op->d_flags & DCACHE_OP_PRUNE)
> +               dentry->d_op->d_prune(dentry);
>         dentry_lru_del(dentry);
>         __d_drop(dentry);
>         spin_unlock(&dentry->d_lock);
> @@ -912,6 +916,8 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
>                                             d_u.d_child) {
>                                 spin_lock_nested(&loop->d_lock,
>                                                 DENTRY_D_LOCK_NESTED);
> +                               if (dentry->d_op->d_flags & DCACHE_OP_PRUNE)
> +                                       dentry->d_op->d_prune(dentry);
>                                 dentry_lru_del(loop);
>                                 __d_drop(loop);
>                                 spin_unlock(&loop->d_lock);
> @@ -1375,6 +1381,8 @@ void d_set_d_op(struct dentry *dentry, const struct dentry_operations *op)
>                 dentry->d_flags |= DCACHE_OP_REVALIDATE;
>         if (op->d_delete)
>                 dentry->d_flags |= DCACHE_OP_DELETE;
> +       if (op->d_prune)
> +               dentry->d_flags |= DCACHE_OP_PRUNE;
>  
>  }
>  EXPORT_SYMBOL(d_set_d_op);
> diff --git a/include/linux/dcache.h b/include/linux/dcache.h
> index f958c19..1e83bd8 100644
> --- a/include/linux/dcache.h
> +++ b/include/linux/dcache.h
> @@ -165,6 +165,7 @@ struct dentry_operations {
>                         unsigned int, const char *, const struct qstr *);
>         int (*d_delete)(const struct dentry *);
>         void (*d_release)(struct dentry *);
> +       void (*d_prune)(struct dentry *);
>         void (*d_iput)(struct dentry *, struct inode *);
>         char *(*d_dname)(struct dentry *, char *, int);
>         struct vfsmount *(*d_automount)(struct path *);
> @@ -219,6 +220,8 @@ struct dentry_operations {
>  #define DCACHE_MANAGED_DENTRY \
>         (DCACHE_MOUNTED|DCACHE_NEED_AUTOMOUNT|DCACHE_MANAGE_TRANSIT)
>  
> +#define DCACHE_OP_PRUNE         0x80000
> +
>  extern seqlock_t rename_lock;
>  
>  static inline int dname_external(struct dentry *dentry)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@...r.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@...r.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ