Message-ID: <922c747c22b05a80a8350ac87b839eed0c79581f.camel@themaw.net>
Date:   Fri, 04 Jun 2021 09:07:43 +0800
From:   Ian Kent <raven@...maw.net>
To:     Miklos Szeredi <miklos@...redi.hu>
Cc:     Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Tejun Heo <tj@...nel.org>, Eric Sandeen <sandeen@...deen.net>,
        Fox Chen <foxhlchen@...il.com>,
        Brice Goglin <brice.goglin@...il.com>,
        Al Viro <viro@...iv.linux.org.uk>,
        Rick Lindsley <ricklind@...ux.vnet.ibm.com>,
        David Howells <dhowells@...hat.com>,
        Marcelo Tosatti <mtosatti@...hat.com>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        Kernel Mailing List <linux-kernel@...r.kernel.org>
Subject: Re: [REPOST PATCH v4 2/5] kernfs: use VFS negative dentry caching

On Fri, 2021-06-04 at 07:57 +0800, Ian Kent wrote:
> On Thu, 2021-06-03 at 10:15 +0800, Ian Kent wrote:
> > On Wed, 2021-06-02 at 18:57 +0800, Ian Kent wrote:
> > > On Wed, 2021-06-02 at 10:58 +0200, Miklos Szeredi wrote:
> > > > On Wed, 2 Jun 2021 at 05:44, Ian Kent <raven@...maw.net> wrote:
> > > > > 
> > > > > On Tue, 2021-06-01 at 14:41 +0200, Miklos Szeredi wrote:
> > > > > > > On Fri, 28 May 2021 at 08:34, Ian Kent <raven@...maw.net> wrote:
> > > > > > > 
> > > > > > > > If there are many lookups for non-existent paths these negative
> > > > > > > > lookups can lead to a lot of overhead during path walks.
> > > > > > > > 
> > > > > > > > The VFS allows dentries to be created as negative and hashed, and
> > > > > > > > caches them so they can be used to reduce the fairly high overhead
> > > > > > > > alloc/free cycle that occurs during these lookups.
> > > > > > 
> > > > > > > Obviously there's a cost associated with negative caching too.
> > > > > > > For normal filesystems it's trivially worth that cost, but in
> > > > > > > case of kernfs, not sure...
> > > > > > > 
> > > > > > > Can "fairly high" be somewhat substantiated with a microbenchmark
> > > > > > > for negative lookups?
> > > > > 
> > > > > > Well, maybe, but anything we do for a benchmark would be totally
> > > > > > artificial.
> > > > > > 
> > > > > > The reason I added this is because I saw appreciable contention
> > > > > > on the dentry alloc path in one case.
> > > > 
> > > > > If multiple tasks are trying to look up the same negative dentry in
> > > > > parallel, then there will be contention on the parent inode lock.
> > > > > Was this the issue?  This could easily be reproduced with an
> > > > > artificial benchmark.
> > > 
> > > > Not that I remember, I'll need to dig up the sysrq dumps to have a
> > > > look and get back to you.
> > 
> > After doing that though I could grab Fox Chen's reproducer and give
> > it varying sysfs paths as well as some percentage of non-existent
> > sysfs paths and see what I get ...
> > 
> > That should give it a more realistic usage profile and, if I can
> > get the percentage of non-existent paths right, demonstrate that
> > case as well ... but nothing is easy, so we'll have to wait and
> > see, ;)
> 
> Ok, so I grabbed Fox's benchmark repo and used a non-existent path
> to check the negative dentry contention.
> 
> I've taken the baseline readings and the contention I see is the
> same as I originally saw. It's with d_alloc_parallel() on lockref.
> 
> While I haven't run the patched check I'm pretty sure that using
> dget_parent() and taking a snapshot will move the contention to
> that. So if I do retain the negative dentry caching change I would
> need to use the dentry seq lock for it to be useful.
> 
> Thoughts Miklos, anyone?

Mmm ... never mind, I'd still need to take a snapshot anyway and
dget_parent() looks lightweight if there's no conflict. I will
need to test it.
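
For anyone who wants a rough number for the negative lookup cost asked
about earlier in the thread, here is a minimal userspace sketch. It is
not Fox Chen's reproducer; the names time_negative_lookups() and
struct neg_lookup_result are made up for illustration. It just calls
stat() on a path that should not exist and times the loop. A single
thread like this only measures the per-lookup cost, so to see the
d_alloc_parallel() contention several copies would need to run
concurrently against the same path.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical result type: not part of any existing benchmark. */
struct neg_lookup_result {
	long enoent;		/* lookups that failed with ENOENT */
	int64_t elapsed_ns;	/* wall time for the whole loop */
};

/* Monotonic clock reading in nanoseconds. */
static int64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/*
 * Repeatedly stat() a path that is expected not to exist and count
 * how many of the calls fail with ENOENT. Each call forces a path
 * walk that ends in a negative lookup.
 */
static struct neg_lookup_result time_negative_lookups(const char *path,
						      long iters)
{
	struct neg_lookup_result res = { 0, 0 };
	struct stat st;
	int64_t start;
	long i;

	assert(path != NULL && iters >= 0);

	start = now_ns();
	for (i = 0; i < iters; i++) {
		if (stat(path, &st) == -1 && errno == ENOENT)
			res.enoent++;
	}
	res.elapsed_ns = now_ns() - start;

	return res;
}
```

Calling time_negative_lookups() with a missing sysfs path of your
choice and, say, a million iterations, then dividing elapsed_ns by
iters, gives an approximate per-lookup cost that can be compared
between a baseline and a patched kernel.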

> 
> > 
> > > 
> > > > 
> > > > > > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > > > > > > index 4c69e2af82dac..5151c712f06f5 100644
> > > > > > > --- a/fs/kernfs/dir.c
> > > > > > > +++ b/fs/kernfs/dir.c
> > > > > > > @@ -1037,12 +1037,33 @@ static int
> > > > > > > kernfs_dop_revalidate(struct
> > > > > > > dentry *dentry, unsigned int flags)
> > > > > > >         if (flags & LOOKUP_RCU)
> > > > > > >                 return -ECHILD;
> > > > > > > 
> > > > > > > -       /* Always perform fresh lookup for negatives */
> > > > > > > -       if (d_really_is_negative(dentry))
> > > > > > > -               goto out_bad_unlocked;
> > > > > > > +       mutex_lock(&kernfs_mutex);
> > > > > > > 
> > > > > > >         kn = kernfs_dentry_node(dentry);
> > > > > > > -       mutex_lock(&kernfs_mutex);
> > > > > > > +
> > > > > > > +       /* Negative hashed dentry? */
> > > > > > > +       if (!kn) {
> > > > > > > +               struct kernfs_node *parent;
> > > > > > > +
> > > > > > > +               /* If the kernfs node can be found this
> > > > > > > is
> > > > > > > a
> > > > > > > stale
> > > > > > > negative
> > > > > > > +                * hashed dentry so it must be discarded
> > > > > > > and
> > > > > > > the
> > > > > > > lookup redone.
> > > > > > > +                */
> > > > > > > +               parent = kernfs_dentry_node(dentry-
> > > > > > > > d_parent);
> > > > > > 
> > > > > > > This doesn't look safe WRT a racing sys_rename().  In this case
> > > > > > > d_move() is called only with parent inode locked, but not with
> > > > > > > kernfs_mutex while ->d_revalidate() may not have parent inode
> > > > > > > locked.  After d_move() the old parent dentry can be freed,
> > > > > > > resulting in use after free.  Easily fixed by dget_parent().
> > > > > 
> > > > > Umm ... I'll need some more explanation here ...
> > > > > 
> > > > > > We are in ref-walk mode so the parent dentry isn't going away.
> > > > 
> > > > > The parent that was used to lookup the dentry in __d_lookup() isn't
> > > > > going away.  But it's not necessarily equal to dentry->d_parent
> > > > > anymore.
> > > > 
> > > > > And this is a negative dentry so rename is going to bail out
> > > > > with ENOENT way early.
> > > > 
> > > > > You are right.  But note that negative dentry in question could be
> > > > > the target of a rename.  Current implementation doesn't switch the
> > > > > target's parent or name, but this wasn't always the case (commit
> > > > > 076515fc9267 ("make non-exchanging __d_move() copy ->d_parent rather
> > > > > than swap them")), so a backport of this patch could become incorrect
> > > > > on old enough kernels.
> > > 
> > > Right, that __lookup_hash() will find the negative target.
> > > 
> > > > 
> > > > So I still think using dget_parent() is the correct way to do
> > > > this.
> > > 
> > > The rename code does my head in, ;)
> > > 
> > > The dget_parent() would ensure we had an up to date parent so
> > > yes, that would be the right thing to do regardless.
> > > 
> > > > But now I'm not sure that will be sufficient for kernfs. I'm still
> > > > thinking about it.
> > > 
> > > I'm wondering if there's a missing check in there to account for
> > > what happens with revalidate after ->rename() but before move.
> > > There's already a kernfs node check in there so it's probably ok
> > > ...
> > >  
> > > > 
> > > > > > 
> > > > > > > +               if (parent) {
> > > > > > > +                       const void *ns = NULL;
> > > > > > > +
> > > > > > > +                       if (kernfs_ns_enabled(parent))
> > > > > > > +                               ns = kernfs_info(dentry-
> > > > > > > > d_sb)-
> > > > > > > > ns;
> > > > > > > +                       kn = kernfs_find_ns(parent,
> > > > > > > dentry-
> > > > > > > > d_name.name, ns);
> > > > > > 
> > > > > > > Same thing with d_name.  There's
> > > > > > > take_dentry_name_snapshot()/release_dentry_name_snapshot() to
> > > > > > > properly take care of that.
> > > > > 
> > > > > > I don't see that problem either, due to the dentry being negative,
> > > > > > but please explain what you're seeing here.
> > > > 
> > > > > Yeah.  Negative dentries' names weren't always stable, but that was a
> > > > > long time ago (commit 8d85b4845a66 ("Allow sharing external names
> > > > > after __d_move()")).
> > > 
> > > Right, I'll make that change too.
> > > 
> > > > 
> > > > Thanks,
> > > > Miklos
> > > 
> > 
> 

