Date:   Wed, 26 Feb 2020 13:45:07 -0800
From:   Matthew Wilcox <willy@...radead.org>
To:     Andreas Dilger <adilger@...ger.ca>
Cc:     Waiman Long <longman@...hat.com>,
        Alexander Viro <viro@...iv.linux.org.uk>,
        Jonathan Corbet <corbet@....net>,
        Luis Chamberlain <mcgrof@...nel.org>,
        Kees Cook <keescook@...omium.org>,
        Iurii Zaikin <yzaikin@...gle.com>,
        Linux Kernel Mailing List <linux-kernel@...r.kernel.org>,
        Linux FS Devel <linux-fsdevel@...r.kernel.org>,
        linux-doc@...r.kernel.org,
        Mauro Carvalho Chehab <mchehab+samsung@...nel.org>,
        Eric Biggers <ebiggers@...gle.com>,
        Dave Chinner <david@...morbit.com>,
        Eric Sandeen <sandeen@...hat.com>
Subject: Re: [PATCH 00/11] fs/dcache: Limit # of negative dentries

On Wed, Feb 26, 2020 at 02:28:50PM -0700, Andreas Dilger wrote:
> On Feb 26, 2020, at 9:29 AM, Matthew Wilcox <willy@...radead.org> wrote:
> > This is always the wrong approach.  A sysctl is just a way of blaming
> > the sysadmin for us not being very good at programming.
> > 
> > I agree that we need a way to limit the number of negative dentries.
> > But that limit needs to be dynamic and depend on how the system is being
> > used, not on how some overworked sysadmin has configured it.
> > 
> > So we need an initial estimate for the number of negative dentries that
> > we need for good performance.  Maybe it's 1000.  It doesn't really matter;
> > it's going to change dynamically.
> > 
> > Then we need a metric to let us know whether it needs to be increased.
> > Perhaps that's "number of new negative dentries created in the last
> > second".  And we need to decide how much to increase it; maybe it's by
> > 50% or maybe by 10%.  Perhaps somewhere between 10-100% depending on
> > how high the recent rate of negative dentry creation has been.
> > 
> > We also need a metric to let us know whether it needs to be decreased.
> > I'm reluctant to say that memory pressure should be that metric because
> > very large systems can let the number of dentries grow in an unbounded
> > way.  Perhaps that metric is "number of hits in the negative dentry
> > cache in the last ten seconds".  Again, we'll need to decide how much
> > to shrink the target number by.
> 
> OK, so now instead of a single tunable parameter we need three, because
> these numbers are totally made up and nobody knows the right values. :-)
> Defaulting the limit to "disabled/no limit" also has the problem that
> 99.99% of users won't even know this tunable exists, let alone how to
> set it correctly, so they will continue to see these problems, and the
> code may as well not exist (i.e. pure overhead), while Waiman has a
> better idea today of what would be reasonable defaults.

I never said "no limit".  I just said to start at some fairly random
value and not worry about where you start because it'll correct to where
this system needs it to be.  As long as it converges like loadavg does,
it'll be fine.  It needs a fairly large "don't change the target" area,
and it needs to react quickly to real changes in a system's workload.
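
Something with roughly this shape, say -- strictly a sketch of the
feedback loop; every name, threshold and percentage below is invented
rather than taken from Waiman's patches:

/*
 * Sketch only: a target that moves only outside a dead band, grows
 * aggressively while negative dentries are being created at a high
 * rate, and decays slowly once they stop being useful.
 */
#define GROW_PCT        50      /* grow by up to 50% per interval */
#define SHRINK_PCT      10      /* shrink more slowly than we grow */

static unsigned long neg_target = 1000;         /* arbitrary start */

/* Run once per sampling interval, loadavg-style. */
static void neg_target_adjust(unsigned long created_last_sec,
                              unsigned long hits_last_10s)
{
        /* Crude, made-up thresholds defining the dead band. */
        unsigned long high_water = neg_target / 10;
        unsigned long low_water  = neg_target / 100;

        if (created_last_sec > high_water)
                neg_target += neg_target * GROW_PCT / 100;
        else if (created_last_sec < low_water && hits_last_10s < low_water)
                neg_target -= neg_target * SHRINK_PCT / 100;
        /* anywhere in between: leave the target alone */
}

The point is the asymmetry: a wide dead band so noise doesn't move the
target, fast growth when the workload really changes, slow decay.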

> I definitely agree that a single fixed value will be wrong for every
> system except the original developer's.  Making the maximum default to
> some reasonable fraction of the system size, rather than a fixed value,
> is probably best to start.  Something like this as a starting point:
> 
> 	/* Allow a reasonable minimum number of negative entries,
> 	 * but proportionately more if the directory/dcache is large.
> 	 */
> 	dir_negative_max = max(num_dir_entries / 16, 1024);
> 	total_negative_max = max(totalram_pages / 32, total_dentries / 8);

Those kinds of things are garbage on large machines.  With a terabyte
of RAM, you can end up with tens of millions of dentries clogging up
the system.  There _is_ an upper limit on the useful number of dentries
to keep around.
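
To put numbers on it: a terabyte of RAM is about 268 million 4KiB
pages, so totalram_pages / 32 alone already allows roughly 8 million
negative dentries before anything pushes back.  Any proportional
heuristic needs an absolute ceiling somewhere; for example (the cap
value below is invented):

#define NEG_DENTRY_HARD_CAP     (1UL << 20)     /* invented ceiling */

static unsigned long neg_dentry_limit(unsigned long totalram_pages,
                                      unsigned long total_dentries)
{
        /* The proposed heuristic, scaling with RAM and dcache size... */
        unsigned long limit = totalram_pages / 32;

        if (limit < total_dentries / 8)
                limit = total_dentries / 8;

        /* ...which grows without bound unless clamped somewhere. */
        return limit < NEG_DENTRY_HARD_CAP ? limit : NEG_DENTRY_HARD_CAP;
}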

> (Waiman should decide actual values based on where the problem was hit
> previously), and include tunables to change the limits for testing.
> 
> Ideally there would also be a dir ioctl that allows fetching the current
> positive/negative entry count on a directory (e.g. /usr/bin, /usr/lib64,
> /usr/share/man/man*) to see what these values are.  Otherwise there is
> no way to determine whether the limits used are any good or not.

It definitely needs to be instrumented for testing, but no, not ioctls.
Tracepoints, perhaps.
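
For example, a single tracepoint along these lines (header, event name
and fields are all hypothetical, not from any posted patch) would let
testers watch the target move without freezing anything into an ioctl
ABI:

/* Hypothetical include/trace/events/dcache_neg.h -- illustrative only. */
#undef TRACE_SYSTEM
#define TRACE_SYSTEM dcache_neg

#if !defined(_TRACE_DCACHE_NEG_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_DCACHE_NEG_H

#include <linux/tracepoint.h>

TRACE_EVENT(neg_dentry_target_adjust,

        TP_PROTO(unsigned long old_target, unsigned long new_target,
                 unsigned long nr_negative),

        TP_ARGS(old_target, new_target, nr_negative),

        TP_STRUCT__entry(
                __field(unsigned long, old_target)
                __field(unsigned long, new_target)
                __field(unsigned long, nr_negative)
        ),

        TP_fast_assign(
                __entry->old_target = old_target;
                __entry->new_target = new_target;
                __entry->nr_negative = nr_negative;
        ),

        TP_printk("target %lu -> %lu (nr_negative=%lu)",
                  __entry->old_target, __entry->new_target,
                  __entry->nr_negative)
);

#endif /* _TRACE_DCACHE_NEG_H */

/* This part must be outside protection */
#include <trace/define_trace.h>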

> Dynamic limits are hard to get right, and incorrect state machines can lead
> to wild swings in behaviour due to unexpected feedback.  It isn't clear to
> me that adjusting the limit based on the current rate of negative dentry
> creation even makes sense.  If there are a lot of negative entries being
> created, that is when you'd want to _stop_ allowing more to be added.

That doesn't make sense.  What you really want to know is "If my dcache
had twice as many entries in it, would that significantly reduce the
thrash of new entries being created?"  In the page cache, we end up
with a double LRU where once-used entries fall off the list quickly
but twice-or-more used entries get to stay around for a bit longer.
Maybe we could do something like that; keep a victim cache for recently
evicted dentries, and if we get a large hit rate in the victim cache,
expand the size of the primary cache.
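
A back-of-the-envelope sketch of that victim cache, with invented names
and sizes and none of the locking a real dcache change would need:

#define VICTIM_SLOTS    4096                    /* invented; power of two */

static unsigned long victim_tag[VICTIM_SLOTS];  /* hashes of evicted names */
static unsigned long victim_hits, misses;
static unsigned long neg_target = 1000;

/* Remember a negative dentry's name hash as it is evicted. */
static void victim_record(unsigned long name_hash)
{
        victim_tag[name_hash & (VICTIM_SLOTS - 1)] = name_hash;
}

/* On a dcache miss: would a bigger cache have kept this entry? */
static void victim_check(unsigned long name_hash)
{
        if (victim_tag[name_hash & (VICTIM_SLOTS - 1)] == name_hash)
                victim_hits++;
        else
                misses++;
}

/* Periodically: a high victim hit rate means the primary cache is too small. */
static void victim_maybe_expand(void)
{
        /* More than 10% of recent dcache misses were in the victim cache. */
        if (victim_hits * 10 > victim_hits + misses)
                neg_target += neg_target / 2;
        victim_hits = misses = 0;
}

A hit in the victim table means "a larger primary cache would have kept
this entry", which is exactly the signal for growing the target.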

