linux-kernel - Re: [rfc] superblock shrinker accumulating excessive deferred counts

Message-ID: <20170717050638.GH17762@dastard>
Date:   Mon, 17 Jul 2017 15:06:38 +1000
From:   Dave Chinner <david@...morbit.com>
To:     David Rientjes <rientjes@...gle.com>
Cc:     Alexander Viro <viro@...iv.linux.org.uk>,
        Greg Thelen <gthelen@...gle.com>,
        Andrew Morton <akpm@...gle.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        linux-kernel@...r.kernel.org
Subject: Re: [rfc] superblock shrinker accumulating excessive deferred counts

On Wed, Jul 12, 2017 at 01:42:35PM -0700, David Rientjes wrote:
> Hi Al and everyone,
> 
> We're encountering an issue where the per-shrinker per-node deferred 
> counts grow excessively large for the superblock shrinker.  This appears 
> to be long-standing behavior, so reaching out to you to see if there's any 
> subtleties being overlooked since there is a reference to memory pressure 
> and GFP_NOFS allocations growing total_scan purposefully.

There are plenty of land mines^W^Wsubtleties in this code....

> This is a side effect of super_cache_count() returning the appropriate 
> count but super_cache_scan() refusing to do anything about it and 
> immediately terminating with SHRINK_STOP, mostly for GFP_NOFS allocations.

Yup. Happens during things like memory allocations in filesystem
transaction context. e.g. when your memory pressure is generated by
GFP_NOFS allocations within transactions whilst doing directory
traversals (say 'chown -R' across an entire filesystem), then we
can't do direct reclaim on the caches that are generating the memory
pressure and so have to defer all the work to either kswapd or the
next GFP_KERNEL allocation context that triggers reclaim.

> An unlucky thread will grab the per-node shrinker->nr_deferred[nid] count 
> and increase it by
> 
> 	(2 * nr_scanned * super_cache_count()) / (nr_eligible + 1)
> 
> While total_scan is capped to a sane limit, and restricts the amount of 
> scanning that this thread actually does, if super_cache_scan() immediately 
> responds with SHRINK_STOP because of GFP_NOFS, the end result of doing any 
> of this is that nr_deferred just increased. 

Yes, by design.

> If we have a burst of 
> GFP_NOFS allocations, this grows it potentially very largely, which we 
> have seen in practice,

Yes, by design.
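
A minimal userspace sketch of the windup discussed above, assuming the
delta formula quoted earlier in this message. The helper names
(scan_delta, model_scan, nofs_windup) are hypothetical and are not
kernel functions; the model only covers the GFP_NOFS path, where the
scan callback bails out and all computed work is carried forward.

```c
#include <stdio.h>

#define MODEL_SHRINK_STOP (~0UL)	/* stand-in for the kernel's SHRINK_STOP */

/* Work computed for one reclaim pass, per the formula quoted above. */
static unsigned long scan_delta(unsigned long nr_scanned,
				unsigned long freeable,
				unsigned long nr_eligible)
{
	unsigned long long delta = 2ULL * nr_scanned * freeable;

	return (unsigned long)(delta / (nr_eligible + 1));
}

/* Hypothetical scan callback: refuses to do anything without __GFP_FS. */
static unsigned long model_scan(int fs_allowed, unsigned long nr_to_scan)
{
	if (!fs_allowed)
		return MODEL_SHRINK_STOP;
	return nr_to_scan;	/* pretend every object was freed */
}

/*
 * Run 'passes' reclaim passes from GFP_NOFS context. Each pass computes
 * a delta, the scan refuses to run, and the full total is deferred.
 * Returns the deferred count after the last pass.
 */
static unsigned long nofs_windup(unsigned long deferred, unsigned int passes,
				 unsigned long nr_scanned,
				 unsigned long freeable,
				 unsigned long nr_eligible)
{
	unsigned int i;

	for (i = 0; i < passes; i++) {
		unsigned long total_scan = deferred +
			scan_delta(nr_scanned, freeable, nr_eligible);

		if (model_scan(0, total_scan) == MODEL_SHRINK_STOP)
			deferred = total_scan;	/* nothing scanned, defer it all */
	}
	return deferred;
}
```

With nr_scanned = 32, freeable = 1000 and nr_eligible = 6399, each pass
defers 10 objects of work, so 100 consecutive GFP_NOFS passes leave a
deferred count equal to the whole cache even though no single pass
asked for much.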

> and no matter how much __GFP_FS scanning is done 
> capped by total_scan, we can never fully get down to batch_count == 1024.

I don't see a batch_count variable in the shrinker code anywhere,
so I'm not sure what you mean by this.

Can you post a shrinker trace that shows the deferred count wind
up and then display the problem you're trying to describe?

> This seems troublesome to me and my first inclination was to avoid 
> counting *any* objects at all for GFP_NOFS but then I notice the comment 
> in do_shrink_slab():
> 
> 	/*
> 	 * We need to avoid excessive windup on filesystem shrinkers
> 	 * due to large numbers of GFP_NOFS allocations causing the
> 	 * shrinkers to return -1 all the time. This results in a large
> 	 * nr being built up so when a shrink that can do some work
> 	 * comes along it empties the entire cache due to nr >>>
> 	 * freeable. This is bad for sustaining a working set in
> 	 * memory.
> 	 *
> 	 * Hence only allow the shrinker to scan the entire cache when
> 	 * a large delta change is calculated directly.
> 	 */
> 
> I assume the comment is referring to "excessive windup" only in terms of 
> total_scan, although it doesn't impact next_deferred at all.  The problem 
> here seems to be next_deferred always grows extremely large.

"excessive windup" means the deferred count kept growing without
bound and so when work was finally able to be done, the amount of
work deferred would trash the entire cache in one go. Think of a
spring - you can use it to smooth peaks and troughs in steady state
conditions, but transient conditions can wind the spring up so tight
that it can't be controlled when it is released. That's the
"excessive windup" part of the description above.

How do we control springs? By adding a damper to reduce the
speed at which it can react to large step changes, hence making it
harder to step outside the bounds of controlled behaviour. In this
case, the damper is the delta based clamping of total_scan.

i.e. light memory pressure generates small deltas, but we can have
so much GFP_NOFS allocation that we can still defer large amounts of
work. Under light memory pressure, we want to release this spring
more quickly than the current memory pressure indicates, but not so
fast that we create a great big explosion of work and unbalance the
system. It is more important to maintain the working set in light
memory pressure conditions than it is to free lots of memory.

However, if we have heavy memory pressure (e.g. priority has wound
up) then the delta scan will cross the trigger threshold of "do lots
of work now, we need the memory" and we'll dump the entire deferred
work count into this execution of the shrinker, because memory is
needed right now....
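
The damper described above can be sketched as a standalone function.
This is a simplified model, assuming the freeable / 4 trigger threshold
and the freeable / 2 and freeable * 2 caps used by do_shrink_slab() in
kernels of this era; the function name clamp_total_scan is hypothetical
and the real code operates on more state than this.

```c
#include <stdio.h>

/*
 * Limit how much deferred work one shrinker invocation may release.
 *
 * total_scan: deferred count plus the delta for this pass
 * delta:      work computed from the current memory pressure alone
 * freeable:   number of objects the cache reports as freeable
 */
static unsigned long clamp_total_scan(unsigned long total_scan,
				      unsigned long delta,
				      unsigned long freeable)
{
	/*
	 * Light pressure (small delta): release the spring slowly by
	 * allowing at most half the cache to be scanned in this pass.
	 */
	if (delta < freeable / 4 && total_scan > freeable / 2)
		total_scan = freeable / 2;

	/*
	 * Heavy pressure (large delta): the deferred work goes through,
	 * but never more than twice the cache size, to bound looping.
	 */
	if (total_scan > freeable * 2)
		total_scan = freeable * 2;

	return total_scan;
}
```

For a cache of 1000 freeable objects and 100000 objects of deferred
work, a delta of 10 is clamped to 500 while a delta of 300 crosses the
threshold and is only limited to 2000.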

> I'd like to do this, but am checking for anything subtle that this relies 
> on wrt memory pressure or implicit intended behavior.

If we *don't* count and defer the work that we should have done
under GFP_NOFS reclaim contexts, we end up with caches that memory
reclaim will not shrink until GFP_NOFS generated memory pressure
stops completely. This is, generally speaking, bad for application
performance because they get blocked waiting for memory to be freed
from caches that memory reclaim can't put any significant pressure
on.

OTOH, if we don't damp down the deferred count scanning on small
deltas, then we end up with filesystem caches being trashed in light
memory pressure conditions. This is, generally speaking, bad for
workloads that rely on filesystem caches for performance (e.g. git,
NFS servers, etc).

What we have now is effectively a brute force solution that finds
a decent middle ground most of the time. It's not perfect, but I'm
yet to find a better solution....

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com