linux-kernel - [rfc] superblock shrinker accumulating excessive deferred counts

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [thread-next>] [day] [month] [year] [list]

Message-ID: <alpine.DEB.2.10.1707121329400.57341@chino.kir.corp.google.com>
Date:   Wed, 12 Jul 2017 13:42:35 -0700 (PDT)
From:   David Rientjes <rientjes@...gle.com>
To:     Alexander Viro <viro@...iv.linux.org.uk>
cc:     Greg Thelen <gthelen@...gle.com>, Andrew Morton <akpm@...gle.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Dave Chinner <david@...morbit.com>,
        linux-kernel@...r.kernel.org
Subject: [rfc] superblock shrinker accumulating excessive deferred counts

Hi Al and everyone,

We're encountering an issue where the per-shrinker per-node deferred 
counts grow excessively large for the superblock shrinker.  This appears 
to be long-standing behavior, so reaching out to you to see if there's any 
subtleties being overlooked since there is a reference to memory pressure 
and GFP_NOFS allocations growing total_scan purposefully.

This is a side effect of super_cache_count() returning the appropriate 
count but super_cache_scan() refusing to do anything about it and 
immediately terminating with SHRINK_STOP, mostly for GFP_NOFS allocations.

An unlucky thread will grab the per-node shrinker->nr_deferred[nid] count 
and increase it by

	(2 * nr_scanned * super_cache_count()) / (nr_eligible + 1)

While total_scan is capped to a sane limit, and restricts the amount of 
scanning that this thread actually does, if super_cache_scan() immediately 
responds with SHRINK_STOP because of GFP_NOFS, the end result of doing any 
of this is that nr_deferred just increased.  If we have a burst of 
GFP_NOFS allocations, this grows it potentially very largely, which we 
have seen in practice, and no matter how much __GFP_FS scanning is done 
capped by total_scan, we can never fully get down to batch_count == 1024.

This seems troublesome to me and my first inclination was to avoid 
counting *any* objects at all for GFP_NOFS but then I notice the comment 
in do_shrink_slab():

	/*
	 * We need to avoid excessive windup on filesystem shrinkers
	 * due to large numbers of GFP_NOFS allocations causing the
	 * shrinkers to return -1 all the time. This results in a large
	 * nr being built up so when a shrink that can do some work
	 * comes along it empties the entire cache due to nr >>>
	 * freeable. This is bad for sustaining a working set in
	 * memory.
	 *
	 * Hence only allow the shrinker to scan the entire cache when
	 * a large delta change is calculated directly.
	 */

I assume the comment is referring to "excessive windup" only in terms of 
total_scan, although it doesn't impact next_deferred at all.  The problem 
here seems to be next_deferred always grows extremely large.

I'd like to do this, but am checking for anything subtle that this relies 
on wrt memory pressure or implict intended behavior.

Thanks for looking at this!
---
 fs/super.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/super.c b/fs/super.c
--- a/fs/super.c
+++ b/fs/super.c
@@ -65,13 +65,6 @@ static unsigned long super_cache_scan(struct shrinker *shrink,

 	sb = container_of(shrink, struct super_block, s_shrink);

-	/*
-	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
-	 * to recurse into the FS that called us in clear_inode() and friends..
-	 */
-	if (!(sc->gfp_mask & __GFP_FS))
-		return SHRINK_STOP;
-
 	if (!trylock_super(sb))
 		return SHRINK_STOP;

@@ -116,6 +109,13 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 	struct super_block *sb;
 	long	total_objects = 0;

+	/*
+	 * Deadlock avoidance.  We may hold various FS locks, and we don't want
+	 * to recurse into the FS that called us in clear_inode() and friends..
+	 */
+	if (!(sc->gfp_mask & __GFP_FS))
+		return 0;
+
 	sb = container_of(shrink, struct super_block, s_shrink);

 	/*