linux-kernel - Re: [rfc] superblock shrinker accumulating excessive deferred counts

Message-ID: <alpine.DEB.2.10.1707171322100.123090@chino.kir.corp.google.com>
Date:   Mon, 17 Jul 2017 13:37:35 -0700 (PDT)
From:   David Rientjes <rientjes@...gle.com>
To:     Dave Chinner <david@...morbit.com>
cc:     Alexander Viro <viro@...iv.linux.org.uk>,
        Greg Thelen <gthelen@...gle.com>,
        Andrew Morton <akpm@...gle.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Hugh Dickins <hughd@...gle.com>, linux-kernel@...r.kernel.org
Subject: Re: [rfc] superblock shrinker accumulating excessive deferred
 counts

On Mon, 17 Jul 2017, Dave Chinner wrote:

> > This is a side effect of super_cache_count() returning the appropriate 
> > count but super_cache_scan() refusing to do anything about it and 
> > immediately terminating with SHRINK_STOP, mostly for GFP_NOFS allocations.
> 
> Yup. Happens during things like memory allocations in filesystem
> transaction context. e.g. when your memory pressure is generated by
> GFP_NOFS allocations within transactions whilst doing directory
> traversals (say 'chown -R' across an entire filesystem), then we
> can't do direct reclaim on the caches that are generating the memory
> pressure and so have to defer all the work to either kswapd or the
> next GFP_KERNEL allocation context that triggers reclaim.
> 

Thanks for looking into this, Dave!

The number of GFP_NOFS allocations that build up the deferred counts can 
be unbounded, however, so this can become excessive, and the oom killer 
will not kill any processes in this context.  Although the motivation to 
do additional reclaim because of past GFP_NOFS reclaim attempts is 
worthwhile, I think it should be limited because currently it only 
increases until something is able to start draining these excess counts.  
Having 10,000 GFP_NOFS reclaim attempts store up 
(2 * nr_scanned * freeable) / (nr_eligible + 1) objects 10,000 times 
such that it exceeds freeable by many orders of magnitude doesn't seem 
like a particularly useful thing.  For reference, we have seen 
nr_deferred for a single node exceed 10,000,000,000 in practice.  
total_scan is limited to 
2 * freeable for each call to do_shrink_slab(), but such an excessive 
deferred count will guarantee it retries 2 * freeable each time instead of 
the proportion of lru scanned as intended.
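
To make the windup concrete, here is a rough userspace sketch of the accounting as described in this message.  It is not the in-tree do_shrink_slab(): struct shrinker_model, shrink_delta(), shrink_pass() and the can_scan flag are hypothetical names, and any batching or further clamping the real function applies is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace model: one shrinker's deferred count for one node. */
struct shrinker_model {
	uint64_t nr_deferred;
};

/* Per-call work estimate, using the formula quoted in this message. */
static uint64_t shrink_delta(uint64_t nr_scanned, uint64_t nr_eligible,
			     uint64_t freeable)
{
	return (2 * nr_scanned * freeable) / (nr_eligible + 1);
}

/*
 * One reclaim pass.  can_scan == 0 models a GFP_NOFS caller, for which
 * the scan callback bails out with SHRINK_STOP and nothing is scanned.
 * Returns the number of objects scanned in this pass.
 */
static uint64_t shrink_pass(struct shrinker_model *s, uint64_t nr_scanned,
			    uint64_t nr_eligible, uint64_t freeable,
			    int can_scan)
{
	uint64_t total_scan = s->nr_deferred +
			      shrink_delta(nr_scanned, nr_eligible, freeable);
	uint64_t next_deferred = total_scan;
	uint64_t scanned = 0;

	/* total_scan is limited to 2 * freeable per call. */
	if (total_scan > 2 * freeable)
		total_scan = 2 * freeable;

	if (can_scan)
		scanned = total_scan;

	/* Whatever was not scanned is carried over to the next call. */
	assert(scanned <= next_deferred);
	s->nr_deferred = next_deferred - scanned;
	return scanned;
}
```

With made-up inputs nr_scanned = 32, nr_eligible = 1023 and freeable = 100,000, each pass adds 6,250, so 10,000 passes with can_scan == 0 leave nr_deferred at 62,500,000, which is 625 times freeable.  A following pass that can scan removes only 200,000 of that.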

What breaks if we limit the nr_deferred counts to freeable * 4, for 
example?
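
As a sketch of what such a limit could look like (not a patch, and not tested against the kernel): clamp_deferred(), deferred_after_nofs() and DEFERRED_CAP_FACTOR are hypothetical names for a userspace model, and the factor of 4 is only the example value from the question above.

```c
#include <assert.h>
#include <stdint.h>

/* The "freeable * 4" limit floated above; the factor is only an example. */
#define DEFERRED_CAP_FACTOR 4

/* Clamp a deferred count before it is stored back for the next call. */
static uint64_t clamp_deferred(uint64_t next_deferred, uint64_t freeable)
{
	uint64_t cap;

	assert(freeable <= UINT64_MAX / DEFERRED_CAP_FACTOR); /* no overflow */
	cap = DEFERRED_CAP_FACTOR * freeable;
	return next_deferred > cap ? cap : next_deferred;
}

/*
 * Deferred count after `passes` reclaim attempts that cannot scan
 * (GFP_NOFS), each adding `delta`, with the clamp applied on every store.
 */
static uint64_t deferred_after_nofs(uint64_t passes, uint64_t delta,
				    uint64_t freeable)
{
	uint64_t nr_deferred = 0;
	uint64_t i;

	for (i = 0; i < passes; i++)
		nr_deferred = clamp_deferred(nr_deferred + delta, freeable);
	return nr_deferred;
}
```

With an illustrative delta of 6,250 and freeable of 100,000, 10,000 such attempts leave 400,000 deferred rather than 62,500,000.  At 2 * freeable per pass, that backlog is gone after about two passes that are allowed to scan.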

> > and no matter how much __GFP_FS scanning is done 
> > capped by total_scan, we can never fully get down to batch_count == 1024.
> 
> I don't see a batch_count variable in the shrinker code anywhere,
> so I'm not sure what you mean by this.
> 

batch_size == 1024, sorry.

> Can you post a shrinker trace that shows the deferred count wind
> up and then display the problem you're trying to describe?
> 

The symptom is that all threads contend on the list_lru's nlru->lock: 
they are all stuck in super_cache_count() while one thread is iterating 
through an excessive number of deferred objects in super_cache_scan(), 
contending for the same locks, and nr_deferred never substantially goes 
down.

The problem with the superblock shrinker, which is why I emailed Al 
originally, is also that it is SHRINKER_MEMCG_AWARE.  Our 
list_lru_shrink_count() is only representative for the list_lru of 
sc->memcg, which is used in both super_cache_count() and 
super_cache_scan() for various math.  The nr_deferred counts from the 
do_shrink_slab() logic, however, are per-nid and, as such, various memcgs 
get penalized with excessive counts for objects that they never had 
freeable to begin with.
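
The mismatch can be illustrated with a small userspace model, assuming the scan target is the node-wide deferred count plus the memcg's own delta, limited to 2 * that memcg's freeable.  memcg_scan_target() and its parameters are hypothetical names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scan target for one memcg on a node: the node-wide deferred count plus
 * this memcg's own delta, limited to 2 * this memcg's freeable.
 */
static uint64_t memcg_scan_target(uint64_t node_nr_deferred, uint64_t delta,
				  uint64_t memcg_freeable)
{
	uint64_t total_scan;

	assert(node_nr_deferred <= UINT64_MAX - delta); /* no overflow */
	total_scan = node_nr_deferred + delta;
	if (total_scan > 2 * memcg_freeable)
		total_scan = 2 * memcg_freeable;
	return total_scan;
}
```

For a memcg with 100 freeable objects and a delta of 6 (illustrative numbers), the target is 6 when the node has nothing deferred.  Once other memcgs have wound the node's count up to, say, 62,500,000, the same memcg is asked to scan 200, its full 2 * freeable limit.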