Message-ID: <alpine.DEB.2.10.1707181718230.55764@chino.kir.corp.google.com>
Date:   Tue, 18 Jul 2017 17:28:14 -0700 (PDT)
From:   David Rientjes <rientjes@...gle.com>
To:     Dave Chinner <david@...morbit.com>
cc:     Alexander Viro <viro@...iv.linux.org.uk>,
        Greg Thelen <gthelen@...gle.com>,
        Andrew Morton <akpm@...gle.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        Hugh Dickins <hughd@...gle.com>, linux-kernel@...r.kernel.org
Subject: Re: [rfc] superblock shrinker accumulating excessive deferred
 counts

On Tue, 18 Jul 2017, Dave Chinner wrote:

> > Thanks for looking into this, Dave!
> > 
> > The number of GFP_NOFS allocations that build up the deferred counts can 
> > be unbounded, however, so this can become excessive, and the oom killer 
> > will not kill any processes in this context.  Although the motivation to 
> > do additional reclaim because of past GFP_NOFS reclaim attempts is 
> > worthwhile, I think it should be limited because currently it only 
> > increases until something is able to start draining these excess counts.  
> 
> Usually kswapd is kicked in by this point and starts doing work. Why
> isn't kswapd doing the shrinker work in the background?
> 

It is, and it often gets preempted itself while doing lru scanning or 
shrink_slab(), most often in super_cache_count() itself.  The issue is 
that it gets preempted by networking packets being processed in irq 
context, which ends up eating into GFP_ATOMIC memory.  One key trait of 
these events is that per-zone free memory is far below the min 
watermarks, so there is insufficient memory not only for GFP_NOFS 
allocations but for GFP_ATOMIC allocations as well.  Kswapd only shrinks 
slab in proportion to the lru pages scanned unless it is lucky enough to 
grab the excess nr_deferred, and meanwhile other threads keep increasing 
it.
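
To make the windup concrete, here is a tiny userspace toy model of the 
deferral logic, simplified from my reading of do_shrink_slab() in 
mm/vmscan.c -- the names and constants are illustrative, not the 
kernel's.  The point is that the work handed to any single caller is 
capped at twice the freeable count, but the backlog that gets re-added 
is not, so a stream of callers that cannot scan grows nr_deferred 
without bound:

#include <stdio.h>

/* Toy model of the nr_deferred windup; illustrative, not kernel code. */
struct toy_shrinker {
	long nr_deferred;	/* models shrinker->nr_deferred[nid] */
};

/* One pass through the deferral logic for a single caller. */
static long toy_shrink(struct toy_shrinker *s, long delta, long freeable,
		       int can_scan)
{
	long nr = s->nr_deferred;	/* atomic_long_xchg(..., 0) */
	long total_scan = nr + delta;
	long scanned = 0;

	if (total_scan > freeable * 2)	/* work per caller is capped... */
		total_scan = freeable * 2;
	if (can_scan)			/* __GFP_FS and trylock succeeded */
		scanned = total_scan;

	/* ...but the re-added backlog (nr + delta - scanned) is not */
	s->nr_deferred = (nr + delta > scanned) ? nr + delta - scanned : 0;
	return scanned;
}

int main(void)
{
	struct toy_shrinker s = { 0 };
	long freeable = 1000;
	int i;

	/* thousands of GFP_NOFS allocations in flight, none may scan */
	for (i = 0; i < 10000; i++)
		toy_shrink(&s, 100, freeable, 0);

	printf("deferred backlog: %ld (freeable is only %ld)\n",
	       s.nr_deferred, freeable);
	printf("next __GFP_FS caller scans: %ld\n",
	       toy_shrink(&s, 100, freeable, 1));
	return 0;
}

With these numbers the backlog reaches 1000000 against a freeable count 
of only 1000, while any single scanning caller is still capped at 2000: 
that is the kind of excess I'm describing.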

This happens across various workloads, and I can't show a specific 
example of GFP_NOFS allocations in flight because we have made changes 
to prevent it: specifically, we now ignore nr_deferred counts for 
SHRINKER_MEMCG_AWARE shrinkers, since they are largely erroneous.  The 
same windup can also occur when we cannot grab the trylock on the 
superblock itself.
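
For reference, both windup sources are the two early exits at the top of 
super_cache_scan() in fs/super.c.  Trimmed from my memory of the 
4.12-era code, so treat it as a sketch rather than a verbatim quote; 
each SHRINK_STOP return causes the computed scan count to be added back 
onto nr_deferred:

static unsigned long super_cache_scan(struct shrinker *shrink,
				      struct shrink_control *sc)
{
	struct super_block *sb;

	sb = container_of(shrink, struct super_block, s_shrink);

	/*
	 * Deadlock avoidance: we may hold various FS locks, so we must
	 * not recurse into the filesystem for GFP_NOFS allocations.
	 */
	if (!(sc->gfp_mask & __GFP_FS))
		return SHRINK_STOP;

	/* the other windup source: s_umount is contended */
	if (!trylock_super(sb))
		return SHRINK_STOP;

	/* ... dentry/inode/fs-specific scanning elided ... */
}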

> Ugh. The per-node lru list count was designed to run unlocked and so
> avoid this sort of (known) scalability problem.
> 
> Ah, see the difference between list_lru_count_node() and
> list_lru_count_one(). list_lru_count_one() should only take locks
> for memcg lookups if it is trying to shrink a memcg. That needs to
> be fixed before anything else and, if possible, the memcg lookup be
> made lockless....
> 

We've done that as part of this fix, actually, by avoiding resizing of 
these list_lru's when the number of memcg cache ids increases.  We just 
preallocate the max amount, MEMCG_CACHES_MAX_SIZE, so reads can be 
lockless, since the lock there is only needed to prevent concurrent 
remapping.
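
The read side of our change ends up looking roughly like this -- a 
sketch against my reading of the 4.12-era mm/list_lru.c, not the exact 
patch we carry.  list_lru_add()/list_lru_del() still take nlru->lock to 
protect the lists themselves; only the counting path changes:

unsigned long list_lru_count_one(struct list_lru *lru,
				 int nid, struct mem_cgroup *memcg)
{
	struct list_lru_node *nlru = &lru->node[nid];
	struct list_lru_one *l;

	/*
	 * No spin_lock(&nlru->lock) here: with the per-memcg array
	 * preallocated at MEMCG_CACHES_MAX_SIZE it is never remapped,
	 * so the lookup and the nr_items read are safe without it.
	 */
	l = list_lru_from_memcg_idx(nlru, memcg_cache_id(memcg));
	return l->nr_items;
}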

> Yup, the memcg shrinking was shoe-horned into the per-node LRU
> infrastructure, and the high level accounting is completely unaware
> of the fact that memcgs have their own private LRUs. We left the
> windup in place because slab caches are shared, and it's possible
> that memory can't be freed because pages have objects from different
> memcgs pinning them. Hence we need to bleed at least some of that
> "we can't make progress" count back into the global "deferred
> reclaim" pool to get other contexts to do some reclaim.
> 

Right; for now we've patched our kernel to avoid looking at the 
nr_deferred count for SHRINKER_MEMCG_AWARE shrinkers, but that's 
obviously a short-term solution, and I'm not sure we can spare the 
memory tax of per-memcg per-node deferred counts.  It seems some other 
metadata would be needed in that case to indicate excessive windup for 
slab shrinking that cannot actually do any scanning in 
super_cache_scan().
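
To be explicit about the tax: per-memcg per-node windup would need 
something like the following for every registered shrinker, resized 
whenever memcg cache ids grow.  This is an entirely hypothetical layout 
with a made-up field name, just to show the scale:

/* Hypothetical: one deferred counter per (shrinker, node, memcg) */
struct shrinker {
	/* ... */
	/* today: nr_node_ids counters per shrinker */
	atomic_long_t *nr_deferred;
	/*
	 * hypothetically: nr_node_ids * nr_memcg_ids counters per
	 * registered shrinker, indexed as [nid][memcg_id]
	 */
	atomic_long_t **nr_deferred_memcg;
};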

Vladimir, do you have a suggestion, or is someone else working on this?
