linux-kernel - Re: [PATCH v8 05/17] mm: Assign memcg-aware shrinkers bitmap to memcg

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <20180706173025.nkpq5o2yfdtb7d7x@esperanza>
Date:   Fri, 6 Jul 2018 20:30:25 +0300
From:   Vladimir Davydov <vdavydov.dev@...il.com>
To:     Andrew Morton <akpm@...ux-foundation.org>
Cc:     Kirill Tkhai <ktkhai@...tuozzo.com>, shakeelb@...gle.com,
        viro@...iv.linux.org.uk, hannes@...xchg.org, mhocko@...nel.org,
        tglx@...utronix.de, pombredanne@...b.com, stummala@...eaurora.org,
        gregkh@...uxfoundation.org, sfr@...b.auug.org.au, guro@...com,
        mka@...omium.org, penguin-kernel@...ove.SAKURA.ne.jp,
        chris@...is-wilson.co.uk, longman@...hat.com, minchan@...nel.org,
        ying.huang@...el.com, mgorman@...hsingularity.net, jbacik@...com,
        linux@...ck-us.net, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, willy@...radead.org, lirongqing@...du.com,
        aryabinin@...tuozzo.com
Subject: Re: [PATCH v8 05/17] mm: Assign memcg-aware shrinkers bitmap to memcg

On Tue, Jul 03, 2018 at 01:50:00PM -0700, Andrew Morton wrote:
> On Tue, 03 Jul 2018 18:09:26 +0300 Kirill Tkhai <ktkhai@...tuozzo.com> wrote:
> 
> > Imagine a big node with many cpus, memory cgroups and containers.
> > Let we have 200 containers, every container has 10 mounts,
> > and 10 cgroups. All container tasks don't touch foreign
> > containers mounts. If there is intensive pages write,
> > and global reclaim happens, a writing task has to iterate
> > over all memcgs to shrink slab, before it's able to go
> > to shrink_page_list().
> > 
> > Iteration over all the memcg slabs is very expensive:
> > the task has to visit 200 * 10 = 2000 shrinkers
> > for every memcg, and since there are 2000 memcgs,
> > the total calls are 2000 * 2000 = 4000000.
> > 
> > So, the shrinker makes 4 million do_shrink_slab() calls
> > just to try to isolate SWAP_CLUSTER_MAX pages in one
> > of the actively writing memcg via shrink_page_list().
> > I've observed a node spending almost 100% in kernel,
> > making useless iteration over already shrinked slab.
> > 
> > This patch adds bitmap of memcg-aware shrinkers to memcg.
> > The size of the bitmap depends on bitmap_nr_ids, and during
> > memcg life it's maintained to be enough to fit bitmap_nr_ids
> > shrinkers. Every bit in the map is related to corresponding
> > shrinker id.
> > 
> > Next patches will maintain set bit only for really charged
> > memcg. This will allow shrink_slab() to increase its
> > performance in significant way. See the last patch for
> > the numbers.
> > 
> > ...
> >
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -182,6 +182,11 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> >  	if (id < 0)
> >  		goto unlock;
> >  
> > +	if (memcg_expand_shrinker_maps(id)) {
> > +		idr_remove(&shrinker_idr, id);
> > +		goto unlock;
> > +	}
> > +
> >  	if (id >= shrinker_nr_max)
> >  		shrinker_nr_max = id + 1;
> >  	shrinker->id = id;
> 
> This function ends up being a rather sad little thing.
> 
> : static int prealloc_memcg_shrinker(struct shrinker *shrinker)
> : {
> : 	int id, ret = -ENOMEM;
> : 
> : 	down_write(&shrinker_rwsem);
> : 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
> : 	if (id < 0)
> : 		goto unlock;
> : 
> : 	if (memcg_expand_shrinker_maps(id)) {
> : 		idr_remove(&shrinker_idr, id);
> : 		goto unlock;
> : 	}
> : 
> : 	if (id >= shrinker_nr_max)
> : 		shrinker_nr_max = id + 1;
> : 	shrinker->id = id;
> : 	ret = 0;
> : unlock:
> : 	up_write(&shrinker_rwsem);
> : 	return ret;
> : }
> 
> - there's no need to call memcg_expand_shrinker_maps() unless id >=
>   shrinker_nr_max so why not move the code and avoid calling
>   memcg_expand_shrinker_maps() in most cases.

memcg_expand_shrinker_maps will return immediately if per memcg shrinker
maps can accommodate the new id. Since prealloc_memcg_shrinker is
definitely not a hot path, I don't see any penalty in calling this
function on each prealloc_memcg_shrinker invocation.

> 
> - why aren't we decreasing shrinker_nr_max in
>   unregister_memcg_shrinker()?  That's easy to do, avoids pointless
>   work in shrink_slab_memcg() and avoids memory waste in future
>   prealloc_memcg_shrinker() calls.

We can shrink the maps, but IMHO it isn't worth the complexity it would
introduce, because in my experience if a workload used N mount points
(containers, whatever) at some point of its lifetime, it is likely to
use the same amount in the future.

> 
>   It should be possible to find the highest ID in an IDR tree with a
>   straightforward descent of the underlying radix tree, but I doubt if
>   that has been wired up.  Otherwise a simple loop in
>   unregister_memcg_shrinker() would be needed.
> 
>