linux-kernel - Re: [PATCH v8 05/17] mm: Assign memcg-aware shrinkers bitmap to memcg


Message-Id: <20180703135000.b2322ae0e514f028e7941d3c@linux-foundation.org>
Date:   Tue, 3 Jul 2018 13:50:00 -0700
From:   Andrew Morton <akpm@...ux-foundation.org>
To:     Kirill Tkhai <ktkhai@...tuozzo.com>
Cc:     vdavydov.dev@...il.com, shakeelb@...gle.com,
        viro@...iv.linux.org.uk, hannes@...xchg.org, mhocko@...nel.org,
        tglx@...utronix.de, pombredanne@...b.com, stummala@...eaurora.org,
        gregkh@...uxfoundation.org, sfr@...b.auug.org.au, guro@...com,
        mka@...omium.org, penguin-kernel@...ove.SAKURA.ne.jp,
        chris@...is-wilson.co.uk, longman@...hat.com, minchan@...nel.org,
        ying.huang@...el.com, mgorman@...hsingularity.net, jbacik@...com,
        linux@...ck-us.net, linux-kernel@...r.kernel.org,
        linux-mm@...ck.org, willy@...radead.org, lirongqing@...du.com,
        aryabinin@...tuozzo.com
Subject: Re: [PATCH v8 05/17] mm: Assign memcg-aware shrinkers bitmap to
 memcg

On Tue, 03 Jul 2018 18:09:26 +0300 Kirill Tkhai <ktkhai@...tuozzo.com> wrote:

> Imagine a big node with many CPUs, memory cgroups and containers.
> Suppose we have 200 containers, and every container has 10 mounts
> and 10 cgroups. No container task touches another container's
> mounts. If there is intensive page writing and global reclaim
> happens, a writing task has to iterate over all memcgs to shrink
> slab before it is able to get to shrink_page_list().
> 
> Iteration over all the memcg slabs is very expensive:
> the task has to visit 200 * 10 = 2000 shrinkers
> for every memcg, and since there are 2000 memcgs,
> the total calls are 2000 * 2000 = 4000000.
> 
> So, the shrinker makes 4 million do_shrink_slab() calls
> just to try to isolate SWAP_CLUSTER_MAX pages in one
> of the actively writing memcgs via shrink_page_list().
> I've observed a node spending almost 100% of its time in the
> kernel, uselessly iterating over already shrunk slabs.
> 
> This patch adds a bitmap of memcg-aware shrinkers to each memcg.
> The size of the bitmap depends on bitmap_nr_ids, and during the
> memcg's life it is kept large enough to fit bitmap_nr_ids
> shrinkers. Every bit in the map corresponds to one shrinker id.
> 
> The next patches will keep a bit set only for memcgs that are
> really charged. This will allow shrink_slab() to improve its
> performance significantly. See the last patch for the numbers.
> 
> ...
>
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -182,6 +182,11 @@ static int prealloc_memcg_shrinker(struct shrinker *shrinker)
>  	if (id < 0)
>  		goto unlock;
>  
> +	if (memcg_expand_shrinker_maps(id)) {
> +		idr_remove(&shrinker_idr, id);
> +		goto unlock;
> +	}
> +
>  	if (id >= shrinker_nr_max)
>  		shrinker_nr_max = id + 1;
>  	shrinker->id = id;

This function ends up being a rather sad little thing.

: static int prealloc_memcg_shrinker(struct shrinker *shrinker)
: {
: 	int id, ret = -ENOMEM;
: 
: 	down_write(&shrinker_rwsem);
: 	id = idr_alloc(&shrinker_idr, shrinker, 0, 0, GFP_KERNEL);
: 	if (id < 0)
: 		goto unlock;
: 
: 	if (memcg_expand_shrinker_maps(id)) {
: 		idr_remove(&shrinker_idr, id);
: 		goto unlock;
: 	}
: 
: 	if (id >= shrinker_nr_max)
: 		shrinker_nr_max = id + 1;
: 	shrinker->id = id;
: 	ret = 0;
: unlock:
: 	up_write(&shrinker_rwsem);
: 	return ret;
: }

- there's no need to call memcg_expand_shrinker_maps() unless id >=
  shrinker_nr_max, so why not move the call under that check and avoid
  calling memcg_expand_shrinker_maps() in most cases?

- why aren't we decreasing shrinker_nr_max in
  unregister_memcg_shrinker()?  That's easy to do, avoids pointless
  work in shrink_slab_memcg() and avoids memory waste in future
  prealloc_memcg_shrinker() calls.

  It should be possible to find the highest ID in an IDR tree with a
  straightforward descent of the underlying radix tree, but I doubt if
  that has been wired up.  Otherwise a simple loop in
  unregister_memcg_shrinker() would be needed.
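
Both suggestions above can be sketched with a small userspace model. This is not kernel code and not the patch author's implementation: the ID table, every model_* function and the expand_calls counter are hypothetical stand-ins for shrinker_idr, prealloc_memcg_shrinker(), unregister_memcg_shrinker() and memcg_expand_shrinker_maps(), and locking is left out. It only illustrates calling the expand path when id >= shrinker_nr_max, and lowering shrinker_nr_max with a simple loop on unregister.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_IDS 64	/* hypothetical capacity of the toy ID table */

static bool id_in_use[MODEL_MAX_IDS];	/* stands in for shrinker_idr */
static int shrinker_nr_max;		/* one past the highest ID in use */
static int expand_calls;		/* how often the expensive path ran */

/* Stand-in for memcg_expand_shrinker_maps(); this model never fails. */
static int model_expand_maps(int new_id)
{
	(void)new_id;
	expand_calls++;
	return 0;
}

/* Hand out the lowest free ID, or -1 when the table is full. */
static int model_alloc_id(void)
{
	for (int id = 0; id < MODEL_MAX_IDS; id++) {
		if (!id_in_use[id]) {
			id_in_use[id] = true;
			return id;
		}
	}
	return -1;
}

/* First suggestion: expand the maps only when the ID raises the bound. */
static int model_prealloc(void)
{
	int id = model_alloc_id();

	if (id < 0)
		return -1;

	if (id >= shrinker_nr_max) {
		if (model_expand_maps(id)) {
			id_in_use[id] = false;	/* mirrors idr_remove() */
			return -1;
		}
		shrinker_nr_max = id + 1;
	}
	return id;
}

/* Second suggestion: lower the bound with a simple loop on unregister. */
static void model_unregister(int id)
{
	assert(id >= 0 && id < shrinker_nr_max && id_in_use[id]);

	id_in_use[id] = false;
	while (shrinker_nr_max > 0 && !id_in_use[shrinker_nr_max - 1])
		shrinker_nr_max--;
}
```

In this model, an ID that is freed and later reused below the current bound never reaches model_expand_maps(), and freeing the highest ID pulls shrinker_nr_max back down to just above the next ID still in use. Whether the real maps would also need to shrink is a separate question this sketch does not address.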