linux-kernel - Re: [PATCH v2] mm: memcg/slab: fix memory leak at non-root kmem

Message-ID: <20200715175432.GA6314@carbon.lan>
Date:   Wed, 15 Jul 2020 10:54:32 -0700
From:   Roman Gushchin <guro@...com>
To:     Muchun Song <songmuchun@...edance.com>
CC:     <vbabka@...e.cz>, <cl@...ux.com>, <penberg@...nel.org>,
        <rientjes@...gle.com>, <iamjoonsoo.kim@....com>,
        <akpm@...ux-foundation.org>, <shakeelb@...gle.com>,
        <linux-mm@...ck.org>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v2] mm: memcg/slab: fix memory leak at non-root
 kmem_cache destroy

On Thu, Jul 16, 2020 at 12:50:22AM +0800, Muchun Song wrote:
> If the kmem_cache refcount is greater than one, we should not
> mark the root kmem_cache as dying. If we mark the root kmem_cache
> as dying incorrectly, the non-root kmem_cache can never be destroyed.
> This results in a memory leak when the memcg is destroyed. The
> following steps reproduce the problem.
> 
>   1) Use kmem_cache_create() to create a new kmem_cache named A.
>   2) Coincidentally, kmem_cache A is an alias for kmem_cache B,
>      so only the refcount of B is increased.
>   3) Use kmem_cache_destroy() to destroy kmem_cache A. This only
>      decreases B's refcount, but B is still marked as dying.
>   4) Create a new memory cgroup and allocate memory from kmem_cache
>      B. This creates a non-root kmem_cache for the allocation.
>   5) When the memory cgroup created in step 4) is destroyed, the
>      non-root kmem_cache can never be destroyed.
> 
> Repeating steps 4) and 5) leaks a lot of memory. So mark the root
> kmem_cache as dying only when its refcount reaches zero.
> 
> Fixes: 92ee383f6daa ("mm: fix race between kmem_cache destroy, create and deactivate")
> Signed-off-by: Muchun Song <songmuchun@...edance.com>
> Reviewed-by: Shakeel Butt <shakeelb@...gle.com>
> ---
> 
> changelog in v2:
>  1) Fix a confusing typo in the commit log.

Ok, now I see the problem. Thank you for fixing the commit log!
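
To make the sequence from the commit log concrete, here is a minimal
user-space sketch. It is not kernel code: struct toy_cache and every toy_*
helper are hypothetical names, and the rule that a non-root cache cannot be
destroyed while its root is marked dying is taken from the commit log as an
assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a root kmem_cache; not the kernel structure. */
struct toy_cache {
	int refcount;	/* users sharing this (possibly merged) cache */
	bool dying;	/* models memcg_params.dying */
};

/* Pre-patch ordering: mark dying first, then drop the reference. */
static void toy_destroy_old(struct toy_cache *s)
{
	s->dying = true;
	s->refcount--;
}

/* Patched ordering: mark dying only when the last reference goes away. */
static void toy_destroy_new(struct toy_cache *s)
{
	s->refcount--;
	if (s->refcount == 0)
		s->dying = true;
}

/*
 * Assumption from the commit log: a non-root cache cannot be
 * destroyed while its root cache is marked dying.
 */
static bool toy_child_destroyable(const struct toy_cache *root)
{
	return !root->dying;
}

/* Walk through steps 1) to 5); returns true if the child cache leaks. */
static bool toy_leaks(void (*destroy)(struct toy_cache *))
{
	struct toy_cache b = { .refcount = 1, .dying = false };

	b.refcount++;	/* steps 1-2: A is merged into B */
	destroy(&b);	/* step 3: destroying A only drops one reference */

	/* steps 4-5: the memcg's child of B leaks if it cannot be destroyed */
	return !toy_child_destroyable(&b);
}
```

Calling toy_leaks(toy_destroy_old) models the leak, while
toy_leaks(toy_destroy_new) models the behaviour this patch is aiming for.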

>  2) Remove flush_memcg_workqueue() for !CONFIG_MEMCG_KMEM.
>  3) Introduce a new helper memcg_set_kmem_cache_dying() to fix a race
>     condition between flush_memcg_workqueue() and slab_unmergeable(). 
> 
>  mm/slab_common.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 47 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/slab_common.c b/mm/slab_common.c
> index 8c1ffbf7de45..c4958116e3fd 100644
> --- a/mm/slab_common.c
> +++ b/mm/slab_common.c
> @@ -258,6 +258,11 @@ static void memcg_unlink_cache(struct kmem_cache *s)
>  		list_del(&s->memcg_params.kmem_caches_node);
>  	}
>  }
> +
> +static inline bool memcg_kmem_cache_dying(struct kmem_cache *s)
> +{
> +	return is_root_cache(s) && s->memcg_params.dying;
> +}
>  #else
>  static inline int init_memcg_params(struct kmem_cache *s,
>  				    struct kmem_cache *root_cache)
> @@ -272,6 +277,11 @@ static inline void destroy_memcg_params(struct kmem_cache *s)
>  static inline void memcg_unlink_cache(struct kmem_cache *s)
>  {
>  }
> +
> +static inline bool memcg_kmem_cache_dying(struct kmem_cache *s)
> +{
> +	return false;
> +}
>  #endif /* CONFIG_MEMCG_KMEM */
>  
>  /*
> @@ -326,6 +336,13 @@ int slab_unmergeable(struct kmem_cache *s)
>  	if (s->refcount < 0)
>  		return 1;
>  
> +	/*
> +	 * If the kmem_cache is dying. We should also skip this
> +	 * kmem_cache.
> +	 */
> +	if (memcg_kmem_cache_dying(s))
> +		return 1;
> +
>  	return 0;
>  }
>  
> @@ -886,12 +903,15 @@ static int shutdown_memcg_caches(struct kmem_cache *s)
>  	return 0;
>  }
>  
> -static void flush_memcg_workqueue(struct kmem_cache *s)
> +static void memcg_set_kmem_cache_dying(struct kmem_cache *s)
>  {
>  	spin_lock_irq(&memcg_kmem_wq_lock);
>  	s->memcg_params.dying = true;
>  	spin_unlock_irq(&memcg_kmem_wq_lock);
> +}
>  
> +static void flush_memcg_workqueue(struct kmem_cache *s)
> +{
>  	/*
>  	 * SLAB and SLUB deactivate the kmem_caches through call_rcu. Make
>  	 * sure all registered rcu callbacks have been invoked.
> @@ -923,10 +943,6 @@ static inline int shutdown_memcg_caches(struct kmem_cache *s)
>  {
>  	return 0;
>  }
> -
> -static inline void flush_memcg_workqueue(struct kmem_cache *s)
> -{
> -}
>  #endif /* CONFIG_MEMCG_KMEM */
>  
>  void slab_kmem_cache_release(struct kmem_cache *s)
> @@ -944,8 +960,6 @@ void kmem_cache_destroy(struct kmem_cache *s)
>  	if (unlikely(!s))
>  		return;
>  
> -	flush_memcg_workqueue(s);
> -
>  	get_online_cpus();
>  	get_online_mems();
>  
> @@ -955,6 +969,32 @@ void kmem_cache_destroy(struct kmem_cache *s)
>  	if (s->refcount)
>  		goto out_unlock;
>  
> +#ifdef CONFIG_MEMCG_KMEM
> +	memcg_set_kmem_cache_dying(s);
> +
> +	mutex_unlock(&slab_mutex);

Hm, but in theory s->refcount can be increased here, once slab_mutex is dropped?
So it doesn't solve the problem completely, but makes it less probable, right?

I wonder if it's possible to (additionally) protect s->refcount with
memcg_kmem_wq_lock, so that we can check it in the context of flush_memcg_workqueue()?
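
A rough user-space sketch of that idea, purely for illustration: a pthread
mutex stands in for memcg_kmem_wq_lock (a spinlock in the kernel), and
struct locked_cache and both helpers are hypothetical names. The point is
only that the refcount check and the dying flag are updated under the same
lock, so "marked dying" and "took a new reference" cannot both succeed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model; the lock stands in for memcg_kmem_wq_lock. */
struct locked_cache {
	pthread_mutex_t lock;
	int refcount;
	bool dying;
};

/* Destroy side: commit to dying only if nobody holds a reference. */
static bool locked_cache_try_mark_dying(struct locked_cache *s)
{
	bool marked = false;

	pthread_mutex_lock(&s->lock);
	if (s->refcount == 0) {
		s->dying = true;
		marked = true;
	}
	pthread_mutex_unlock(&s->lock);
	return marked;
}

/* Merge side: take a reference unless the cache is already dying. */
static bool locked_cache_try_get(struct locked_cache *s)
{
	bool got = false;

	pthread_mutex_lock(&s->lock);
	if (!s->dying) {
		s->refcount++;
		got = true;
	}
	pthread_mutex_unlock(&s->lock);
	return got;
}
```

Whether the real refcount updates can all be moved under that lock is
exactly the open question; the sketch does not answer it.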

> +
> +	put_online_mems();
> +	put_online_cpus();
> +
> +	flush_memcg_workqueue(s);
> +
> +	get_online_cpus();
> +	get_online_mems();
> +
> +	mutex_lock(&slab_mutex);
> +
> +	if (WARN(s->refcount,
> +		 "kmem_cache_destroy %s: Slab cache is still referenced\n",
> +		 s->name)) {
> +		/*
> +		 * Reset the dying flag setted by memcg_set_kmem_cache_dying().
> +		 */
> +		s->memcg_params.dying = false;
> +		goto out_unlock;
> +	}
> +#endif
> +
>  	err = shutdown_memcg_caches(s);
>  	if (!err)
>  		err = shutdown_cache(s);
> -- 
> 2.11.0
> 

Other than the problem above, your patch looks really good to me. However, we should
be really careful here, as in theory it should be backported to a large number
of old stable kernels. And because the problem is (hopefully) fixed in 5.9, it's a
backport-only patch.

So I wonder if we can mitigate the problem by disabling cache sharing for some
specific kmem_caches instead? For example, for all caches with SLAB_ACCOUNT, or maybe
for all except a hard-coded list (if kmem accounting is enabled). Do you mind sharing
any details on how this problem manifests in real life?
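
As a sketch of the first variant only (never merge accounted caches), in
user-space C with hypothetical names: TOY_SLAB_ACCOUNT is an illustrative
flag value rather than the kernel's SLAB_ACCOUNT, and struct toy_desc and
toy_unmergeable() are made up for this example. The refcount check mirrors
the one already visible in slab_unmergeable() in the diff above.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bit; NOT the kernel's SLAB_ACCOUNT value. */
#define TOY_SLAB_ACCOUNT 0x1u

/* Hypothetical, minimal description of a cache for the merge decision. */
struct toy_desc {
	unsigned int flags;
	int refcount;
};

/* Returns true if the cache must not be shared with other caches. */
static bool toy_unmergeable(const struct toy_desc *s, bool kmem_accounting)
{
	/* Same rule as the existing check: negative refcount opts out. */
	if (s->refcount < 0)
		return true;

	/* Proposed mitigation: accounted caches are never merged. */
	if (kmem_accounting && (s->flags & TOY_SLAB_ACCOUNT))
		return true;

	return false;
}
```

With merging disabled for such caches, destroying one of them could no
longer leave a still-referenced root cache marked as dying.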

Thanks!

P.S. I'm away from the keyboard for the rest of today; I will think more and hopefully
come back with some ideas tomorrow.