linux-kernel - Re: [PATCH 0/5] mm: reparent slab memory on cgroup removal

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite for Android: free password hash cracker in your pocket

[<prev] [next>] [<thread-prev] [day] [month] [year] [list]

Message-ID: <20190418182714.GD11008@tower.DHCP.thefacebook.com>
Date:   Thu, 18 Apr 2019 18:27:17 +0000
From:   Roman Gushchin <guro@...com>
To:     Vladimir Davydov <vdavydov.dev@...il.com>
CC:     Roman Gushchin <guroan@...il.com>,
        Andrew Morton <akpm@...ux-foundation.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        Kernel Team <Kernel-team@...com>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>,
        Rik van Riel <riel@...riel.com>,
        "david@...morbit.com" <david@...morbit.com>,
        Christoph Lameter <cl@...ux.com>,
        Pekka Enberg <penberg@...nel.org>,
        "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>
Subject: Re: [PATCH 0/5] mm: reparent slab memory on cgroup removal

On Thu, Apr 18, 2019 at 11:15:38AM +0300, Vladimir Davydov wrote:
> Hello Roman,
> 
> On Wed, Apr 17, 2019 at 02:54:29PM -0700, Roman Gushchin wrote:
> > There is however a significant problem with reparenting of slab memory:
> > there is no list of charged pages. Some of them are in shrinker lists,
> > but not all. Introducing of a new list is really not an option.
> 
> True, introducing a list of charged pages would negatively affect
> SL[AU]B performance since we would need to protect it with some kind
> of lock.
> 
> > 
> > But fortunately there is a way forward: every slab page has a stable pointer
> > to the corresponding kmem_cache. So the idea is to reparent kmem_caches
> > instead of slab pages.
> > 
> > It's actually simpler and cheaper, but requires some underlying changes:
> > 1) Make kmem_caches to hold a single reference to the memory cgroup,
> >    instead of a separate reference per every slab page.
> > 2) Stop setting page->mem_cgroup pointer for memcg slab pages and use
> >    page->kmem_cache->memcg indirection instead. It's used only on
> >    slab page release, so it shouldn't be a big issue.
> > 3) Introduce a refcounter for non-root slab caches. It's required to
> >    be able to destroy kmem_caches when they become empty and release
> >    the associated memory cgroup.
> 
> Which means an unconditional atomic inc/dec on charge/uncharge paths
> AFAIU. Note, we have per cpu batching so charging a kmem page in cgroup
> v2 doesn't require an atomic variable modification. I guess you could
> use some sort of per cpu ref counting though.

Yes, looks like I have to switch to the percpu counter (see the thread
with Shakeel).

> 
> Anyway, releasing mem_cgroup objects, but leaving kmem_cache objects
> dangling looks kinda awkward to me. It would be great if we could
> release both, but I assume it's hardly possible due to SL[AU]B
> complexity.

Kmem_caches are *much* smaller than memcgs. If the size of kmem_cache
is smaller than the size of objects which are pinning it, I think it's
acceptable. I hope to release all associated percpu memory early to make
it even smaller.

On the other hand memcgs are much larger than typical object which
are pinning it (dentries and inodes). And it rends to grow with new features
being added.

I agree that releasing both would be cool, but I doubt it's possible.

> 
> What about reusing dead cgroups instead? Yeah, it would be kinda unfair,
> because a fresh cgroup would get a legacy of objects left from previous
> owners, but still, if we delete a cgroup, the workload must be dead and
> so apart from a few long-lived objects, there should mostly be cached
> objects charged to it, which should be easily released on memory
> pressure. Sorry if somebody's asked this question before - I must have
> missed that.

It's an interesting idea. The problem is that the dying cgroup can be
an almost fully functional cgroup for a long time: it can have associated
sockets, pagecache, kernel objects, etc. It's a part of cgroup tree,
all constraints and limits are still applied, it might have some background
activity.

Thanks!