linux-kernel - Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <20200204011505.GA9138@xps.dhcp.thefacebook.com>
Date:   Mon, 3 Feb 2020 17:15:05 -0800
From:   Roman Gushchin <guro@...com>
To:     Johannes Weiner <hannes@...xchg.org>
CC:     <linux-mm@...ck.org>, Andrew Morton <akpm@...ux-foundation.org>,
        Michal Hocko <mhocko@...nel.org>,
        Shakeel Butt <shakeelb@...gle.com>,
        Vladimir Davydov <vdavydov.dev@...il.com>,
        <linux-kernel@...r.kernel.org>, <kernel-team@...com>,
        Bharata B Rao <bharata@...ux.ibm.com>,
        Yafang Shao <laoar.shao@...il.com>
Subject: Re: [PATCH v2 21/28] mm: memcg/slab: use a single set of kmem_caches
 for all memory cgroups

On Mon, Feb 03, 2020 at 05:17:34PM -0500, Johannes Weiner wrote:
> On Mon, Feb 03, 2020 at 12:58:34PM -0800, Roman Gushchin wrote:
> > On Mon, Feb 03, 2020 at 02:50:48PM -0500, Johannes Weiner wrote:
> > > On Mon, Jan 27, 2020 at 09:34:46AM -0800, Roman Gushchin wrote:
> > > > This is fairly big but mostly red patch, which makes all non-root
> > > > slab allocations use a single set of kmem_caches instead of
> > > > creating a separate set for each memory cgroup.
> > > > 
> > > > Because the number of non-root kmem_caches is now capped by the number
> > > > of root kmem_caches, there is no need to shrink or destroy them
> > > > prematurely. They can be perfectly destroyed together with their
> > > > root counterparts. This allows to dramatically simplify the
> > > > management of non-root kmem_caches and delete a ton of code.
> > > 
> > > This is definitely going in the right direction. But it doesn't quite
> > > explain why we still need two sets of kmem_caches?
> > > 
> > > In the old scheme, we had completely separate per-cgroup caches with
> > > separate slab pages. If a cgrouped process wanted to allocate a slab
> > > object, we'd go to the root cache and used the cgroup id to look up
> > > the right cgroup cache. On slab free we'd use page->slab_cache.
> > > 
> > > Now we have slab pages that have a page->objcg array. Why can't all
> > > allocations go through a single set of kmem caches? If an allocation
> > > is coming from a cgroup and the slab page the allocator wants to use
> > > doesn't have an objcg array yet, we can allocate it on the fly, no?
> > 
> > Well, arguably it can be done, but there are few drawbacks:
> > 
> > 1) On the release path you'll need to make some extra work even for
> >    root allocations: calculate the offset only to find the NULL objcg pointer.
> > 
> > 2) There will be a memory overhead for root allocations
> >    (which might or might not be compensated by the increase
> >    of the slab utilization).
> 
> Those two are only true if there is a wild mix of root and cgroup
> allocations inside the same slab, and that doesn't really happen in
> practice. Either the machine is dedicated to one workload and cgroups
> are only enabled due to e.g. a vendor kernel, or you have cgrouped
> systems (like most distro systems now) that cgroup everything.

It's actually a questionable statement: we do skip allocations from certain
contexts, and we do merge slab caches.

Most likely it's true for certain slab_caches and not true for others.
Think of kmalloc-* caches.

Also, because obj_cgroup vectors will not be freed without underlying pages,
most likely the percentage of pages with obj_cgroups will grow with uptime.
In other words, memcg allocations will fragment root slab pages.

> 
> > 3) I'm working on percpu memory accounting that resembles the same scheme,
> >    except that obj_cgroups vector is created for the whole percpu block.
> >    There will be root- and memcg-blocks, and it will be expensive to merge them.
> >    I kinda like using the same scheme here and there.
> 
> It's hard to conclude anything based on this information alone. If
> it's truly expensive to merge them, then it warrants the additional
> complexity. But I don't understand the desire to share a design for
> two systems with sufficiently different constraints.
> 
> > Upsides?
> > 
> > 1) slab utilization might increase a little bit (but I doubt it will have
> >    a huge effect, because both merging sets should be relatively big and well
> >    utilized)
> 
> Right.
> 
> > 2) eliminate memcg kmem_cache dynamic creation/destruction. it's nice,
> >    but there isn't so much code left anyway.
> 
> There is a lot of complexity associated with the cache cloning that
> isn't the lines of code, but the lifetime and synchronization rules.

Quite opposite: the patchset removes all the complexity (or 90% of it),
because it makes the kmem_cache lifetime independent from any cgroup stuff.

Kmem_caches are created on demand on the first request (most likely during
the system start-up), and destroyed together with their root counterparts
(most likely never or on rmmod). First request means globally first request,
not a first request from a given memcg.

Memcg kmem_cache lifecycle has nothing to do with memory/obj_cgroups, and
after creation just matches the lifetime of the root kmem caches.

The only reason to keep the async creation is that some kmem_caches
are created very early in the boot process, long before any cgroup
stuff is initialized.

> 
> And these two things are the primary aspects that make my head hurt
> trying to review this patch series.
> 
> > So IMO it's an interesting direction to explore, but not something
> > that necessarily has to be done in the context of this patchset.
> 
> I disagree. Instead of replacing the old coherent model and its
> complexities with a new coherent one, you are mixing the two. And I
> can barely understand the end result.
> 
> Dynamically cloning entire slab caches for the sole purpose of telling
> whether the pages have an obj_cgroup array or not is *completely
> insane*. If the controller had followed the obj_cgroup design from the
> start, nobody would have ever thought about doing it like this.

It's just not true. The whole point of having root- and memcg sets is
to be able to not look for a NULL pointer in the obj_cgroup vector on
releasing of the root object. In other words, it allows to keep zero
overhead for root allocations. IMHO it's an important thing, and calling
it *completely insane* isn't the best way to communicate.

> 
> From a maintainability POV, we cannot afford merging it in this form.

It sounds strange: the patchset eliminates 90% of the complexity,
but it's unmergeable because there are 10% left.

I agree that it's an arguable question if we can tolerate some
additional overhead on root allocations to eliminate these additional
10%, but I really don't think it's so obvious that even discussing
it is insane.

Btw, there is another good idea to explore (also suggested by Christopher
Lameter): we can put memcg/objcg pointer into the slab page, avoiding
an extra allocation.