linux-kernel - Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the parent cgroup

Message-ID: <20200811193228.GC1507044@carbon.DHCP.thefacebook.com>
Date:   Tue, 11 Aug 2020 12:32:28 -0700
From:   Roman Gushchin <guro@...com>
To:     Michal Koutný <mkoutny@...e.com>
CC:     Andrew Morton <akpm@...ux-foundation.org>,
        Dennis Zhou <dennis@...nel.org>, Tejun Heo <tj@...nel.org>,
        Christoph Lameter <cl@...ux.com>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>,
        Shakeel Butt <shakeelb@...gle.com>, <linux-mm@...ck.org>,
        <kernel-team@...com>, <linux-kernel@...r.kernel.org>
Subject: Re: [PATCH v3 4/5] mm: memcg: charge memcg percpu memory to the
 parent cgroup

On Tue, Aug 11, 2020 at 08:32:25PM +0200, Michal Koutny wrote:
> On Tue, Aug 11, 2020 at 09:55:27AM -0700, Roman Gushchin <guro@...com> wrote:
> > As I said, there are 2 problems with charging systemd (or a similar daemon):
> > 1) It often belongs to the root cgroup.
> This doesn't hold for systemd (if we agree that systemd is the most
> common case).

Ok, it's better.

> 
> > 2) OOMing or failing some random memory allocations is a bad way
> >    to "communicate" a memory shortage to the daemon.
> >    What we really want is to prevent creating a huge number of cgroups
> There's cgroup.max.descendants for that...

cgroup.max.descendants limits the number of live cgroups; it can't limit
the number of dying cgroups.
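
To illustrate the distinction, here is a minimal sketch, assuming a cgroup v2
hierarchy mounted at /sys/fs/cgroup. The cgroup.stat file reports live and
dying descendants separately (nr_descendants and nr_dying_descendants); the
helper names parse_cgroup_stat() and descendant_counts() are made up for this
example.

```python
from pathlib import Path


def parse_cgroup_stat(text):
    """Parse the "key value" lines of a cgroup v2 cgroup.stat file."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Keep only well-formed "name number" pairs.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def descendant_counts(cgroup_dir="/sys/fs/cgroup"):
    """Return (live, dying) descendant counts for one cgroup directory."""
    stats = parse_cgroup_stat(Path(cgroup_dir, "cgroup.stat").read_text())
    # Only the live count is bounded by cgroup.max.descendants.
    return stats.get("nr_descendants", 0), stats.get("nr_dying_descendants", 0)
```

Calling descendant_counts("/sys/fs/cgroup/system.slice") on a live system
would return both counters for that subtree.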

> 
> >    (including dying cgroups) in some specific sub-tree(s).
> ...oh, so is this limiting the number of cgroups or limiting resources
> used?

My scenario is simple: there is a large machine which has been under no
memory pressure for some time (e.g. it is idle or running a workload with
a small working set). Periodically running services create a lot of cgroups,
usually in system.slice. After some time a significant part of the whole
memory is consumed by dying cgroups and their percpu data.
Getting rid of them and reclaiming all of that memory is not always possible
(percpu memory gets fragmented relatively easily) and is time-consuming.

If we set memory.high on system.slice, it will create artificial
memory pressure once we get close to the limit. It will trigger
the reclaim of user pages and slab objects, so eventually we'll be able
to release dying cgroups as well.
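
A minimal sketch of setting such a limit, assuming cgroup v2 mounted at
/sys/fs/cgroup, the memory controller enabled for system.slice, and
sufficient privileges. The helper names and the 8 GiB value in the usage
note are hypothetical.

```python
from pathlib import Path


def format_memory_high(limit_bytes):
    """Return the string to write to memory.high: a byte count or "max"."""
    if limit_bytes is None:
        return "max"  # "max" means no throttling limit
    if limit_bytes < 0:
        raise ValueError("limit must be non-negative")
    return str(int(limit_bytes))


def set_memory_high(cgroup_dir, limit_bytes):
    """Write the memory.high limit of one cgroup directory."""
    value = format_memory_high(limit_bytes)
    Path(cgroup_dir, "memory.high").write_text(value + "\n")
    return value
```

For example, set_memory_high("/sys/fs/cgroup/system.slice", 8 << 30) would
ask the kernel to start throttling and reclaiming once system.slice exceeds
8 GiB.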

You might say that it would work even without charging memcg internal
structures. The problem is that a small slab object can indirectly pin
a lot of (percpu) memory. If we don't take the indirectly pinned memory
into account, we likely won't apply enough memory pressure.
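
A back-of-the-envelope sketch of that effect. All numbers are hypothetical
and only meant to show the order of magnitude; the one assumption taken from
how percpu allocations work is that each allocation is replicated for every
possible CPU.

```python
def pinned_percpu_bytes(dying_cgroups, percpu_bytes_per_cpu, possible_cpus):
    """Estimate the percpu memory held by dying cgroups.

    percpu_bytes_per_cpu is the (hypothetical) percpu footprint of one
    cgroup on one CPU; it is multiplied by the number of possible CPUs.
    """
    return dying_cgroups * percpu_bytes_per_cpu * possible_cpus


def pinning_ratio(slab_object_bytes, percpu_bytes_per_cpu, possible_cpus):
    """How many bytes of percpu memory one byte of slab keeps pinned."""
    return (percpu_bytes_per_cpu * possible_cpus) / slab_object_bytes
```

With made-up inputs of 1000 dying cgroups, 1024 bytes of percpu data per
cgroup per CPU and 64 possible CPUs, pinned_percpu_bytes() gives 65536000
bytes, and a single 256-byte slab object holding one such cgroup alive pins
256 times its own size.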

If we limit init.slice (where systemd seems to reside), as you suggest,
we'll eventually create thrashing in init.slice, followed by OOM.
I struggle to see how that makes the life of a user better.

> 
> >    OOMing the daemon or returning -ENOMEM to some random syscalls
> >    will not help us to reach the goal and likely will bring a bad
> >    experience to a user.
> If we reach the situation when memory for cgroup operations is tight,
> it'll disappoint the user either way.
> My premise is that a running workload is more valuable than the
> accompanying manager.

The problem is that OOM-killing the accompanying manager won't release
resources and help to get rid of accumulated cgroups. So in the very
best case it will prevent new cgroups from being created (as well
as some other random operations from being performed). Most likely
the only way to "fix" this for a user will be to reboot the machine.

> 
> > In a generic case I don't see how we can charge the cgroup which
> > creates cgroups without solving these problems first.
> In my understanding, "onbehalveness" is a concept useful for various
> kernel threads doing deferred work. Here it's promoted to user processes
> managing cgroups.
> 
> > And if there is a very special case where we have to limit it,
> > we can just add an additional layer:
> > 
> > ` root or delegated root
> >    ` manager-parent-cgroup-with-a-limit
> >      ` manager-cgroup (systemd, docker, ...)
> >    ` [aggregation group(s)]
> >      ` job-group-1
> >      ` ...
> >      ` job-group-n
> If the charge goes to the parent of created cgroup (job-cgroup-i here),
> then the layer adds nothing. Am I missing something?

Sorry, I was wrong here, please ignore this part.

> 
> > I'd definitely charge the parent cgroup in all similar cases.
> (This would mandate the controllers on the unified hierarchy, which is
> fine IMO.) Then the order of enabling controllers on a subtree (e.g.
> cpu,memory vs memory,cpu) by the manager would yield different charging.
> This seems wrong^W confusing to me.

I agree it's confusing.

Thanks!