linux-kernel - Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection

Message-ID: <20200227133544.GA20690@blackbody.suse.cz>
Date:   Thu, 27 Feb 2020 14:35:44 +0100
From:   Michal Koutný <mkoutny@...e.com>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Roman Gushchin <guro@...com>, Michal Hocko <mhocko@...e.com>,
        Tejun Heo <tj@...nel.org>, linux-mm@...ck.org,
        cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
        kernel-team@...com
Subject: Re: [PATCH v2 3/3] mm: memcontrol: recursive memory.low protection

TL;DR I see merit in the recursive propagation if it's requested
explicitly (i.e. retaining the meaning of 0). The protection/weight
semantics should be refined.

On Wed, Feb 26, 2020 at 10:05:48AM -0500, Johannes Weiner <hannes@...xchg.org> wrote:
> They still ultimately translate to real resources. The concrete value
> depends on what the parent's weight translates to, and it depends on
> sibling configurations and their current consumption. (All of this is
> already true for memory protection as well, btw). But eventually, a
> weight specification translates to actual time on a CPU, bandwidth on
> an IO device etc.
> 
> > - sum of sibling weights is meaningless (and independent from parent
> >   weight)
> 
> Technically true for overcommitted memory.low values as well.
Yes, but for overcommitted values only. For pure weights it doesn't
matter whether you set 1:10, 10:100 or 100:1000. Protection, however,
behaves like that only as the values approach infinity; otherwise it
behaves differently depending on how the sum compares to the parent's
value.
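
To make the distinction concrete, here is a minimal Python sketch. It is
a simplified model, not the kernel's calculation: weight_shares() and
effective_low() are hypothetical helpers, the parent value of 110 is
arbitrary, and every child is assumed to use at least as much memory as
its memory.low.

```python
from fractions import Fraction

def weight_shares(weights):
    """Pure weights: only the ratio between siblings matters."""
    total = sum(weights)
    return [Fraction(w, total) for w in weights]

def effective_low(parent_low, child_lows):
    """Simplified protection model (an assumption, not kernel code).

    Children keep their configured value while the sum fits into the
    parent's value; once overcommitted, the parent's value is split
    in proportion to the configured values.
    """
    total = sum(child_lows)
    if total <= parent_low:
        return [Fraction(low) for low in child_lows]
    return [Fraction(parent_low * low, total) for low in child_lows]

# Scaling weights leaves the shares untouched.
print(weight_shares([1, 10]) == weight_shares([100, 1000]))

# Scaling protection under a parent value of 110 changes the result
# until the sum overcommits the parent.
for lows in ([1, 10], [10, 100], [100, 1000]):
    print(lows, [str(x) for x in effective_low(110, lows)])
```

In this model 1:10 and 10:100 are honoured as absolute amounts, and only
the overcommitted 100:1000 is scaled back down to 10 and 100, i.e. it is
only there that protection acts like a weight.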

[If there had to be some pure memory weights, they would, for instance,
express the relative affinity of a group's pages to physical memory.]

> I don't see a fundamental difference between them. And that in turn
> makes it hard for me to accept that hierarchical inheritance rules
> should be different.
I'll try coming up with some better examples for the difference that I
perceive.

> "Wrong" isn't the right term. Is it what you wanted to express in your
> configuration?
I want to express an absolute amount of memory (ideally representing
the workingset size) under protection.

IIUC, you want to express general relative priorities of B vs C when
some outer metric has to be maintained given you reach both limits of
memory and IO.

> You are talking about a mathematical truth on a per-controller
> basis. What I'm saying is that I don't see how this is useful for real
> workloads, their relative priorities, and the performance expectations
> users have from these priorities.
 
> With a priority inversion like this, there is no actual performance
> isolation or containerization going on here - which is the whole point
> of cgroups and resource control.
I acknowledge that by pressing too much along one dimension (memory) you
induce expansion in another dimension (IO) and that may become
noticeable in siblings (expansion over saturation [1]). But that's
expected when only weights are in use. If you wanted to hide the effect
of workload B from C, B would need a real limit.

[I beg to disagree that containerization is the whole point of cgroups.
It's a large part of it, hence the isolation needn't necessarily be
bi-directional.]

> My objection is to opting out of protection against cousins (thus
> overriding parental resource assignment), not against siblings.
Just to sync up the terminology - I'm calling this protection against
uncles (the composition/structure under them is irrelevant).
And the limitation comes from grandparent or higher (or global).

...and the overridden parental resource assignment is the expansion on
the non-memory dimension (IO/CPU).

> Correct, but you can change the tree to this:
> 
>      A.low=10G
>      `- A1.low=10G
>         `- B.low=0G
>         `- C.low=0G
>      `- D.low=0G
> 
> to express
> 
> A1 > D
>  B = C
That sort of works (if I give up the scapegoat). It bothers me, though,
that I have to copy the value from A to A1; I could have done that with
the previous hierarchy and simply set B.low=C.low=10G.
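
As a rough illustration of how the quoted tree could play out under
recursive propagation, here is a minimal Python sketch. It rests on
assumptions that are not taken from the patch text: protection a parent
holds but its children do not claim explicitly is split in proportion to
each child's usage above its own claim, the children's claims do not
overcommit the parent, and all usage figures (in G) are invented.
recursive_low() is a hypothetical helper, not a kernel function.

```python
from fractions import Fraction

def recursive_low(parent_eff, children):
    """children maps name -> (memory.low, usage); returns effective protection.

    Assumed rule: protection that the children do not claim explicitly
    is split in proportion to each child's usage above its own claim.
    Overcommitted claims are not handled in this sketch.
    """
    claimed = {n: min(low, usage) for n, (low, usage) in children.items()}
    unclaimed = max(parent_eff - sum(claimed.values()), 0)
    excess = {n: usage - claimed[n] for n, (_, usage) in children.items()}
    total_excess = sum(excess.values())
    result = {}
    for name in children:
        extra = Fraction(unclaimed * excess[name], total_excess) if total_excess else 0
        # A group cannot be protected beyond what it actually uses.
        result[name] = min(claimed[name] + extra, children[name][1])
    return result

# A.low=10 with children A1 (low=10) and D (low=0); usage is invented.
level1 = recursive_low(10, {"A1": (10, 12), "D": (0, 6)})
# A1's effective protection is then shared by B and C (both low=0).
level2 = recursive_low(level1["A1"], {"B": (0, 8), "C": (0, 4)})
print(level1)
print(level2)
```

Under these assumptions A1 keeps all 10G against D, while B and C, with
no values of their own, split those 10G by usage (20/3 and 10/3).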

> That is, I would like to see an argument for this setup:
> 
>      A				
>      `- B		io.weight=200          memory.low=10G
>         `- D		io.weight=100 (e.g.)   memory.low=10G
>         `- E		io.weight=100 (e.g.)   memory.low=0
>      `- C		io.weight=50           memory.low=5G
> 
> Where E has no memory protection against C, but E has IO priority over
> C. That's the configuration that cannot be expressed with a recursive
> memory.low, but since it involves priority inversions it's not useful
> to actually isolate and containerize workloads.
But there may be no cousin (uncle) at all, or more precisely, it's the
global rest that we don't mind affecting.
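
For reference, the numbers in the quoted tree can be worked through with
a minimal Python sketch. It assumes a plain proportional model of
io.weight in which every group is continuously backlogged, which is a
simplification of what an IO controller does; TREE and io_share() are
illustrative names, and the D/E weights are the "e.g." values from the
quote.

```python
from fractions import Fraction

# name: (parent, io.weight, memory.low in G), values from the quoted tree
TREE = {
    "B": ("A", 200, 10),
    "C": ("A", 50, 5),
    "D": ("B", 100, 10),
    "E": ("B", 100, 0),
}

def io_share(name):
    """Fraction of A's IO a group gets when all siblings are saturating."""
    if name == "A":
        return Fraction(1)
    parent, weight, _ = TREE[name]
    # Sum of weights across the group and its siblings.
    siblings = sum(w for p, w, _ in TREE.values() if p == parent)
    return io_share(parent) * Fraction(weight, siblings)

for name in ("C", "E"):
    print(name, "IO share:", io_share(name), "memory.low:", TREE[name][2])
```

In this model E ends up with 2/5 of A's IO against 1/5 for C, while
having memory.low=0 against C's 5G, which is the combination discussed
above.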

> > I'd say that protected memory is a disposable resource in contrast with
> > CPU/IO. If you don't have latter, you don't progress; if you lack the
> > former, you are refaulting but can make progress. Even more, you should
> > be able to give up memory.min.
> 
> Eh, I'm not buying that. You cannot run without memory either. If
> somebody reclaims a page between you faulting it in and you resuming
> to userspace, there is no forward progress.
I made a hasty argument (misinterpreting the constant outer reclaim
pressure). So that wasn't the fundamental difference.

The second part -- memory.min is subject to the same calculation as
memory.low. Do you find the scapegoat preventing OOM in the grandparent
(or higher) subtree also a misfeature/artifact?

Thanks,
Michal

[1] Here I'm taking your/Tejun's assumption that in the stressful
situations it always boils down to IO, although I don't have any
quantitative arguments for that.
