linux-kernel - Re: [PATCH 2/2] mm: Consider subtrees in memory.events


Message-ID: <20190124010306.GA9055@chrisdown.name>
Date:   Wed, 23 Jan 2019 20:03:06 -0500
From:   Chris Down <chris@...isdown.name>
To:     Roman Gushchin <guro@...com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>,
        Johannes Weiner <hannes@...xchg.org>,
        Michal Hocko <mhocko@...nel.org>, Tejun Heo <tj@...nel.org>,
        Dennis Zhou <dennis@...nel.org>,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "cgroups@...r.kernel.org" <cgroups@...r.kernel.org>,
        "linux-mm@...ck.org" <linux-mm@...ck.org>,
        Kernel Team <Kernel-team@...com>
Subject: Re: [PATCH 2/2] mm: Consider subtrees in memory.events

Roman Gushchin writes:
>On Wed, Jan 23, 2019 at 05:31:44PM -0500, Chris Down wrote:
>> memory.stat and other files already consider subtrees in their output,
>> and we should too in order to not present an inconsistent interface.
>>
>> The current situation is fairly confusing, because people interacting
>> with cgroups expect hierarchical behaviour in the vein of memory.stat,
>> cgroup.events, and other files. For example, this causes confusion when
>> debugging reclaim events under low, as currently these always read "0"
>> at non-leaf memcg nodes, which frequently causes people to misdiagnose
>> breach behaviour. The same confusion applies to other counters in this
>> file when debugging issues.
>>
>> Aggregation is done at write time instead of at read-time since these
>> counters aren't hot (unlike memory.stat which is per-page, so it does it
>> at read time), and it makes sense to bundle this with the file
>> notifications.
>
>I agree with the consistency argument (matching cgroup.events, ...),
>and it definitely looks better for oom* events, but at the same time it feels
>like an API break.
>
>Just for example, let's say you have a delegated sub-tree with memory.max
>set. Earlier, getting memory.high/max event meant that the whole sub-tree
>is tight on memory, and, for example, led to shutdown of some parts of the tree.
>After your change, it might mean that some sub-cgroup has reached its limit,
>and probably doesn't matter on the top level.

Yeah, this is something I was thinking about while writing it. I think there's 
an argument to be made either way, since functionally they can both represent 
the same feature set, just in different ways.

In the subtree-propagated version you can find the level of the hierarchy that 
the event fired at by checking parent events vs. their subtrees' events, and 
this also allows trivially setting up event watches per-subtree.
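
As a rough userspace illustration of that comparison (a sketch only, not part of the patch): assuming the propagated counters are simple sums over the subtree, the events that fired at a given level are that level's counters minus those of its direct children. The function names and the sample values below are hypothetical.

```python
def parse_events(text):
    """Parse flat-keyed 'name value' lines, the format memory.events uses."""
    counters = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, value = line.split()
        counters[key] = int(value)
    return counters


def local_events(parent, children):
    """Estimate the events that fired at the parent's own level.

    parent:   parsed, subtree-inclusive counters of the parent
    children: parsed, subtree-inclusive counters of its direct children
    """
    return {
        key: total - sum(child.get(key, 0) for child in children)
        for key, total in parent.items()
    }


# Hypothetical sample contents, not taken from a real system.
parent = parse_events("low 0\nhigh 5\nmax 3\noom 1\noom_kill 1\n")
child_a = parse_events("low 0\nhigh 2\nmax 3\noom 0\noom_kill 0\n")
child_b = parse_events("low 0\nhigh 1\nmax 0\noom 1\noom_kill 1\n")

print(local_events(parent, [child_a, child_b]))
# {'low': 0, 'high': 2, 'max': 0, 'oom': 0, 'oom_kill': 0}
```

The reads of the parent and child files are not atomic, so small transient discrepancies should be treated as noise rather than as events at the parent's level.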

In the previous, non-propagated version, it's easier to work out the 
level, as the event only appears in that cgroup's memory.events file, but it's 
harder to actually find out about the existence of such an event because you 
need to keep a watch for each individual cgroup in the subtree at all times.
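
To make that cost concrete, here is a minimal sketch of what a monitor would have to do in the non-propagated model: enumerate every memory.events file in the subtree and watch each one. The helper name is hypothetical, and the actual watch mechanism is left out.

```python
import os


def events_files(subtree_root):
    """Yield the path of every memory.events file at or below subtree_root."""
    for dirpath, _dirnames, filenames in os.walk(subtree_root):
        if "memory.events" in filenames:
            yield os.path.join(dirpath, "memory.events")
```

The scan also has to be repeated whenever cgroups are created or removed in the subtree, whereas with propagated counters a single watch on the subtree root's file is enough to learn that something happened.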

So I think there's a reasonable argument to be made in favour of considering 
subtrees:

1. I'm not aware of anyone major currently relying on a cgroup's memory.events 
reflecting only the events raised at that exact level.
2. Also, being able to detect the level at which an event happened can be 
achieved in both versions by comparing event counters.
3. Having memory.events work like cgroup.events and others seems to fit with 
the principle of least astonishment.

That said, I agree that there's a tradeoff here, but in my experience this 
behaviour more closely resembles user intuition and better matches the overall 
semantics around hierarchical behaviour we've generally established for cgroup 
v2.

>Maybe it's still ok, but we definitely need to document it better. It feels
>bad that different versions of the kernel will handle it differently, so
>the userspace has to work around it to actually use these events.

That's perfectly reasonable. I'll update the documentation to match.

>Also, please, make sure that it doesn't break memcg kselftests.

For sure.

>We don't have memory.events file for the root cgroup, so we can stop earlier.

Oh yeah, I missed that when changing from a for loop to do/while. I'll fix that 
up, thanks.
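
For readers following along, the intended behaviour can be modelled roughly as below. This is a toy Python model, not the kernel code: the class and function names are made up for illustration, and it only shows the idea of bumping the counter at each ancestor at write time while stopping before the root, which has no memory.events file.

```python
class Cgroup:
    """Toy stand-in for a memcg node; not a kernel structure."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.events = {}

    def is_root(self):
        return self.parent is None


def record_event(cgroup, event):
    """Count an event at cgroup and every ancestor except the root."""
    node = cgroup
    while node is not None and not node.is_root():
        node.events[event] = node.events.get(event, 0) + 1
        node = node.parent


root = Cgroup("root")
parent = Cgroup("parent", root)
leaf = Cgroup("leaf", parent)

record_event(leaf, "oom")     # counted at leaf and parent
record_event(parent, "high")  # counted at parent only

print(leaf.events)    # {'oom': 1}
print(parent.events)  # {'oom': 1, 'high': 1}
print(root.events)    # {}
```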

Thanks for your feedback!