Message-ID: <YnBLge4ZQNbbxufc@blackbook>
Date: Mon, 2 May 2022 23:22:09 +0200
From: Michal Koutný <mkoutny@...e.com>
To: Vasily Averin <vvs@...nvz.org>
Cc: Roman Gushchin <roman.gushchin@...ux.dev>,
Vlastimil Babka <vbabka@...e.cz>,
Shakeel Butt <shakeelb@...gle.com>, kernel@...nvz.org,
Florian Westphal <fw@...len.de>, linux-kernel@...r.kernel.org,
Michal Hocko <mhocko@...e.com>, cgroups@...r.kernel.org,
Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
Tejun Heo <tj@...nel.org>
Subject: Re: kernfs memcg accounting
Hello.
On Mon, May 02, 2022 at 10:37:49PM +0300, Vasily Averin <vvs@...nvz.org> wrote:
> I did not understand your statement. Could you please explain it in more details?
Sure, let me expand on my perhaps ambiguous and vague sentence.
> I see that cgroup_mkdir()->cgroup_create() creates new kernfs node for new
> sub-directory, and with my patch account memory of kernfs node to memcg
> of current process.
Indeed. The variants I'm comparing here are: a) charge to the creator's
memcg, b) charge to the parent (memcg ancestor) of the created cgroup.
When struct mem_cgroup charging was introduced, there was a similar
discussion [1].
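To make the comparison concrete: variant a), if I read the patch
correctly, effectively boils down to the following in
__kernfs_new_node() (a sketch, not the literal hunk):

    /* fs/kernfs/dir.c (sketch): __GFP_ACCOUNT charges the
     * allocation to the allocating task's memcg, i.e. to the
     * creator (variant a)). */
    kn = kmem_cache_zalloc(kernfs_node_cache,
                           GFP_KERNEL | __GFP_ACCOUNT);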
I can see the following aspects here:
1) absolute size of kernfs_objects,
2) practical difference between a) and b),
3) consistency with memcg,
4) v1 vs v2 behavior.
Ad 1) -- normally, I'd treat this as negligible (~120 B per struct
kernfs_node, ~10 of them per subsys, ~10 subsystems, i.e. ~12
KB/cgroup). But I guess the point of this change is exploitative users
for whom this doesn't hold [2], so the absolute size is not so
important.
Ad 2) -- in typical workloads, only the top-level cgroups are created
by some management entity and the lower levels are managed from within,
i.e. there is little difference in whom the created objects are charged
to.
Ad 3) -- struct mem_cgroup objects are charged to their hierarchical
parent, so that dying memcgs can be associated with a subtree, which is
where the reclaim can deal with them (in contrast with the creator's
cgroup). Now, if I'm looking correctly, the kernfs_node objects are not
pinned by any residual state (the subsystems' kill_css()->
css_clear_dir() runs synchronously from rmdir; the cgroup itself may
only be RCU delayed). So the memcg argument remains purely one of
consistency (with no practical reason behind it).
Ad 4) -- variant b) becomes slightly awkward when mkdir'ing a cgroup
in a non-memcg hierarchy (the charge bubbles up to the root, despite
the creator sitting in a non-root memcg).
How does this reasoning align with your original intention of
accounting for net devices? (Are the creators of net devices inside
the container?)
> Do you think it is incorrect and new kernfs node should be accounted
> to memcg of parent cgroup, as mem_cgroup_css_alloc()-> mem_cgroup_alloc() does?
I don't think either variant is incorrect. I'd very much prefer
consistency with the memcg behavior (variant b)) but, as I've listed in
the arguments above, it seems such a consistency can't be easily
justified.
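For concreteness, the memcg precedent is the set_active_memcg() dance
in mem_cgroup_css_alloc() (roughly, from mm/memcontrol.c):

    old_memcg = set_active_memcg(parent);
    memcg = mem_cgroup_alloc();
    set_active_memcg(old_memcg);

A variant b) for kernfs would have to remote-charge similarly
(hypothetical sketch; parent_memcg stands for the memcg ancestor of
the cgroup being created):

    /* hypothetical: charge the new node to the parent's memcg
     * instead of the current task's */
    old_memcg = set_active_memcg(parent_memcg);
    kn = kmem_cache_zalloc(kernfs_node_cache,
                           GFP_KERNEL | __GFP_ACCOUNT);
    set_active_memcg(old_memcg);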
> Perhaps you mean that in this case kernfs should not be counted at all,
> as almost all neighboring allocations do?
No, I think it wouldn't help here [2]. (Or which neighboring allocations
do you mean? There must be at least nr_cgroups of them.)
(Of course, there's also the traditional performance argument. The
cgroup's kernfs_node objects shouldn't be problematic, and I can't
judge the other users (sysfs), but in my eyes that's nothing that
should prevent some form of kernfs_node accounting going forward.)
HTH,
Michal
[1] https://lore.kernel.org/all/20200729171039.GA22229@blackbody.suse.cz/
[2] Unless this could be constrained by something even bigger and
accounted. But only struct mem_cgroup (recursively, its percpu stats)
comes to my mind.