Date:   Wed, 4 May 2022 12:00:18 +0300
From:   Vasily Averin <vvs@...nvz.org>
To:     Michal Koutný <mkoutny@...e.com>,
        Roman Gushchin <roman.gushchin@...ux.dev>
Cc:     Vlastimil Babka <vbabka@...e.cz>,
        Shakeel Butt <shakeelb@...gle.com>, kernel@...nvz.org,
        Florian Westphal <fw@...len.de>, linux-kernel@...r.kernel.org,
        Michal Hocko <mhocko@...e.com>, cgroups@...r.kernel.org,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Tejun Heo <tj@...nel.org>
Subject: Re: kernfs memcg accounting

On 5/3/22 00:22, Michal Koutný wrote:
> When struct mem_cgroup charging was introduced, there was a similar
> discussion [1].

Thank you, I had missed this patch; it was very interesting and useful.
I would note, though, that OpenVZ and LXC have another use case:
we run separate and independent systemd instances inside OS containers.
So the container's cgroups are created not in the host's root memcg but
inside the container's own accounted root memcg.

> I can see following aspects here:
> 1) absolute size of kernfs_objects,
> 2) practical difference between a) and b),
> 3) consistency with memcg,
> 4) v1 vs v2 behavior.
...
> How do these reasonings align with your original intention of net
> devices accounting? (Are the creators of net devices inside the
> container?)

It is possible to create a netdevice in one namespace/container
and then move it to another one, and this possibility is widely used.
With my patch the memory allocated by such devices will not be accounted
to the new memcg, but I do not think this is a problem.
My patches mostly protect the host from misuse, when someone creates
a huge number of netdevices inside a container.
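
(Not the actual patch, just a minimal sketch of the kind of change being
discussed: passing __GFP_ACCOUNT so an allocation made on behalf of a
netdevice is charged to the memcg of the creating task. The structure and
function names below are illustrative only.)

	#include <linux/slab.h>

	/* Illustrative private data of a hypothetical netdevice. */
	struct example_netdev_priv {
		unsigned long stats[16];
	};

	static struct example_netdev_priv *example_alloc_priv(void)
	{
		/*
		 * GFP_KERNEL_ACCOUNT == GFP_KERNEL | __GFP_ACCOUNT: the memory
		 * is charged to the memcg of "current" (the creator) and stays
		 * charged there even if the device is later moved to another
		 * namespace.
		 */
		return kzalloc(sizeof(struct example_netdev_priv),
			       GFP_KERNEL_ACCOUNT);
	}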

>> Do you think it is incorrect and new kernfs node should be accounted
>> to memcg of parent cgroup, as mem_cgroup_css_alloc()-> mem_cgroup_alloc() does?
> 
> I don't think either variant is incorrect. I'd very much prefer the
> consistency with memcg behavior (variant a)) but as I've listed the
> arguments above, it seems such a consistency can't be easily justified.

From my point of view, the most important thing is to account the allocated memory
to some cgroup inside the container. Selecting the proper memcg is a secondary goal here.
Frankly speaking, I do not see a big difference between the memcg of the current process,
the memcg of the newly created child and the memcg of its parent.

As far as I understand, Roman chose the parent memcg because it was a special
case: the creation of a new memory cgroup. He temporarily changed the active memcg
in mem_cgroup_css_alloc() and properly accounted all required memcg-specific
allocations.
However, the rather large struct mem_cgroup itself was left unaccounted,
so I think we need not worry about the 128 bytes of a kernfs node.
Yes, it will be accounted to some other memcg, but the same thing
happens with the kernfs nodes of other cgroups.
I don't think that's a problem.
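
(For illustration only, a simplified sketch of the "temporarily switch the
active memcg" pattern mentioned above; not the exact upstream code around
mem_cgroup_css_alloc().)

	#include <linux/sched/mm.h>
	#include <linux/slab.h>

	static void *example_alloc_charged_to(struct mem_cgroup *memcg, size_t size)
	{
		struct mem_cgroup *old_memcg;
		void *ptr;

		old_memcg = set_active_memcg(memcg);	/* returns the previous active memcg */
		ptr = kzalloc(size, GFP_KERNEL_ACCOUNT);	/* charged to "memcg", not to current's memcg */
		set_active_memcg(old_memcg);		/* restore */

		return ptr;
	}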

>> Perhaps you mean that in this case kernfs should not be counted at all,
>> as almost all neighboring allocations do?
> 
> No, I think it wouldn't help here [2]. (Or which neighboring allocations
> do you mean? There must be at least nr_cgroups of them.)

Primarily I mean the struct mem_cgroup allocation in mem_cgroup_alloc().
However, I think we also need to take into account the other allocations made
inside cgroup_mkdir(): struct cgroup and the kernfs node in the common part, and
any other cgroup-specific allocations in the individual .css_alloc() functions.
They can all be triggered from inside a container, allocate non-accounted
memory, and in this way can theoretically be misused.
So I'm going to check this scenario a bit later.
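
(As a sketch of one possible direction, not necessarily the final patch: the
kernfs node allocations could be made accountable by creating their slab cache
with SLAB_ACCOUNT, so each object is charged to the memcg of the allocating
task. The flags other than SLAB_ACCOUNT below are illustrative.)

	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
					      sizeof(struct kernfs_node), 0,
					      SLAB_PANIC | SLAB_ACCOUNT,
					      NULL);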

Thank you,
	Vasily Averin
