linux-kernel - Re: kernfs memcg accounting

Message-ID: <YnsoMEuWjlpDcmt3@carbon>
Date:   Tue, 10 May 2022 20:06:24 -0700
From:   Roman Gushchin <roman.gushchin@...ux.dev>
To:     Vasily Averin <vvs@...nvz.org>
Cc:     Michal Koutný <mkoutny@...e.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Shakeel Butt <shakeelb@...gle.com>, kernel@...nvz.org,
        Florian Westphal <fw@...len.de>, linux-kernel@...r.kernel.org,
        Michal Hocko <mhocko@...e.com>, cgroups@...r.kernel.org,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Tejun Heo <tj@...nel.org>
Subject: Re: kernfs memcg accounting

On Wed, May 04, 2022 at 12:00:18PM +0300, Vasily Averin wrote:
> On 5/3/22 00:22, Michal Koutný wrote:
> > When struct mem_cgroup charging was introduced, there was a similar
> > discussion [1].
> 
> Thank you, I missed this patch; it was very interesting and useful.
> I would note, though, that OpenVZ and LXC have another use case:
> we have separate and independent systemd instances inside OS containers.
> So a container's cgroups are created not in the host's root memcg but
> inside the accountable container's root memcg.
> 
> > I can see the following aspects here:
> > 1) absolute size of kernfs_objects,
> > 2) practical difference between a) and b),
> > 3) consistency with memcg,
> > 4) v1 vs v2 behavior.
> ...
> > How do these reasonings align with your original intention of net
> > devices accounting? (Are the creators of net devices inside the
> > container?)
> 
> It is possible to create netdevices in one namespace/container
> and then move them to another one, and this possibility is widely used.
> With my patch, memory allocated by these devices will not be accounted
> to the new memcg; however, I do not think it is a problem.
> My patches mostly protect the host from misuse, when someone creates
> a huge number of netdevices inside a container.
> 
> >> Do you think it is incorrect and new kernfs node should be accounted
> >> to memcg of parent cgroup, as mem_cgroup_css_alloc()-> mem_cgroup_alloc() does?
> > 
> > I don't think either variant is incorrect. I'd very much prefer the
> > consistency with memcg behavior (variant a)) but as I've listed the
> > arguments above, it seems such a consistency can't be easily justified.
> 
> From my point of view, it is most important to account allocated memory
> to any cgroup inside the container. Selecting the proper memcg is a secondary goal here.
> Frankly speaking, I do not see a big difference between the memcg of the current process,
> the memcg of the newly created child and the memcg of its parent.
> 
> As far as I understand, Roman chose the parent memcg because it was a special
> case of creating a new memory cgroup. He temporarily changed the active memcg
> in mem_cgroup_css_alloc() and properly accounted all required memcg-specific
> allocations.

My primary goal was to apply memory pressure to memory cgroups with a lot
of (dying) child cgroups. On a multi-CPU machine a memory cgroup structure
is way larger than a page, so a cgroup which looks small can be really large
if we calculate the amount of memory taken by all of its children's memcg internals.
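
To make the scale concrete, here is a rough back-of-the-envelope sketch. Every
number in it (page size, per-CPU bytes, base structure size, child count) is an
assumption chosen for illustration, not a measurement of any real kernel, and
pinned_bytes() is a hypothetical helper:

```python
PAGE_SIZE = 4096  # bytes; assumed page size, typical on x86-64


def pinned_bytes(n_children, n_cpus, percpu_bytes_per_cpu, base_bytes):
    """Rough memory held by the memcg internals of n_children cgroups.

    Each child is modelled as one fixed-size structure plus a per-CPU
    area that is replicated once per CPU. All sizes are illustrative.
    """
    per_cgroup = base_bytes + n_cpus * percpu_bytes_per_cpu
    return n_children * per_cgroup


if __name__ == "__main__":
    # Assumed: 1000 dying children, 64 CPUs, 2 KiB of per-CPU data per CPU,
    # 4 KiB of non-per-CPU structure per cgroup.
    total = pinned_bytes(n_children=1000, n_cpus=64,
                         percpu_bytes_per_cpu=2048, base_bytes=4096)
    print(total // PAGE_SIZE, "pages")
```

With these made-up inputs each child costs 33 pages, so the per-CPU part
dominates and grows linearly with the CPU count.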

Applying this pressure to another cgroup (e.g. the one which contains systemd)
doesn't help to reclaim any pages which are pinning the dying cgroups.

For other controllers (maybe blkcg aside, I don't know) it shouldn't matter, because
there is no such problem there.

For consistency reasons I'd suggest charging all *large* allocations
(e.g. percpu) to the parent cgroup. Small allocations can be ignored.
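
As a toy model of that suggestion (not kernel code): the Cgroup class, the
charge_on_create() helper and the 4096-byte cutoff are all hypothetical, since
no exact definition of "large" is given here.

```python
from dataclasses import dataclass


@dataclass
class Cgroup:
    name: str
    charged: int = 0  # bytes accounted to this cgroup


def charge_on_create(parent, allocations, threshold=4096):
    """Charge the parent for each allocation at or above the threshold.

    Smaller allocations are left unaccounted. The threshold is an
    assumed cutoff, used only to illustrate the large/small split.
    """
    for size in allocations:
        if size >= threshold:
            parent.charged += size
    return parent.charged


if __name__ == "__main__":
    parent = Cgroup("parent")
    # Assumed sizes: a small node, a large per-CPU area, another small object.
    charge_on_create(parent, [128, 131072, 64])
    print(parent.name, parent.charged)
```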

Thanks!