linux-kernel - Re: kernfs memcg accounting

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <Ynv7+VG+T2y9rpdk@carbon>
Date:   Wed, 11 May 2022 11:10:01 -0700
From:   Roman Gushchin <roman.gushchin@...ux.dev>
To:     Michal Koutný <mkoutny@...e.com>
Cc:     Vasily Averin <vvs@...nvz.org>, Vlastimil Babka <vbabka@...e.cz>,
        Shakeel Butt <shakeelb@...gle.com>, kernel@...nvz.org,
        Florian Westphal <fw@...len.de>, linux-kernel@...r.kernel.org,
        Michal Hocko <mhocko@...e.com>, cgroups@...r.kernel.org,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Tejun Heo <tj@...nel.org>
Subject: Re: kernfs memcg accounting

On Wed, May 11, 2022 at 06:34:39PM +0200, Michal Koutny wrote:
> On Tue, May 10, 2022 at 08:06:24PM -0700, Roman Gushchin <roman.gushchin@...ux.dev> wrote:
> > My primary goal was to apply the memory pressure on memory cgroups with a lot
> > of (dying) children cgroups. On a multi-cpu machine a memory cgroup structure
> > is way larger than a page, so a cgroup which looks small can be really large
> > if we calculate the amount of memory taken by all children memcg internals.
> > 
> > Applying this pressure to another cgroup (e.g. the one which contains systemd)
> > doesn't help to reclaim any pages which are pinning the dying cgroups.
> 
> Just a note -- this another usecase of cgroups created from within the
> subtree (e.g. a container). I agree that cgroup-manager/systemd case is
> also valid (as dying memcgs may accumulate after a restart).
> 
> memcgs with their retained state with footprint are special.
> 
> > For other controllers (maybe blkcg aside, idk) it shouldn't matter, because
> > there is no such problem there.
> > 
> > For consistency reasons I'd suggest to charge all *large* allocations
> > (e.g. percpu) to the parent cgroup. Small allocations can be ignored.
> 
> Strictly speaking, this would mean that any controller would have on
> implicit dependency on the memory controller (such as io controller
> has).
> In the extreme case even controller-less hierarchy would have such a
> requirement (for precise kernfs_node accounting).
> Such a dependency is not enforceable on v1 (with various topologies of
> different hierarchies).
>
> Although, I initially favored the consistency with memory controller too,
> I think it's simpler to charge to the creator's memcg to achieve
> consistency across v1 and v2 :-)

Ok, v1/v2 consistency is a valid point.

As I said, I'm fine with both options, it shouldn't matter that much
for anything except the memory controller: cgroup internal objects are not
that large and the total memory footprint is usually small unless we have
a lot of (dying) sub-cgroups. From my experience no other controllers
should be affected (blkcg was affected due to a cgwb reference, but should
be fine now), so it's not an issue at all.

Thanks!