linux-kernel - Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by mkdir cgroup

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <ef9f7516-853d-ffe4-9a7a-5e87556bdbbe@openvz.org>
Date:   Mon, 30 May 2022 16:09:00 +0300
From:   Vasily Averin <vvs@...nvz.org>
To:     Michal Hocko <mhocko@...e.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>, kernel@...nvz.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        Shakeel Butt <shakeelb@...gle.com>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Michal Koutný <mkoutny@...e.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Muchun Song <songmuchun@...edance.com>, cgroups@...r.kernel.org
Subject: Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by
 mkdir cgroup

On 5/30/22 14:55, Michal Hocko wrote:
> On Mon 30-05-22 14:25:45, Vasily Averin wrote:
>> Below is tracing results of mkdir /sys/fs/cgroup/vvs.test on 
>> 4cpu VM with Fedora and self-complied upstream kernel. The calculations
>> are not precise, it depends on kernel config options, number of cpus,
>> enabled controllers, ignores possible page allocations etc.
>> However this is enough to clarify the general situation.
>> All allocations are splited into:
>> - common part, always called for each cgroup type
>> - per-cgroup allocations
>>
>> In each group we consider 2 corner cases:
>> - usual allocations, important for 1-2 CPU nodes/Vms
>> - percpu allocations, important for 'big irons'
>>
>> common part: 	~11Kb	+  318 bytes percpu
>> memcg: 		~17Kb	+ 4692 bytes percpu
>> cpu:		~2.5Kb	+ 1036 bytes percpu
>> cpuset:		~3Kb	+   12 bytes percpu
>> blkcg:		~3Kb	+   12 bytes percpu
>> pid:		~1.5Kb	+   12 bytes percpu		
>> perf:		 ~320b	+   60 bytes percpu
>> -------------------------------------------
>> total:		~38Kb	+ 6142 bytes percpu
>> currently accounted:	  4668 bytes percpu
>>
>> - it's important to account usual allocations called
>> in common part, because almost all of cgroup-specific allocations
>> are small. One exception here is memory cgroup, it allocates a few
>> huge objects that should be accounted.
>> - Percpu allocation called in common part, in memcg and cpu cgroups
>> should be accounted, rest ones are small an can be ignored.
>> - KERNFS objects are allocated both in common part and in most of
>> cgroups 
>>
>> Details can be found here:
>> https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/
>>
>> I checked other cgroups types was found that they all can be ignored.
>> Additionally I found allocation of struct rt_rq called in cpu cgroup 
>> if CONFIG_RT_GROUP_SCHED was enabled, it allocates huge (~1700 bytes)
>> percpu structure and should be accounted too.
> 
> One thing that the changelog is missing is an explanation why do we need
> to account those objects. Users are usually not empowered to create
> cgroups arbitrarily. Or at least they shouldn't because we can expect
> more problems to happen.
> 
> Could you clarify this please?

The problem is actual for OS-level containers: LXC or OpenVz.
They are widely used for hosting and allow to run containers
by untrusted end-users. Root inside such containers is able
to create groups inside own container and consume host memory
without its proper accounting.

Thank you,
	Vasily Averin