Date:   Mon, 30 May 2022 22:58:30 +0300
From:   Vasily Averin <vvs@...nvz.org>
To:     Michal Hocko <mhocko@...e.com>
Cc:     Andrew Morton <akpm@...ux-foundation.org>, kernel@...nvz.org,
        linux-kernel@...r.kernel.org, linux-mm@...ck.org,
        Shakeel Butt <shakeelb@...gle.com>,
        Roman Gushchin <roman.gushchin@...ux.dev>,
        Michal Koutný <mkoutny@...e.com>,
        Vlastimil Babka <vbabka@...e.cz>,
        Muchun Song <songmuchun@...edance.com>, cgroups@...r.kernel.org
Subject: Re: [PATCH mm v3 0/9] memcg: accounting for objects allocated by
 mkdir cgroup

On 5/30/22 17:22, Michal Hocko wrote:
> On Mon 30-05-22 16:09:00, Vasily Averin wrote:
>> On 5/30/22 14:55, Michal Hocko wrote:
>>> On Mon 30-05-22 14:25:45, Vasily Averin wrote:
>>>> Below is tracing results of mkdir /sys/fs/cgroup/vvs.test on 
>>>> 4-CPU VM with Fedora and a self-compiled upstream kernel. The calculations
>>>> are not precise: they depend on kernel config options, number of CPUs,
>>>> enabled controllers, ignore possible page allocations etc.
>>>> However this is enough to clarify the general situation.
>>>> All allocations are split into:
>>>> - common part, always called for each cgroup type
>>>> - per-cgroup allocations
>>>>
>>>> In each group we consider 2 corner cases:
>>>> - usual allocations, important for 1-2 CPU nodes/VMs
>>>> - percpu allocations, important for 'big iron'
>>>>
>>>> common part:           ~11Kb  +  318 bytes percpu
>>>> memcg:                 ~17Kb  + 4692 bytes percpu
>>>> cpu:                  ~2.5Kb  + 1036 bytes percpu
>>>> cpuset:                 ~3Kb  +   12 bytes percpu
>>>> blkcg:                  ~3Kb  +   12 bytes percpu
>>>> pid:                  ~1.5Kb  +   12 bytes percpu
>>>> perf:                  ~320b  +   60 bytes percpu
>>>> -------------------------------------------------
>>>> total:                 ~38Kb  + 6142 bytes percpu
>>>> currently accounted:            4668 bytes percpu
>>>>
>>>> - It's important to account the usual allocations made in the
>>>> common part, because almost all cgroup-specific allocations
>>>> are small. The one exception is the memory cgroup, which allocates
>>>> a few huge objects that should be accounted.
>>>> - Percpu allocations made in the common part and in the memcg and cpu
>>>> cgroups should be accounted; the rest are small and can be ignored.
>>>> - KERNFS objects are allocated both in the common part and in most
>>>> cgroups.
>>>>
>>>> Details can be found here:
>>>> https://lore.kernel.org/all/d28233ee-bccb-7bc3-c2ec-461fd7f95e6a@openvz.org/
>>>>
>>>> I checked the other cgroup types and found that they can all be ignored.
>>>> Additionally, I found an allocation of struct rt_rq made in the cpu cgroup
>>>> when CONFIG_RT_GROUP_SCHED is enabled. It allocates a huge (~1700 bytes)
>>>> percpu structure and should be accounted too.
>>>
>>> One thing that the changelog is missing is an explanation of why we need
>>> to account those objects. Users are usually not empowered to create
>>> cgroups arbitrarily. Or at least they shouldn't because we can expect
>>> more problems to happen.
>>>
>>> Could you clarify this please?
>>
>> The problem is relevant for OS-level containers such as LXC or OpenVZ.
>> They are widely used for hosting and allow untrusted end users
>> to run containers. Root inside such a container is able
>> to create cgroups inside its own container and consume host memory
>> without proper accounting.
> 
> Is the unaccounted memory really the biggest problem here?
> IIRC having really huge cgroup trees can hurt quite some controllers.
> E.g. how does the cpu controller deal with too many or too deep
> hierarchies?

Could you please describe it in more detail?
Maybe it passed me by, or maybe I missed or forgot something,
however I cannot remember any other practical cgroup-related issues.

Maybe deep hierarchies do not work well;
however, I have not heard that the internal configuration of a cgroup
can affect the upper level too.

Please let me know if this can happen, this is very interesting for us.

In our case, the hoster configures only the top level of the cgroup
and does not worry about possible misconfiguration inside containers
if it does not affect other containers or the host itself.

Unaccounted memory, on the contrary, can affect both neighboring containers
and the host system. We have seen this many times, and therefore we pay
special attention to such issues.
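
As a rough illustration of the scale, the traced totals quoted above can be
turned into a per-cgroup estimate. The sketch below is in Python; the function
name and the dictionary are made up for this example, "Kb" is assumed to mean
1024 bytes, and it assumes that only the 4668 percpu bytes listed as currently
accounted are charged today. The figures are the approximate ones from the
trace, so the result is an order-of-magnitude estimate only.

```python
# Rough estimate of memory consumed by mkdir of one cgroup, based on the
# approximate figures from the trace summary quoted in this thread.
# Assumptions (not from the kernel): 1 Kb = 1024 bytes, and only the
# "currently accounted" percpu bytes are charged to the memcg today.

# component -> (usual bytes, percpu bytes)
ALLOCATIONS = {
    "common": (11 * 1024, 318),
    "memcg": (17 * 1024, 4692),
    "cpu": (2560, 1036),      # ~2.5Kb
    "cpuset": (3 * 1024, 12),
    "blkcg": (3 * 1024, 12),
    "pid": (1536, 12),        # ~1.5Kb
    "perf": (320, 60),
}

ACCOUNTED_PERCPU = 4668  # bytes per CPU already accounted, per the summary


def estimate(ncpus, ncgroups=1):
    """Return (total, accounted, unaccounted) bytes for ncgroups cgroups."""
    usual = sum(u for u, _ in ALLOCATIONS.values())
    percpu = sum(p for _, p in ALLOCATIONS.values())
    total = (usual + percpu * ncpus) * ncgroups
    accounted = ACCOUNTED_PERCPU * ncpus * ncgroups
    return total, accounted, total - accounted


if __name__ == "__main__":
    for cpus in (4, 64):
        total, accounted, unaccounted = estimate(cpus)
        print(f"{cpus} CPUs: total={total} accounted={accounted} "
              f"unaccounted={unaccounted} bytes per cgroup")
```

With these numbers, estimate(ncpus=4) gives roughly 62Kb per cgroup on the
4-CPU test VM, of which about 44Kb is unaccounted, and the percpu term
dominates as the CPU count grows.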

Thank you,
	Vasily Averin
