Message-ID: <1aa4cd22-fcb6-0e8d-a1c6-23661d618864@openvz.org>
Date:   Fri, 6 May 2022 11:37:40 +0300
From:   Vasily Averin <vvs@...nvz.org>
To:     Michal Koutný <mkoutny@...e.com>
Cc:     Roman Gushchin <roman.gushchin@...ux.dev>,
        Vlastimil Babka <vbabka@...e.cz>,
        Shakeel Butt <shakeelb@...gle.com>, kernel@...nvz.org,
        Florian Westphal <fw@...len.de>, linux-kernel@...r.kernel.org,
        Michal Hocko <mhocko@...e.com>, cgroups@...r.kernel.org,
        Greg Kroah-Hartman <gregkh@...uxfoundation.org>,
        Tejun Heo <tj@...nel.org>
Subject: Re: kernfs memcg accounting

On 5/5/22 12:47, Michal Koutný wrote:
> On Thu, May 05, 2022 at 12:16:12AM +0300, Vasily Averin <vvs@...nvz.org> wrote:
>> I think it should allocate at least 2 pages.
> 
> After decoding kmalloc_type(), I agree this falls into a global
> (unaccounted) kmalloc_cache.
> 
>> However, if cgroup_mkdir() calls mem_cgroup_alloc(), it correctly accounts the huge percpu
>> allocations but ignores the neighbouring multi-page allocation.
> 
> So, the spillover is bounded and proportional to the memcg limit (the same
> ratio as these two sizes).
> But it may be better to account it properly, especially if it's a
> contribution from an offlined mem_cgroup.
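
For reference, this is roughly how kmalloc_type() picks the cache (a lightly
trimmed paraphrase of include/linux/slab.h from an upstream kernel of that
time; the exact code depends on kernel version and config):

static __always_inline enum kmalloc_cache_type kmalloc_type(gfp_t flags)
{
	/* Plain GFP_KERNEL lands here: the global, unaccounted kmalloc-* caches. */
	if (likely((flags & KMALLOC_NOT_NORMAL_BITS) == 0))
		return KMALLOC_NORMAL;

	/* Priorities: __GFP_DMA, then __GFP_RECLAIMABLE, then __GFP_ACCOUNT. */
	if (IS_ENABLED(CONFIG_ZONE_DMA) && (flags & __GFP_DMA))
		return KMALLOC_DMA;
	if (!IS_ENABLED(CONFIG_MEMCG_KMEM) || (flags & __GFP_RECLAIMABLE))
		return KMALLOC_RECLAIM;

	/* Only __GFP_ACCOUNT allocations end up in the per-memcg kmalloc-cg-* caches. */
	return KMALLOC_CGROUP;
}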

I've traced mkdir /sys/fs/cgroup/vvs.test on a 4-CPU VM running Fedora
with a self-compiled upstream kernel; see the table with results below.
These numbers are not precise: they depend on kernel config options,
the number of CPUs and enabled controllers, and ignore possible page
allocations, etc. However, I think this is enough to clarify the general
situation.

Results:
- The total sum of the traced allocations is ~60Kb.
- Only the 2 huge percpu allocations marked '=' are accounted, ~18Kb
  (and this can be 0 without the memory controller).
- kernfs nodes and iattrs are among the main memory consumers;
  they are marked '++' and should be accounted first
  (a rough sketch of how this could look follows the list).
- cgroup_mkdir() always allocates 4Kb,
  so I think it should be accounted first too.
- mem_cgroup_css_alloc() allocations consume ~10Kb,
  which is enough to be worth accounting, especially on VMs with 1-2 CPUs.
- Almost all other allocations are quite small and can be ignored.
  The exceptions are the percpu allocations in alloc_fair_sched_group();
  these can consume a significant amount of memory on nodes
  with multiple processors.
  They are marked '+' and can be accounted later.
- kernfs nodes consume ~6Kb of memory inside simple_xattr_set()
  and simple_xattr_alloc(). These numbers are quite high,
  but not critical, and I think we can ignore them for the moment.
- If all the proposed allocations are accounted, this gives us ~47Kb,
  or ~75% of all allocated memory.
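
The usual way to get these objects accounted is SLAB_ACCOUNT on the slab
cache, or GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT) at the allocation
site. A rough, untested sketch for the kernfs caches, paraphrasing
fs/kernfs/mount.c (illustration only, not a patch):

void __init kernfs_init(void)
{
	/* kernfs_node objects: 49 x 128 bytes = ~6Kb in the trace above */
	kernfs_node_cache = kmem_cache_create("kernfs_node_cache",
					      sizeof(struct kernfs_node), 0,
					      SLAB_PANIC | SLAB_ACCOUNT, NULL);

	/* kernfs inode attributes: 49 x 88 bytes = ~4Kb in the trace above */
	kernfs_iattrs_cache = kmem_cache_create("kernfs_iattrs_cache",
					      sizeof(struct kernfs_iattrs), 0,
					      SLAB_PANIC | SLAB_ACCOUNT, NULL);
}

One-off allocations, like the 4Kb one seen at the cgroup_mkdir() call site,
would instead pass GFP_KERNEL_ACCOUNT to kmalloc()/kzalloc().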

Any comments are welcome.

Thank you,
	Vasily Averin

allocs  bytes/  total   running note    call_site
        alloc   ($1*$2) sum
------------------------------------------------------------
1       14448   14448   14448   =       percpu_alloc_percpu:
1       8192    8192    22640   ++      (mem_cgroup_css_alloc+0x54)
49      128     6272    28912   ++      (__kernfs_new_node+0x4e)
49      96      4704    33616   ?       (simple_xattr_alloc+0x2c)
49      88      4312    37928   ++      (__kernfs_iattrs+0x56)
1       4096    4096    42024   ++      (cgroup_mkdir+0xc7)
1       3840    3840    45864   =       percpu_alloc_percpu:
4       512     2048    47912   +       (alloc_fair_sched_group+0x166)
4       512     2048    49960   +       (alloc_fair_sched_group+0x139)
1       2048    2048    52008   ++      (mem_cgroup_css_alloc+0x109)
49      32      1568    53576   ?       (simple_xattr_set+0x5b)
2       584     1168    54744           (radix_tree_node_alloc.constprop.0+0x8d)
1       1024    1024    55768           (cpuset_css_alloc+0x30)
1       1024    1024    56792           (alloc_shrinker_info+0x79)
1       768     768     57560           percpu_alloc_percpu:
1       640     640     58200           (sched_create_group+0x1c)
33      16      528     58728           (__kernfs_new_node+0x31)
1       512     512     59240           (pids_css_alloc+0x1b)
1       512     512     59752           (blkcg_css_alloc+0x39)
9       48      432     60184           percpu_alloc_percpu:
13      32      416     60600           (__kernfs_new_node+0x31)
1       384     384     60984           percpu_alloc_percpu:
1       256     256     61240           (perf_cgroup_css_alloc+0x1c)
1       192     192     61432           percpu_alloc_percpu:
1       64      64      61496           (mem_cgroup_css_alloc+0x363)
1       32      32      61528           (ioprio_alloc_cpd+0x39)
1       32      32      61560           (ioc_cpd_alloc+0x39)
1       32      32      61592           (blkcg_css_alloc+0x6b)
1       32      32      61624           (alloc_fair_sched_group+0x52)
1       32      32      61656           (alloc_fair_sched_group+0x2e)
3       8       24      61680           (__kernfs_new_node+0x31)
3       8       24      61704           (alloc_cpumask_var_node+0x1b)
1       24      24      61728           percpu_alloc_percpu:
