linux-kernel - Re: cgroup: Clarification around usage_in_bytes and its relation to the page counter

Message-ID: <p37th7kmkn3njp6nuu22qi5vnse3mdhlqen4wlk3ps26bdaujd@prgdu3vtm47y>
Date:   Tue, 2 May 2023 19:44:57 +0200
From:   Michal Koutný <mkoutny@...e.com>
To:     Michael Honaker <mchonaker@...il.com>
Cc:     cgroups@...r.kernel.org, linux-mm@...ck.org,
        linux-kernel@...r.kernel.org
Subject: Re: cgroup: Clarification around usage_in_bytes and its relation to
 the page counter

Hello Michael.

On Wed, Apr 12, 2023 at 09:22:07PM -0400, Michael Honaker <mchonaker@...il.com> wrote:
> I have been trying to get an accurate measurement of memory usage of a
> non-root cgroup, specifically a Kubernetes container,

Beware that containers are more or less based on sharing resources.
Shared accounting is difficult, and hence an _accurate_ measurement may
not be available, or the numbers need some amount of interpretation.

> and noticed some inconsistencies when comparing the value of
> `memory.usage_in_bytes` with the information in `memory.stat`. After
> further investigation of the cgroup docs
> (/admin-guide/cgroups/memory.rst#usage_in_bytes) and an old LKML
> thread ("real meaning of memory.usage_in_bytes"),

[OT: I suggest you move to cgroup v2, the entities are IMO better named
and it's also more future-proof ;-)]

> I came to the understanding that `usage_in_bytes` actually shows the
> value of the resource counter which is an overestimation due to the
> counter being split into per-cpu chunks for caching,

I didn't read the thread, but it's true that per-cpu batching may result
in an error (of either sign, in theory). Since around v5.13 the
implementation has changed and the error should be smaller:
O(nr_cpus * nr_cgroups(subtree) * MEMCG_CHARGE_BATCH) -> O(nr_cpus * MEMCG_CHARGE_BATCH).
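
To get a feel for the magnitudes, here is a rough worked example in
Python. The batch of 64 pages, the 4 KiB page size, and the CPU and
cgroup counts are assumptions for illustration only (check
MEMCG_CHARGE_BATCH in your kernel tree), and the O() terms hide constant
factors, so treat the result as an order-of-magnitude sketch.

```python
# Rough bound on the per-cpu batching error, following the O() terms above.
# All inputs are assumptions for illustration, not values read from a kernel.
PAGE_SIZE = 4096          # bytes, assumed 4 KiB pages
MEMCG_CHARGE_BATCH = 64   # pages, assumed; depends on the kernel version


def batching_error_bound(nr_cpus, nr_cgroups=1,
                         batch=MEMCG_CHARGE_BATCH, page_size=PAGE_SIZE):
    """Return nr_cpus * nr_cgroups * batch, converted to bytes."""
    return nr_cpus * nr_cgroups * batch * page_size


if __name__ == "__main__":
    old = batching_error_bound(nr_cpus=16, nr_cgroups=100)  # older scheme
    new = batching_error_bound(nr_cpus=16)                  # newer scheme
    print(f"old: {old // 2**20} MiB, new: {new // 2**20} MiB")
```

With these made-up inputs the older bound grows with the number of
cgroups in the subtree, while the newer one depends only on the CPU
count.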

> and that the real usage can be calculated from RSS+Cache gathered from
> `memory.stat`.  I've created cadvisor issue #3286
> (https://github.com/google/cadvisor/issues/3286) which goes into
> greater detail on my investigation with examples.

The difference that you spot there is not caused (merely) by the per-cpu
optimization.
What you see as the difference is mainly kernel memory (e.g. dentries,
inodes, task_struct, ...) -- RSS+Cache only shows memory that
userspace is directly responsible for, not the kernel structures
(whose size depends on the kernel implementation, after all).
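
A minimal Python sketch of that comparison on v1 could look like the
following. The sample numbers are made up, and the helper names
(parse_memory_stat, unexplained_bytes) are hypothetical; on a live
system you would feed in the contents of memory.stat and
memory.usage_in_bytes of the cgroup in question.

```python
def parse_memory_stat(text):
    """Parse 'key value' lines of a memory.stat file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def unexplained_bytes(usage_in_bytes, stat):
    """usage_in_bytes minus the userspace-attributable part (rss + cache)."""
    return usage_in_bytes - (stat.get("rss", 0) + stat.get("cache", 0))


# Made-up sample values, not taken from a real system.
sample_stat = "cache 104857600\nrss 52428800\nmapped_file 1048576\n"
sample_usage = 178257920  # pretend content of memory.usage_in_bytes

gap = unexplained_bytes(sample_usage, parse_memory_stat(sample_stat))
print(gap)  # bytes not covered by rss + cache in this sample
```

In this sample the gap is 20 MiB, which would be mostly kernel memory
plus whatever the per-cpu batching contributes. For a cgroup with
children, the total_rss and total_cache fields are probably the more
appropriate ones to sum.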

(On v2, you can see a breakdown of the kernel memory usage, among other
things, in memory.stat.)
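
As a sketch of reading that breakdown, the Python snippet below picks
the kernel-side entries out of a v2 memory.stat text. The field list is
my assumption based on the cgroup v2 documentation and is not
exhaustive; the sample input is made up, and kernel_breakdown is a
hypothetical helper name.

```python
# Field names assumed from the cgroup v2 memory.stat documentation; older
# kernels may lack some of them, so missing keys are simply skipped.
KERNEL_FIELDS = ("kernel_stack", "pagetables", "percpu", "sock", "slab")


def kernel_breakdown(stat_text):
    """Return the kernel-side entries found in a cgroup v2 memory.stat text."""
    stats = {}
    for line in stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return {name: stats[name] for name in KERNEL_FIELDS if name in stats}


# Made-up sample, not taken from a real system.
sample = (
    "anon 52428800\n"
    "file 104857600\n"
    "kernel_stack 1048576\n"
    "pagetables 2097152\n"
    "slab 16777216\n"
)
breakdown = kernel_breakdown(sample)
print(breakdown, sum(breakdown.values()))
```

On a real system the input would come from the cgroup's memory.stat
file under the cgroup2 mount point.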

> Is the above understanding still correct with the new page counters?
> If so, could any memory allocations be reflected in `usage_in_bytes`
> but not in `stat` for child cgroups? I want to ensure I'm not
> missing anything by only monitoring the `stat` file.

I hope the above sheds some light on these questions.

Michal
