linux-kernel - Re: [PATCH 0/2] hugetlb memcg accounting

Message-ID: <CAPTztWY8eDSa1qKx35hTm5ef+e13SDnRHDrevc-1V1v7-pEP3w@mail.gmail.com>
Date:   Tue, 26 Sep 2023 13:50:10 -0700
From:   Frank van der Linden <fvdl@...gle.com>
To:     Nhat Pham <nphamcs@...il.com>
Cc:     akpm@...ux-foundation.org, riel@...riel.com, hannes@...xchg.org,
        mhocko@...nel.org, roman.gushchin@...ux.dev, shakeelb@...gle.com,
        muchun.song@...ux.dev, tj@...nel.org, lizefan.x@...edance.com,
        shuah@...nel.org, mike.kravetz@...cle.com, yosryahmed@...gle.com,
        linux-mm@...ck.org, kernel-team@...a.com,
        linux-kernel@...r.kernel.org, cgroups@...r.kernel.org
Subject: Re: [PATCH 0/2] hugetlb memcg accounting

On Tue, Sep 26, 2023 at 12:49 PM Nhat Pham <nphamcs@...il.com> wrote:
>
> Currently, hugetlb memory usage is not accounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
>
> This patch series rectifies this issue by charging the memcg when the
> hugetlb folio is allocated, and uncharging when the folio is freed. In
> addition, a new selftest is added to demonstrate and verify this new
> behavior.
>
> Nhat Pham (2):
>   hugetlb: memcg: account hugetlb-backed memory in memory controller
>   selftests: add a selftest to verify hugetlb usage in memcg
>
>  MAINTAINERS                                   |   2 +
>  fs/hugetlbfs/inode.c                          |   2 +-
>  include/linux/hugetlb.h                       |   6 +-
>  include/linux/memcontrol.h                    |   8 +
>  mm/hugetlb.c                                  |  23 +-
>  mm/memcontrol.c                               |  40 ++++
>  tools/testing/selftests/cgroup/.gitignore     |   1 +
>  tools/testing/selftests/cgroup/Makefile       |   2 +
>  .../selftests/cgroup/test_hugetlb_memcg.c     | 222 ++++++++++++++++++
>  9 files changed, 297 insertions(+), 9 deletions(-)
>  create mode 100644 tools/testing/selftests/cgroup/test_hugetlb_memcg.c
>
> --
> 2.34.1
>

We've had this behavior at Google for a long time, and we're actually
getting rid of it. hugetlb pages are a precious resource that should
be accounted for separately. They are not just any memory; they are
physically contiguous memory. Charging them the same as any other
region of the same size ended up not making sense, especially for
larger hugetlb page sizes.

Additionally, if this behavior is changed just like that, there will
be quite a few workloads that will break badly because they'll hit
their limits immediately - imagine a container that uses 1G hugetlb
pages to back something large (a database, a VM), and 'plain' memory
for control processes.
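
To make the breakage concrete, here is a rough arithmetic sketch. All
numbers (16 x 1G hugetlb pages, 1 GiB of plain memory, a 2 GiB memory
limit) and the helper names are hypothetical and chosen only for
illustration; this is not the kernel's charging logic.

```python
GIB = 1 << 30

def charged_bytes(plain_bytes, hugetlb_pages, hugetlb_page_size, charge_hugetlb):
    """Bytes counted against the memory controller limit.

    When charge_hugetlb is False, hugetlb usage is tracked elsewhere and
    only plain memory counts toward the limit.
    """
    hugetlb_bytes = hugetlb_pages * hugetlb_page_size
    return plain_bytes + (hugetlb_bytes if charge_hugetlb else 0)

def exceeds_limit(limit_bytes, **usage):
    """True if the charged total is above the configured limit."""
    return charged_bytes(**usage) > limit_bytes

# Hypothetical container: a large region backed by 1G hugetlb pages,
# plus a small amount of plain memory for control processes.
container = dict(plain_bytes=1 * GIB, hugetlb_pages=16, hugetlb_page_size=GIB)
limit = 2 * GIB  # limit sized for the plain memory only

print(exceeds_limit(limit, charge_hugetlb=False, **container))  # False
print(exceeds_limit(limit, charge_hugetlb=True, **container))   # True
```

With the limit sized for plain memory only, the same container goes from
1 GiB charged to 17 GiB charged once hugetlb pages are counted, so it is
over its limit as soon as the pages are faulted in.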

What do your workloads do? Is it not possible for you to account for
hugetlb pages separately? Sure, it can be annoying to have to deal
with 2 separate totals that you need to take into account, but again,
hugetlb pages are a resource that is best dealt with separately.
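
As a minimal sketch of what dealing with two separate totals could look
like from userspace, assuming a cgroup v2 hierarchy with both the memory
and hugetlb controllers enabled (so that memory.current and
hugetlb.<hugepagesize>.current exist in the cgroup directory). The
function names and the example path are hypothetical.

```python
from pathlib import Path

def read_counter(path):
    """Read a single integer byte counter from a cgroup interface file."""
    return int(Path(path).read_text().strip())

def cgroup_totals(cgroup_dir):
    """Return (plain_bytes, {hugepage_size: bytes}) for one cgroup.

    Plain memory comes from memory.current; hugetlb usage comes from the
    per-size hugetlb.<hugepagesize>.current files.
    """
    cg = Path(cgroup_dir)
    plain = read_counter(cg / "memory.current")
    hugetlb = {}
    for f in sorted(cg.glob("hugetlb.*.current")):
        parts = f.name.split(".")
        # Keep hugetlb.<size>.current only; skip e.g. hugetlb.<size>.rsvd.current.
        if len(parts) == 3:
            hugetlb[parts[1]] = read_counter(f)
    return plain, hugetlb

def report(cgroup_dir):
    """Format the two totals side by side, one line per counter."""
    plain, hugetlb = cgroup_totals(cgroup_dir)
    lines = [f"plain memory: {plain} bytes"]
    for size, used in hugetlb.items():
        lines.append(f"hugetlb {size}: {used} bytes")
    return "\n".join(lines)
```

On a real system this might be called as
report("/sys/fs/cgroup/mycontainer"), where the path is an assumed
example and depends on how the cgroup hierarchy is laid out.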

- Frank