linux-kernel - Re: [PATCH 0/2] hugetlb memcg accounting

Message-ID: <CAKEwX=Ocm_Zn=3P0gBdJKSwyqWq3fX37OEGAjCA5vKgJb+QGvw@mail.gmail.com>
Date:   Wed, 27 Sep 2023 10:22:54 -0700
From:   Nhat Pham <nphamcs@...il.com>
To:     Johannes Weiner <hannes@...xchg.org>
Cc:     Michal Hocko <mhocko@...e.com>,
        Frank van der Linden <fvdl@...gle.com>,
        akpm@...ux-foundation.org, riel@...riel.com,
        roman.gushchin@...ux.dev, shakeelb@...gle.com,
        muchun.song@...ux.dev, tj@...nel.org, lizefan.x@...edance.com,
        shuah@...nel.org, mike.kravetz@...cle.com, yosryahmed@...gle.com,
        linux-mm@...ck.org, kernel-team@...a.com,
        linux-kernel@...r.kernel.org, cgroups@...r.kernel.org
Subject: Re: [PATCH 0/2] hugetlb memcg accounting

On Wed, Sep 27, 2023 at 9:44 AM Johannes Weiner <hannes@...xchg.org> wrote:
>
> On Wed, Sep 27, 2023 at 02:50:10PM +0200, Michal Hocko wrote:
> > On Tue 26-09-23 18:14:14, Johannes Weiner wrote:
> > [...]
> > > The fact that memory consumed by hugetlb is currently not considered
> > > inside memcg (host memory accounting and control) is inconsistent. It
> > > has been quite confusing to our service owners and complicating things
> > > for our containers team.
> >
> > I do understand how that is confusing and inconsistent as well. Hugetlb
> > is bringing throughout its existence I am afraid.
> >
> > As noted in other reply though I am not sure hugetlb pool can be
> > reasonably incorporated with a sane semantic. Neither of the regular
> > allocation nor the hugetlb reservation/actual use can fallback to the
> > pool of the other. This makes them 2 different things each hitting their
> > own failure cases that require a dedicated handling.
> >
> > Just from top of my head these are cases I do not see easy way out from:
> >       - hugetlb charge failure has two failure modes - pool empty
> >         or memcg limit reached. The former is not recoverable and
> >         should fail without any further intervention the latter might
> >         benefit from reclaiming.
> >       - !hugetlb memory charge failure cannot consider any hugetlb
> >         pages - they are implicit memory.min protection so it is
> >         impossible to manage reclaim protection without having a
> >         knowledge of the hugetlb use.
> >       - there is no way to control the hugetlb pool distribution by
> >         memcg limits. How do we distinguish reservations from actual
> >         use?
> >       - pre-allocated pool is consuming memory without any actual
> >         owner until it is actually used and even that has two stages
> >         (reserved and really used). This makes it really hard to
> >         manage memory as whole when there is a considerable amount of
> >         hugetlb memory preallocated.
>
> It's important to distinguish hugetlb access policy from memory use
> policy. This patch isn't about hugetlb access, it's about general
> memory use.
>
> Hugetlb access policy is a separate domain with separate
> answers. Preallocating is a privileged operation, for access control
> there is the hugetlb cgroup controller etc.
>
> What's missing is that once you get past the access restrictions and
> legitimately get your hands on huge pages, that memory use gets
> reflected in memory.current and exerts pressure on *other* memory
> inside the group, such as anon or optimistic cache allocations.
>
> Note that hugetlb *can* be allocated on demand. It's unexpected that
> when an application optimistically allocates a couple of 2M hugetlb
> pages those aren't reflected in its memory.current. The same is true
> for hugetlb_cma. If the gigantic pages aren't currently allocated to a
> cgroup, that CMA memory can be used for movable memory elsewhere.
>
> The points you and Frank raise are reasons and scenarios where
> additional hugetlb access control is necessary - preallocation,
> limited availability of 1G pages etc. But they're not reasons against
> charging faulted in hugetlb to the memcg *as well*.
>
> My point is we need both. One to manage competition over hugetlb,
> because it has unique limitations. The other to manage competition
> over host memory which hugetlb is a part of.
>
> Here is a usecase from our fleet.
>
> Imagine a configuration with two 32G containers. The machine is booted
> with hugetlb_cma=6G, and each container may or may not use up to 3
> gigantic pages, depending on the workload within it. The rest is anon,
> cache, slab, etc. You set the hugetlb cgroup limit of each cgroup to
> 3G to enforce hugetlb fairness. But how do you configure memory.max to
> keep *overall* consumption, including anon, cache, slab etc. fair?
>
> If used hugetlb is charged, you can just set memory.max=32G regardless
> of the workload inside.
>
> Without it, you'd have to constantly poll hugetlb usage and readjust
> memory.max!

Yep, and I'd like to add that this can cause, and has caused, issues
in our production systems when there is a delay in correcting the
memory limits (low or max). The userspace agent in charge of correcting
these only runs periodically, and between consecutive runs the system
could be in an over- or underprotected state. Charging hugetlb to the
memory controller instantaneously would close this gap.
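
As a rough illustration of that gap, here is a sketch of the arithmetic a
polling agent has to do when hugetlb is not charged to memcg. It is not
the actual agent: the function names are hypothetical, and the 32G budget
and 3G hugetlb figures are simply borrowed from the example quoted above.

```python
GIB = 1 << 30


def memory_max_without_hugetlb_charging(budget_bytes, hugetlb_usage_bytes):
    """memory.max the agent must set so that non-hugetlb memory plus the
    hugetlb usage it observed stays within the container's overall budget.
    (Hypothetical helper; assumes hugetlb is NOT charged to memcg.)"""
    if hugetlb_usage_bytes > budget_bytes:
        raise ValueError("hugetlb usage exceeds the overall budget")
    return budget_bytes - hugetlb_usage_bytes


def overshoot_between_runs(budget_bytes, usage_at_last_run, usage_now):
    """Bytes by which the container can exceed its budget while memory.max
    still reflects the hugetlb usage seen at the agent's previous run."""
    stale_max = memory_max_without_hugetlb_charging(budget_bytes, usage_at_last_run)
    return max(0, stale_max + usage_now - budget_bytes)


if __name__ == "__main__":
    budget = 32 * GIB
    # Agent last ran when the container held no gigantic pages; it has
    # since faulted in three 1G pages.
    print(overshoot_between_runs(budget, 0, 3 * GIB) // GIB, "GiB over budget")
```

With hugetlb charged to memcg, the agent would not need this correction
at all: memory.max could stay at the full budget regardless of how much
hugetlb the workload is using at any moment.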

I think we need both a HugeTLB controller and memory controller
accounting.