[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]
Message-ID: <CAKEwX=Ocm_Zn=3P0gBdJKSwyqWq3fX37OEGAjCA5vKgJb+QGvw@mail.gmail.com>
Date: Wed, 27 Sep 2023 10:22:54 -0700
From: Nhat Pham <nphamcs@...il.com>
To: Johannes Weiner <hannes@...xchg.org>
Cc: Michal Hocko <mhocko@...e.com>,
Frank van der Linden <fvdl@...gle.com>,
akpm@...ux-foundation.org, riel@...riel.com,
roman.gushchin@...ux.dev, shakeelb@...gle.com,
muchun.song@...ux.dev, tj@...nel.org, lizefan.x@...edance.com,
shuah@...nel.org, mike.kravetz@...cle.com, yosryahmed@...gle.com,
linux-mm@...ck.org, kernel-team@...a.com,
linux-kernel@...r.kernel.org, cgroups@...r.kernel.org
Subject: Re: [PATCH 0/2] hugetlb memcg accounting
On Wed, Sep 27, 2023 at 9:44 AM Johannes Weiner <hannes@...xchg.org> wrote:
>
> On Wed, Sep 27, 2023 at 02:50:10PM +0200, Michal Hocko wrote:
> > On Tue 26-09-23 18:14:14, Johannes Weiner wrote:
> > [...]
> > > The fact that memory consumed by hugetlb is currently not considered
> > > inside memcg (host memory accounting and control) is inconsistent. It
> > > has been quite confusing to our service owners and complicating things
> > > for our containers team.
> >
> > I do understand how that is confusing and inconsistent as well. Hugetlb
> > is bringing throughout its existence I am afraid.
> >
> > As noted in other reply though I am not sure hugeltb pool can be
> > reasonably incorporated with a sane semantic. Neither of the regular
> > allocation nor the hugetlb reservation/actual use can fallback to the
> > pool of the other. This makes them 2 different things each hitting their
> > own failure cases that require a dedicated handling.
> >
> > Just from top of my head these are cases I do not see easy way out from:
> > - hugetlb charge failure has two failure modes - pool empty
> > or memcg limit reached. The former is not recoverable and
> > should fail without any further intervention the latter might
> > benefit from reclaiming.
> > - !hugetlb memory charge failure cannot consider any hugetlb
> > pages - they are implicit memory.min protection so it is
> > impossible to manage reclaim protection without having a
> > knowledge of the hugetlb use.
> > - there is no way to control the hugetlb pool distribution by
> > memcg limits. How do we distinguish reservations from actual
> > use?
> > - pre-allocated pool is consuming memory without any actual
> > owner until it is actually used and even that has two stages
> > (reserved and really used). This makes it really hard to
> > manage memory as whole when there is a considerable amount of
> > hugetlb memore preallocated.
>
> It's important to distinguish hugetlb access policy from memory use
> policy. This patch isn't about hugetlb access, it's about general
> memory use.
>
> Hugetlb access policy is a separate domain with separate
> answers. Preallocating is a privileged operation, for access control
> there is the hugetlb cgroup controller etc.
>
> What's missing is that once you get past the access restrictions and
> legitimately get your hands on huge pages, that memory use gets
> reflected in memory.current and exerts pressure on *other* memory
> inside the group, such as anon or optimistic cache allocations.
>
> Note that hugetlb *can* be allocated on demand. It's unexpected that
> when an application optimistically allocates a couple of 2M hugetlb
> pages those aren't reflected in its memory.current. The same is true
> for hugetlb_cma. If the gigantic pages aren't currently allocated to a
> cgroup, that CMA memory can be used for movable memory elsewhere.
>
> The points you and Frank raise are reasons and scenarios where
> additional hugetlb access control is necessary - preallocation,
> limited availability of 1G pages etc. But they're not reasons against
> charging faulted in hugetlb to the memcg *as well*.
>
> My point is we need both. One to manage competition over hugetlb,
> because it has unique limitations. The other to manage competition
> over host memory which hugetlb is a part of.
>
> Here is a usecase from our fleet.
>
> Imagine a configuration with two 32G containers. The machine is booted
> with hugetlb_cma=6G, and each container may or may not use up to 3
> gigantic page, depending on the workload within it. The rest is anon,
> cache, slab, etc. You set the hugetlb cgroup limit of each cgroup to
> 3G to enforce hugetlb fairness. But how do you configure memory.max to
> keep *overall* consumption, including anon, cache, slab etc. fair?
>
> If used hugetlb is charged, you can just set memory.max=32G regardless
> of the workload inside.
>
> Without it, you'd have to constantly poll hugetlb usage and readjust
> memory.max!
Yep, and I'd like to add that this could and have caused issues in
our production system, when there is a delay in memory limits
(low or max) correction. The userspace agent in charge of correcting
these only runs periodically, and within consecutive runs the system
could be in an over/underprotected state. An instantaneous charge
towards the memory controller would close this gap.
I think we need both a HugeTLB controller and memory controller
accounting.
Powered by blists - more mailing lists