Message-ID: <ZRrI90KcRBwVZn/r@dhcp22.suse.cz>
Date:   Mon, 2 Oct 2023 15:43:19 +0200
From:   Michal Hocko <mhocko@...e.com>
To:     Nhat Pham <nphamcs@...il.com>
Cc:     akpm@...ux-foundation.org, riel@...riel.com, hannes@...xchg.org,
        roman.gushchin@...ux.dev, shakeelb@...gle.com,
        muchun.song@...ux.dev, tj@...nel.org, lizefan.x@...edance.com,
        shuah@...nel.org, mike.kravetz@...cle.com, yosryahmed@...gle.com,
        linux-mm@...ck.org, kernel-team@...a.com,
        linux-kernel@...r.kernel.org, cgroups@...r.kernel.org
Subject: Re: [PATCH v2 1/2] hugetlb: memcg: account hugetlb-backed memory in
 memory controller

On Wed 27-09-23 17:57:22, Nhat Pham wrote:
> Currently, hugetlb memory usage is not accounted for in the memory
> controller, which could lead to memory overprotection for cgroups with
> hugetlb-backed memory. This has been observed in our production system.
> 
> This patch rectifies this issue by charging the memcg when the hugetlb
> folio is allocated, and uncharging when the folio is freed (analogous to
> the hugetlb controller).

This changelog is missing a lot of information, both about the use case
(we do not want to fish that out from the archives in the future) and
about the actual implementation and the reasoning behind it.

AFAICS you have decided to charge on hugetlb use rather than on hugetlb
allocation to the pool. I suspect the underlying reasoning is that pool
pages do not belong to anybody. This is a deliberate decision and it
should be documented as such.

It is also very important to describe subtle behavioral properties that
might be rather unintuitive to users. Most notably:
- there is no hugetlb pool management involved in the memcg
  controller. One has to use the hugetlb controller for that purpose.
  Also, the pre-allocated pool as such doesn't belong to anybody, so the
  memcg host overcommit management has to consider it when configuring
  hard limits.
- memcg limit reclaim doesn't assist hugetlb page allocation when
  hugetlb overcommit is configured (i.e. pages are not consumed from the
  pool), which means that the page allocation might disrupt workloads
  from other memcgs.
- failure to charge a hugetlb page results in SIGBUS rather
  than invoking the memcg OOM killer. That can be the case even if the
  hugetlb pool still has pages available and there is
  reclaimable memory in the memcg (see the sketch after this list).
- hugetlb pages contribute to memory reclaim protection
  implicitly. This means that tuning of the low/min limits has to
  consider hugetlb memory as well.
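
To make the SIGBUS point above concrete, here is a minimal userspace
sketch (illustration only, not part of this patch) of what a workload
would have to do to survive a failed hugetlb fault; the mapping size and
error handling are just an example:

	#define _GNU_SOURCE
	#include <signal.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	static void sigbus_handler(int sig)
	{
		/* The hugetlb fault could not be satisfied (e.g. the memcg
		 * charge failed); bail out instead of crashing silently. */
		static const char msg[] = "hugetlb fault failed (SIGBUS)\n";

		write(2, msg, sizeof(msg) - 1);
		_exit(1);
	}

	int main(void)
	{
		size_t len = 2UL << 20;	/* one 2MB huge page (arch dependent) */
		char *p;

		signal(SIGBUS, sigbus_handler);

		p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* The huge page is only allocated (and, with this patch,
		 * charged) at first touch; this is where SIGBUS would be
		 * delivered on a charge failure. */
		memset(p, 0, len);

		munmap(p, len);
		return 0;
	}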

I suspect there is more to it than the above. To be completely honest, I
am still not convinced this is a good idea.

I do recognize that this might work in very limited environments, but
hugetlb management is quite challenging on its own, and this just adds
another layer of complexity which is really hard to see through without
an intimate understanding of both memcg and hugetlb. The reason that
hugetlb has been living outside of the core MM (and memcg) is not just
because we like it that way. And yes, I do fully understand that users
shouldn't really care about that, because this is just memory to them.

We should also consider a global control for this functionality. I am
especially worried about setups where a mixed bag of workloads
(containers) is executed. While some of them will be ready for the new
accounting mode, many will live in their own world without ever being
modified. How do we deal with that situation?

All that being said, I am not going to ack nor nack this, but I really
do prefer to be much more explicit about the motivation and the current
implementation specifics, so that we can forward users to something
they can digest.

> Signed-off-by: Nhat Pham <nphamcs@...il.com>
[...]

A minor implementation detail below. I couldn't spot anything obviously
broken with the rest of the hugetlb-specific code. restore_reserve_on_memcg_failure
is rather clumsy and potentially error prone, but I will leave that
to Mike as he is much more familiar with that behavior than I am.

> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index de220e3ff8be..ff88ea4df11a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
[...]
> @@ -3119,6 +3121,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  			hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
>  					pages_per_huge_page(h), folio);
>  	}
> +
> +	/* undo allocation if memory controller disallows it. */
> +	if (mem_cgroup_hugetlb_charge_folio(folio, GFP_KERNEL)) {

htlb_alloc_mask(h) rather than GFP_KERNEL, ideally with
__GFP_RETRY_MAYFAIL, which is the default hugetlb allocation policy.
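
I.e. something along these lines (untested sketch, assuming
mem_cgroup_hugetlb_charge_folio() keeps the gfp_t second argument this
patch introduces):

	/* undo allocation if memory controller disallows it. */
	if (mem_cgroup_hugetlb_charge_folio(folio,
				htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL)) {
		if (restore_reserve_on_memcg_failure)
			restore_reserve_on_error(h, vma, addr, folio);
		folio_put(folio);
		return ERR_PTR(-ENOMEM);
	}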

> +		if (restore_reserve_on_memcg_failure)
> +			restore_reserve_on_error(h, vma, addr, folio);
> +		folio_put(folio);
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
>  	return folio;
>  
>  out_uncharge_cgroup:

-- 
Michal Hocko
SUSE Labs
