Message-ID: <Z1cyExTkg3OoaJy5@tiehlicka>
Date: Mon, 9 Dec 2024 19:08:19 +0100
From: Michal Hocko <mhocko@...e.com>
To: Rik van Riel <riel@...riel.com>
Cc: Johannes Weiner <hannes@...xchg.org>, kernel-team@...a.com,
	linux-kernel@...r.kernel.org, linux-mm@...ck.org,
	Roman Gushchin <roman.gushchin@...ux.dev>,
	Shakeel Butt <shakeel.butt@...ux.dev>,
	Muchun Song <muchun.song@...ux.dev>,
	Andrew Morton <akpm@...ux-foundation.org>, cgroups@...r.kernel.org
Subject: Re: [PATCH] mm: allow exiting processes to exceed the memory.max
 limit

On Mon 09-12-24 12:42:33, Rik van Riel wrote:
> It is possible for programs to get stuck in exit, when their
> memcg is at or above the memory.max limit, and things like
> the do_futex() call from mm_release() need to page memory in.
> 
> This can hang forever, but it really doesn't have to.

Are you sure this is really happening?
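
If it is, the path being described would be something like the below
(paraphrasing mm_release() in kernel/fork.c from memory, so take the
exact shape with a grain of salt):

	/* called via exit_mm() on the way out of do_exit() */
	if (tsk->clear_child_tid) {
		if (atomic_read(&mm->mm_users) > 1) {
			/*
			 * Both of these may fault the CLONE_CHILD_CLEARTID
			 * page back in, which charges the (already
			 * at-limit) memcg and can block in reclaim.
			 */
			put_user(0, tsk->clear_child_tid);
			do_futex(tsk->clear_child_tid, FUTEX_WAKE,
				 1, NULL, NULL, 0, 0);
		}
		tsk->clear_child_tid = NULL;
	}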

> 
> The amount of memory that the exit path will page into memory
> should be relatively small, and letting exit proceed faster
> will free up memory faster.
> 
> Allow PF_EXITING tasks to bypass the cgroup memory.max limit
> the same way PF_MEMALLOC already does.
> 
> Signed-off-by: Rik van Riel <riel@...riel.com>
> ---
>  mm/memcontrol.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 7b3503d12aaf..d1abef1138ff 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2218,11 +2218,12 @@ int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  
>  	/*
>  	 * Prevent unbounded recursion when reclaim operations need to
> -	 * allocate memory. This might exceed the limits temporarily,
> -	 * but we prefer facilitating memory reclaim and getting back
> -	 * under the limit over triggering OOM kills in these cases.
> +	 * allocate memory, or the process is exiting. This might exceed
> +	 * the limits temporarily, but we prefer facilitating memory reclaim
> +	 * and getting back under the limit over triggering OOM kills in
> +	 * these cases.
>  	 */
> -	if (unlikely(current->flags & PF_MEMALLOC))
> +	if (unlikely(current->flags & (PF_MEMALLOC | PF_EXITING)))
>  		goto force;
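
For everybody else's context (paraphrasing the bottom of
try_charge_memcg() rather than quoting it exactly): the force label
charges the page counters past the configured limit instead of failing
the charge

	force:
		/* charge unconditionally, even over memory.max */
		page_counter_charge(&memcg->memory, nr_pages);
		if (do_memsw_account())
			page_counter_charge(&memcg->memsw, nr_pages);
		return 0;

so with this change every single charge from an exiting task succeeds
immediately, no matter how far over the limit the memcg already is.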

We already have the task_is_dying() bail-out. Why is that insufficient?
It currently kicks in only once an OOM situation has been triggered,
while your patch triggers the bypass much earlier. We used to do the
same in the past, but that was changed by a4ebf1b6ca1e ("memcg: prohibit
unconditional exceeding the limit of dying tasks"). I believe the
vmalloc situation has changed since then, but I suspect the fundamental
problem remains: dying tasks can still allocate a lot of memory.
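
For reference, the existing check looks like this (again from memory,
but it should be close to what is in mm/memcontrol.c today); note that
it already covers PF_EXITING, it just fails the charge instead of
forcing it through, and only after a memcg OOM pass:

	static bool task_is_dying(void)
	{
		return tsk_is_oom_victim(current) ||
		       fatal_signal_pending(current) ||
		       (current->flags & PF_EXITING);
	}

	/* in try_charge_memcg(), after reclaim retries are exhausted */
	if (passed_oom && task_is_dying())
		goto nomem;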

There is still this part of that commit's changelog
:     It has been observed that it is not really hard to trigger these
:     bypasses and cause global OOM situation.
which really needs to be re-evaluated.
-- 
Michal Hocko
SUSE Labs
