linux-kernel - Re: [PATCH v3] memcg: introduce non-blocking limit setting option


Message-ID: <aBsMB3oes6Kn0TEl@tiehlicka>
Date: Wed, 7 May 2025 09:30:15 +0200
From: Michal Hocko <mhocko@...e.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
	Johannes Weiner <hannes@...xchg.org>,
	Roman Gushchin <roman.gushchin@...ux.dev>,
	Muchun Song <muchun.song@...ux.dev>, linux-mm@...ck.org,
	cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
	Meta kernel team <kernel-team@...a.com>,
	Greg Thelen <gthelen@...gle.com>,
	Michal Koutný <mkoutny@...e.com>,
	Tejun Heo <tj@...nel.org>, Yosry Ahmed <yosry.ahmed@...ux.dev>,
	Christian Brauner <brauner@...nel.org>
Subject: Re: [PATCH v3] memcg: introduce non-blocking limit setting option

On Tue 06-05-25 16:28:33, Shakeel Butt wrote:
> Setting the max and high limits can trigger synchronous reclaim and/or
> oom-kill if the usage is higher than the given limit.  This behavior is
> fine for newly created cgroups but it can cause issues for the node
> controller while setting limits for existing cgroups.
> 
> In our production multi-tenant and overcommitted environment, we are
> seeing priority inversion when the node controller dynamically adjusts the
> limits of running jobs of different priorities.  Based on the system
> situation, the node controller may reduce the limits of lower priority
> jobs and increase the limits of higher priority jobs.  However, we are
> seeing the node controller getting stuck for long periods of time
> reclaiming from lower priority jobs while setting their limits, and
> also spending a lot of its own CPU.
> 
> One of the workarounds we are trying is to fork a new process which sets
> the limit of the lower priority job along with setting an alarm to get
> itself killed if it gets stuck in the reclaim for the lower priority job.
> However, we are finding it very unreliable and costly.  Either we need a
> good enough time buffer for the alarm to be delivered after setting the
> limit and potentially spend a lot of CPU in the reclaim, or be unreliable
> in setting the limit for much shorter but cheaper (less reclaim) alarms.
> 
> Let's introduce a new limit setting option which does not trigger reclaim
> and/or oom-kill and lets the processes in the target cgroup trigger
> reclaim and/or throttling and/or oom-kill in their next charge request.
> This will make the node controller in multi-tenant overcommitted
> environments much more reliable.
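
For readers following along, the fork-plus-alarm workaround described in
the quoted changelog can be sketched roughly as below. This is only an
illustration under stated assumptions, not the node controller's actual
code: the helper name set_limit_with_alarm, its parameters, and the
choice of Python are all hypothetical. The child arms an alarm and does
the ordinary (blocking) write; if it is still stuck when the alarm
fires, the default SIGALRM action kills it.

```python
import os
import signal


def set_limit_with_alarm(path: str, value: str, timeout_s: int) -> bool:
    """Hypothetical helper: write `value` to `path` from a forked child
    that is killed by SIGALRM if it is still running after `timeout_s`
    seconds. Returns True only if the child finished the write."""
    pid = os.fork()
    if pid == 0:
        # Child: make sure SIGALRM has its default (terminate) action,
        # then arm the alarm before the potentially blocking write.
        signal.signal(signal.SIGALRM, signal.SIG_DFL)
        signal.alarm(timeout_s)
        status = 1
        try:
            fd = os.open(path, os.O_WRONLY)
            os.write(fd, value.encode())
            os.close(fd)
            status = 0
        except OSError:
            pass
        finally:
            # Never return into the parent's code path from the child.
            os._exit(status)
    # Parent: wait for the child and report whether it exited cleanly.
    _, wstatus = os.waitpid(pid, 0)
    return os.WIFEXITED(wstatus) and os.WEXITSTATUS(wstatus) == 0
```

A False return does not say how far the write got before the alarm
fired, which is the reliability problem the changelog describes: a long
timeout can burn CPU in reclaim, a short one may kill the child early.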

I would say this is a somewhat creative way to go about kernel
interfaces. I am not aware of any other precedent like that, but I
recognize this is likely better than a new set of non-blocking
interfaces.

It is a bit unfortunate that we haven't explicitly excluded O_NONBLOCK
previously, so we cannot really add this functionality without risking
breaking existing users. Sure, it hasn't made sense to write to these
files with O_NONBLOCK until now, so there is hope that nobody does.
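
To make the userspace side concrete, here is a minimal sketch of how a
controller might use the proposed behavior: open the limit file with
O_NONBLOCK and write the new value. The helper name
set_limit_nonblocking is hypothetical, and the sketch assumes a kernel
carrying this patch; on a kernel without it the flag would presumably
make no difference and the write could still reclaim synchronously.

```python
import os


def set_limit_nonblocking(path: str, value: str) -> int:
    """Hypothetical helper: write `value` to a cgroup limit file
    (e.g. memory.max or memory.high) through a descriptor opened with
    O_NONBLOCK. With the patch under discussion, the writer is not
    expected to perform reclaim or oom-kill itself; tasks inside the
    cgroup enforce the limit on their next charge.
    Returns the number of bytes written."""
    fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
    try:
        return os.write(fd, value.encode())
    finally:
        os.close(fd)
```

A caller would pass the limit file of the target cgroup, for instance a
path such as /sys/fs/cgroup/<job>/memory.max (the cgroup name is a
placeholder), and should keep in mind that usage may stay above the new
limit for a while if the group is idle.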

> Explanation from Johannes on side-effects of O_NONBLOCK limit change:
>   It's usually the allocating tasks inside the group bearing the cost of
>   limit enforcement and reclaim. This allows a (privileged) updater from
>   outside the group to keep that cost in there - instead of having to
>   help, from a context that doesn't necessarily make sense.
> 
>   I suppose the tradeoff with that - and the reason why this was doing
>   sync reclaim in the first place - is that, if the group is idle and
>   not trying to allocate more, it can take indefinitely for the new
>   limit to actually be met.
> 
>   It should be okay in most scenarios in practice. As the capacity is
>   reallocated from group A to B, B will exert pressure on A once it
>   tries to claim it and thereby shrink it down. If A is idle, that
>   shouldn't be hard. If A is running, it's likely to fault/allocate
>   soon-ish and then join the effort.
> 
>   It does leave a (malicious) corner case where A is just busy-hitting
>   its memory to interfere with the clawback. This is comparable to
>   reclaiming memory.low overage from the outside, though, which is an
>   acceptable risk. Users of O_NONBLOCK just need to be aware.

Good and useful clarification. Thx!

> Signed-off-by: Shakeel Butt <shakeel.butt@...ux.dev>
> Acked-by: Roman Gushchin <roman.gushchin@...ux.dev>
> Acked-by: Johannes Weiner <hannes@...xchg.org>
> Cc: Greg Thelen <gthelen@...gle.com>
> Cc: Michal Hocko <mhocko@...nel.org>
> Cc: Michal Koutný <mkoutny@...e.com>
> Cc: Muchun Song <muchun.song@...ux.dev>
> Cc: Tejun Heo <tj@...nel.org>
> Cc: Yosry Ahmed <yosry.ahmed@...ux.dev>
> Cc: Christian Brauner <brauner@...nel.org>
> Cc: Andrew Morton <akpm@...ux-foundation.org>

Acked-by: Michal Hocko <mhocko@...e.com>

Thanks!

-- 
Michal Hocko
SUSE Labs