Message-ID: <aBsMB3oes6Kn0TEl@tiehlicka>
Date: Wed, 7 May 2025 09:30:15 +0200
From: Michal Hocko <mhocko@...e.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
Johannes Weiner <hannes@...xchg.org>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Muchun Song <muchun.song@...ux.dev>, linux-mm@...ck.org,
cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
Meta kernel team <kernel-team@...a.com>,
Greg Thelen <gthelen@...gle.com>,
Michal Koutný <mkoutny@...e.com>,
Tejun Heo <tj@...nel.org>, Yosry Ahmed <yosry.ahmed@...ux.dev>,
Christian Brauner <brauner@...nel.org>
Subject: Re: [PATCH v3] memcg: introduce non-blocking limit setting option
On Tue 06-05-25 16:28:33, Shakeel Butt wrote:
> Setting the max and high limits can trigger synchronous reclaim and/or
> oom-kill if the usage is higher than the given limit. This behavior is
> fine for newly created cgroups but it can cause issues for the node
> controller while setting limits for existing cgroups.
>
> In our production multi-tenant and overcommitted environment, we are
> seeing priority inversion when the node controller dynamically adjusts the
> limits of running jobs of different priorities. Based on the system
> situation, the node controller may reduce the limits of lower priority
> jobs and increase the limits of higher priority jobs. However, we are
> seeing the node controller get stuck for long periods of time while
> reclaiming from lower priority jobs as it sets their limits, and it also
> spends a lot of its own CPU doing so.
>
> One of the workarounds we are trying is to fork a new process which sets
> the limit of the lower priority job along with setting an alarm to get
> itself killed if it gets stuck in the reclaim for the lower priority job.
> However, we are finding it very unreliable and costly. Either we need a
> good enough time buffer for the alarm to be delivered after setting the
> limit and potentially spend a lot of CPU in the reclaim, or we are
> unreliable in setting the limit with much shorter but cheaper (less
> reclaim) alarms.
>
> Let's introduce a new limit setting option which does not trigger reclaim
> and/or oom-kill and lets the processes in the target cgroup trigger
> reclaim and/or throttling and/or oom-kill in their next charge request.
> This will make the node controller in a multi-tenant overcommitted
> environment much more reliable.
I would say this is a rather creative way to go about kernel interfaces. I
am not aware of any other precedent like that, but I recognize this is
likely better than a new set of non-blocking interfaces.
It is a bit unfortunate that we haven't explicitly excluded O_NONBLOCK
previously, so we cannot really add this functionality correctly without
risking breaking existing users. Sure, it hasn't made sense to write
to these files with O_NONBLOCK until now, so there is hope that nobody does.
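For reference, a minimal userspace sketch of what such a write could look
like from a node controller. This is only an illustration under
assumptions: the helper name and the cgroup path in the usage note are
hypothetical, and the non-blocking behavior depends on a kernel carrying
this patch. As noted above, O_NONBLOCK was not rejected on these files
before, so an older kernel would presumably accept the same write and
still reclaim synchronously.

```python
import os


def set_limit_nonblocking(limit_file: str, value: str) -> None:
    """Write a new limit to a cgroup v2 file opened with O_NONBLOCK.

    Hypothetical helper. On a kernel with the proposed patch, the write
    is expected to update the limit without reclaiming or oom-killing in
    the writer's context; enforcement is left to the next charge request
    of the tasks inside the cgroup.
    """
    # O_NONBLOCK on the open file is what requests the non-blocking update.
    fd = os.open(limit_file, os.O_WRONLY | os.O_NONBLOCK)
    try:
        os.write(fd, value.encode())
    finally:
        os.close(fd)
```

Usage would be along the lines of
set_limit_nonblocking("/sys/fs/cgroup/low-prio-job/memory.high", "1073741824"),
where the cgroup name is made up. The caller has to keep in mind that the
usage may stay above the new limit for as long as the group is idle.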
> Explanation from Johannes on side-effects of O_NONBLOCK limit change:
> It's usually the allocating tasks inside the group bearing the cost of
> limit enforcement and reclaim. This allows a (privileged) updater from
> outside the group to keep that cost in there - instead of having to
> help, from a context that doesn't necessarily make sense.
>
> I suppose the tradeoff with that - and the reason why this was doing
> sync reclaim in the first place - is that, if the group is idle and
> not trying to allocate more, it can take indefinitely long for the new
> limit to actually be met.
>
> It should be okay in most scenarios in practice. As the capacity is
> reallocated from group A to B, B will exert pressure on A once it
> tries to claim it and thereby shrink it down. If A is idle, that
> shouldn't be hard. If A is running, it's likely to fault/allocate
> soon-ish and then join the effort.
>
> It does leave a (malicious) corner case where A is just busy-hitting
> its memory to interfere with the clawback. This is comparable to
> reclaiming memory.low overage from the outside, though, which is an
> acceptable risk. Users of O_NONBLOCK just need to be aware.
Good and useful clarification. Thx!
> Signed-off-by: Shakeel Butt <shakeel.butt@...ux.dev>
> Acked-by: Roman Gushchin <roman.gushchin@...ux.dev>
> Acked-by: Johannes Weiner <hannes@...xchg.org>
> Cc: Greg Thelen <gthelen@...gle.com>
> Cc: Michal Hocko <mhocko@...nel.org>
> Cc: Michal Koutný <mkoutny@...e.com>
> Cc: Muchun Song <muchun.song@...ux.dev>
> Cc: Tejun Heo <tj@...nel.org>
> Cc: Yosry Ahmed <yosry.ahmed@...ux.dev>
> Cc: Christian Brauner <brauner@...nel.org>
> Cc: Andrew Morton <akpm@...ux-foundation.org>
Acked-by: Michal Hocko <mhocko@...e.com>
Thanks!
--
Michal Hocko
SUSE Labs