Message-ID: <CAHH2K0as=b+EhxG=8yS9T9oP40U2dEtU0NA=wCJSb6ii9_DGaw@mail.gmail.com>
Date: Fri, 18 Apr 2025 13:18:53 -0700
From: Greg Thelen <gthelen@...gle.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...nel.org>, Roman Gushchin <roman.gushchin@...ux.dev>,
Muchun Song <muchun.song@...ux.dev>, Yosry Ahmed <yosry.ahmed@...ux.dev>, Tejun Heo <tj@...nel.org>,
Michal Koutný <mkoutny@...e.com>, linux-mm@...ck.org,
cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
Meta kernel team <kernel-team@...a.com>
Subject: Re: [PATCH] memcg: introduce non-blocking limit setting interfaces
On Fri, Apr 18, 2025 at 1:00 PM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
>
> Setting the max and high limits can trigger synchronous reclaim and/or
> oom-kill if the usage is higher than the given limit. This behavior is
> fine for newly created cgroups but it can cause issues for the node
> controller while setting limits for existing cgroups.
>
> In our production multi-tenant and overcommitted environment, we are
> seeing priority inversion when the node controller dynamically adjusts
> the limits of running jobs of different priorities. Based on the system
> situation, the node controller may reduce the limits of lower priority
> jobs and increase the limits of higher priority jobs. However, we are
> seeing the node controller get stuck for long periods of time while
> reclaiming from lower priority jobs when setting their limits, and it
> also burns a lot of its own CPU doing so.
>
> One workaround we are trying is to fork a new process which sets the
> limit of the lower priority job and also arms an alarm to get itself
> killed if it gets stuck in reclaim for that job. However, we are
> finding this very unreliable and costly: either we allow a large
> enough time buffer for the alarm, and potentially spend a lot of CPU
> in reclaim, or we use much shorter, cheaper (less reclaim) alarms and
> become unreliable at actually setting the limit.
>
> Let's introduce new limit setting interfaces which do not trigger
> reclaim and/or oom-kill, and instead let the processes in the target
> cgroup trigger reclaim and/or throttling and/or oom-kill on their
> next charge request. This will make the node controller in
> multi-tenant overcommitted environments much more reliable.
Would opening the typical synchronous files (e.g. memory.max) with
O_NONBLOCK be a more general way to tell the kernel that the user
space controller doesn't want to wait? It's not quite consistent with
the traditional use of O_NONBLOCK, which would make operations fully
succeed or fail rather than altering the operation being requested.
But O_NONBLOCK would allow for non-blocking reclaim semantics, if
that's fast enough for your controller.
> Signed-off-by: Shakeel Butt <shakeel.butt@...ux.dev>
> ---
> Documentation/admin-guide/cgroup-v2.rst | 16 +++++++++
> mm/memcontrol.c | 46 +++++++++++++++++++++++++
> 2 files changed, 62 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 8fb14ffab7d1..7b459c821afa 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1299,6 +1299,14 @@ PAGE_SIZE multiple when read back.
> monitors the limited cgroup to alleviate heavy reclaim
> pressure.
>
> +  memory.high.nonblock
> +       This is the same limit as memory.high but has different
> +       behaviour for the writer of this interface. The program
> +       setting the limit will not trigger reclaim synchronously
> +       if the usage is higher than the limit, and instead lets
> +       the processes in the target cgroup trigger reclaim and/or
> +       get throttled on hitting the high limit.
> +
> memory.max
> A read-write single value file which exists on non-root
> cgroups. The default is "max".
> @@ -1316,6 +1324,14 @@ PAGE_SIZE multiple when read back.
> Caller could retry them differently, return into userspace
> as -ENOMEM or silently ignore in cases like disk readahead.
>
> +  memory.max.nonblock
> +       This is the same limit as memory.max but has different
> +       behaviour for the writer of this interface. The program
> +       setting the limit will not trigger reclaim and/or oom-kill
> +       synchronously if the usage is higher than the limit, and
> +       instead lets the processes in the target cgroup trigger
> +       reclaim and/or get oom-killed on hitting their max limit.
> +
> memory.reclaim
> A write-only nested-keyed file which exists for all cgroups.
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5e2ea8b8a898..6ad1464b621a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4279,6 +4279,23 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> return nbytes;
> }
>
> +static ssize_t memory_high_nonblock_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes, loff_t off)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> + unsigned long high;
> + int err;
> +
> + buf = strstrip(buf);
> + err = page_counter_memparse(buf, "max", &high);
> + if (err)
> + return err;
> +
> + page_counter_set_high(&memcg->memory, high);
> + memcg_wb_domain_size_changed(memcg);
> + return nbytes;
> +}
> +
> static int memory_max_show(struct seq_file *m, void *v)
> {
> return seq_puts_memcg_tunable(m,
> @@ -4333,6 +4350,23 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> return nbytes;
> }
>
> +static ssize_t memory_max_nonblock_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes, loff_t off)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> + unsigned long max;
> + int err;
> +
> + buf = strstrip(buf);
> + err = page_counter_memparse(buf, "max", &max);
> + if (err)
> + return err;
> +
> + xchg(&memcg->memory.max, max);
> + memcg_wb_domain_size_changed(memcg);
> + return nbytes;
> +}
> +
> /*
> * Note: don't forget to update the 'samples/cgroup/memcg_event_listener'
> * if any new events become available.
> @@ -4557,12 +4591,24 @@ static struct cftype memory_files[] = {
> .seq_show = memory_high_show,
> .write = memory_high_write,
> },
> + {
> + .name = "high.nonblock",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = memory_high_show,
> + .write = memory_high_nonblock_write,
> + },
> {
> .name = "max",
> .flags = CFTYPE_NOT_ON_ROOT,
> .seq_show = memory_max_show,
> .write = memory_max_write,
> },
> + {
> + .name = "max.nonblock",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = memory_max_show,
> + .write = memory_max_nonblock_write,
> + },
> {
> .name = "events",
> .flags = CFTYPE_NOT_ON_ROOT,
> --
> 2.47.1
>
>