Message-ID: <CAHH2K0as=b+EhxG=8yS9T9oP40U2dEtU0NA=wCJSb6ii9_DGaw@mail.gmail.com>
Date: Fri, 18 Apr 2025 13:18:53 -0700
From: Greg Thelen <gthelen@...gle.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Andrew Morton <akpm@...ux-foundation.org>, Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...nel.org>, Roman Gushchin <roman.gushchin@...ux.dev>,
Muchun Song <muchun.song@...ux.dev>, Yosry Ahmed <yosry.ahmed@...ux.dev>, Tejun Heo <tj@...nel.org>,
Michal Koutný <mkoutny@...e.com>, linux-mm@...ck.org,
cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
Meta kernel team <kernel-team@...a.com>
Subject: Re: [PATCH] memcg: introduce non-blocking limit setting interfaces
On Fri, Apr 18, 2025 at 1:00 PM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
>
> Setting the max and high limits can trigger synchronous reclaim and/or
> oom-kill if the usage is higher than the given limit. This behavior is
> fine for newly created cgroups but it can cause issues for the node
> controller while setting limits for existing cgroups.
>
> In our production multi-tenant and overcommitted environment, we are
> seeing priority inversion when the node controller dynamically adjusts
> the limits of running jobs of different priorities. Based on the system
> situation, the node controller may reduce the limits of lower priority
> jobs and increase the limits of higher priority jobs. However, we are
> seeing the node controller get stuck for long periods of time while
> reclaiming from lower priority jobs when setting their limits, and it
> also burns a lot of its own CPU doing so.
>
> One workaround we are trying is to fork a new process which sets the
> limit of the lower priority job and also arms an alarm to get itself
> killed if it gets stuck in reclaim for that job. However, we are
> finding this very unreliable and costly: either we allow a large
> enough time buffer for the alarm, and potentially spend a lot of CPU
> in reclaim, or we use much shorter, cheaper (less reclaim) alarms and
> become unreliable at actually setting the limit.
>
> Let's introduce new limit setting interfaces which do not trigger
> reclaim and/or oom-kill, and instead let the processes in the target
> cgroup trigger reclaim and/or throttling and/or oom-kill on their
> next charge request. This will make the node controller in
> multi-tenant overcommitted environments much more reliable.
Would opening the typical synchronous files (e.g. memory.max) with
O_NONBLOCK be a more general way to tell the kernel that the user
space controller doesn't want to wait? It's not quite consistent with
the traditional use of O_NONBLOCK, which would make operations fully
succeed or fail rather than altering the operation being requested.
But O_NONBLOCK would allow for non-blocking reclaim semantics, if
that's fast enough for your controller.
> Signed-off-by: Shakeel Butt <shakeel.butt@...ux.dev>
> ---
> Documentation/admin-guide/cgroup-v2.rst | 16 +++++++++
> mm/memcontrol.c | 46 +++++++++++++++++++++++++
> 2 files changed, 62 insertions(+)
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 8fb14ffab7d1..7b459c821afa 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1299,6 +1299,14 @@ PAGE_SIZE multiple when read back.
> monitors the limited cgroup to alleviate heavy reclaim
> pressure.
>
> +  memory.high.nonblock
> +       This is the same limit as memory.high but has different
> +       behaviour for the writer of this interface. The program
> +       setting the limit will not trigger reclaim synchronously
> +       if the usage is higher than the limit, and instead lets
> +       the processes in the target cgroup trigger reclaim and/or
> +       get throttled on hitting the high limit.
> +
> memory.max
> A read-write single value file which exists on non-root
> cgroups. The default is "max".
> @@ -1316,6 +1324,14 @@ PAGE_SIZE multiple when read back.
> Caller could retry them differently, return into userspace
> as -ENOMEM or silently ignore in cases like disk readahead.
>
> +  memory.max.nonblock
> +       This is the same limit as memory.max but has different
> +       behaviour for the writer of this interface. The program
> +       setting the limit will not trigger reclaim and/or oom-kill
> +       synchronously if the usage is higher than the limit, and
> +       instead lets the processes in the target cgroup trigger
> +       reclaim and/or get oom-killed on hitting their max limit.
> +
> memory.reclaim
> A write-only nested-keyed file which exists for all cgroups.
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5e2ea8b8a898..6ad1464b621a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -4279,6 +4279,23 @@ static ssize_t memory_high_write(struct kernfs_open_file *of,
> return nbytes;
> }
>
> +static ssize_t memory_high_nonblock_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes, loff_t off)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> + unsigned long high;
> + int err;
> +
> + buf = strstrip(buf);
> + err = page_counter_memparse(buf, "max", &high);
> + if (err)
> + return err;
> +
> + page_counter_set_high(&memcg->memory, high);
> + memcg_wb_domain_size_changed(memcg);
> + return nbytes;
> +}
> +
> static int memory_max_show(struct seq_file *m, void *v)
> {
> return seq_puts_memcg_tunable(m,
> @@ -4333,6 +4350,23 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
> return nbytes;
> }
>
> +static ssize_t memory_max_nonblock_write(struct kernfs_open_file *of,
> + char *buf, size_t nbytes, loff_t off)
> +{
> + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
> + unsigned long max;
> + int err;
> +
> + buf = strstrip(buf);
> + err = page_counter_memparse(buf, "max", &max);
> + if (err)
> + return err;
> +
> + xchg(&memcg->memory.max, max);
> + memcg_wb_domain_size_changed(memcg);
> + return nbytes;
> +}
> +
> /*
> * Note: don't forget to update the 'samples/cgroup/memcg_event_listener'
> * if any new events become available.
> @@ -4557,12 +4591,24 @@ static struct cftype memory_files[] = {
> .seq_show = memory_high_show,
> .write = memory_high_write,
> },
> + {
> + .name = "high.nonblock",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = memory_high_show,
> + .write = memory_high_nonblock_write,
> + },
> {
> .name = "max",
> .flags = CFTYPE_NOT_ON_ROOT,
> .seq_show = memory_max_show,
> .write = memory_max_write,
> },
> + {
> + .name = "max.nonblock",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .seq_show = memory_max_show,
> + .write = memory_max_nonblock_write,
> + },
> {
> .name = "events",
> .flags = CFTYPE_NOT_ON_ROOT,
> --
> 2.47.1
>
>