linux-kernel - Re: [PATCH] memcg: add hierarchical effective limits for v2

Message-ID: <Z6rYReNBVNyYq-Sg@google.com>
Date: Tue, 11 Feb 2025 04:55:33 +0000
From: Roman Gushchin <roman.gushchin@...ux.dev>
To: Johannes Weiner <hannes@...xchg.org>
Cc: Michal Koutný <mkoutny@...e.com>,
	Shakeel Butt <shakeel.butt@...ux.dev>,
	"T.J. Mercier" <tjmercier@...gle.com>, Tejun Heo <tj@...nel.org>,
	Michal Hocko <mhocko@...nel.org>,
	Muchun Song <muchun.song@...ux.dev>, linux-mm@...ck.org,
	cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
	Meta kernel team <kernel-team@...a.com>
Subject: Re: [PATCH] memcg: add hierarchical effective limits for v2

On Mon, Feb 10, 2025 at 05:52:34PM -0500, Johannes Weiner wrote:
> On Mon, Feb 10, 2025 at 05:24:17PM +0100, Michal Koutný wrote:
> > Hello.
> > 
> > On Thu, Feb 06, 2025 at 11:09:05AM -0800, Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> > > Oh I totally forgot about your series. In my use-case, it is not about
> > > dynamically knowing how much they can expand and adjust themselves but
> > > rather knowing statically upfront what resources they have been given.
> > 
> > From the memcg PoV, the effective value doesn't tell how much they were
> > given (because of sharing).
> 
> It's definitely true that if you have an ancestral limit for several
> otherwise unlimited siblings, then interpreting this number as "this
> is how much memory I have available" will be completely misleading.
> 
> I would also say that sharing a limit with several siblings requires a
> certain degree of awareness and cooperation between them. From that
> POV, IMO it would be fine to provide a metric with contextual caveats.
> 
> The problem is, what do we do with canned, unaware, maybe untrusted
> applications? And they don't necessarily know which they are.
> 
> It depends heavily on the judgement of the administrator of any given
> deployment. Some workloads might be completely untrusted and hard
> limited. Another deployment might consider the same workload
> predictable enough that it's configured only with a failsafe max
> limit that is much higher than where the workload is *expected* to
> operate. The allotment might happen altogether with min/low
> protections and no max limit. Or there could be a combination of
> protection slightly below and a limit slightly above the expected
> workload size.
> 
> It seems basically impossible to write portable code against this
> without knowing the intent of the person setting it up.
> 
> But how do we communicate intent down to the container? The two broad
> options are implicitly or explicitly:
> 
> a) Provide a cgroup file that automatically derives intended target
>    size from how min/low/high/max are set up.
> 
>    Right now those can be set up super loosely depending on what the
>    administrator thinks about the application. In order for this to
>    work, we'd likely have to define an idiomatic way of configuring
>    the controller. E.g. if you set max by itself, we assume this is
>    the target size. If you set low, with or without max, then low is
>    the target size. Or if you set both, target is in between.
> 
>    I'm not completely convinced this is workable.

This sounds like memory.available.

It's hard to implement well, especially taking into account things like
NUMA, memory sharing, estimating how much can be reclaimed, etc.

But at the same time there is value in providing such a metric.
There is a clear use case. And it's even harder to implement this
in userspace.
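
For illustration, the idiomatic-configuration rule from option a) could be
sketched roughly as below. This is a hypothetical userspace sketch, not an
existing kernel interface: derive_target_size() and the midpoint_when_both
flag are made-up names, None stands for an unset knob (memory.max left at
"max", or memory.low left at 0), and taking the midpoint is only one possible
reading of "in between". It ignores NUMA, sharing and reclaimability, which
is the hard part.

```python
from typing import Optional


def derive_target_size(
    low: Optional[int],
    max_limit: Optional[int],
    midpoint_when_both: bool = False,
) -> Optional[int]:
    """Guess an intended target size (bytes) from memory.low / memory.max.

    Hypothetical rule set, following the quoted proposal:
      - only max set          -> max is the target
      - low set (max or not)  -> low is the target
      - both set, alternative -> something in between (assumed: midpoint)
    None means the knob is unset; a None result means "no hint available".
    """
    if low is None and max_limit is None:
        return None
    if low is None:
        return max_limit
    if max_limit is None:
        return low
    if midpoint_when_both:
        # Assumption: "in between" is interpreted as the arithmetic midpoint.
        return (low + max_limit) // 2
    return low


if __name__ == "__main__":
    GiB = 1 << 30
    print(derive_target_size(None, 4 * GiB))
    print(derive_target_size(2 * GiB, 4 * GiB))
    print(derive_target_size(2 * GiB, 4 * GiB, midpoint_when_both=True))
```

Even in this toy form, the result depends on which convention the
administrator had in mind, so the convention would have to be defined
somewhere.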

> b) Provide a cgroup file that is freely configurable by the
>    administrator with the target size of the container.
> 
>    This has obvious drawbacks as well. What's the default value? Also,
>    a lot of setups are dead simple: set a hard limit and expect the
>    workload to adhere to that, period. Nobody is going to reliably set
>    another cgroup file that a workload may or may not consume.

Yeah, this is a weird option.

> 
> The third option is to wash our hands of all of this, provide the
> static hierarchy settings to the leaves (like this patch, plus do it
> for the other knobs as well) and let userspace figure it out.

Idk, I see very little value in it. I'm not necessarily opposing this
patchset, just not seeing a lot of value.
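
For reference, the static part is roughly the minimum of memory.max over a
cgroup and its ancestors. A minimal sketch of doing that walk from userspace,
assuming cgroup v2 is mounted at /sys/fs/cgroup and that the ancestor
directories are visible to the caller (often not the case inside a cgroup
namespace); effective_max() and parse_limit() are hypothetical helper names:

```python
import os
from typing import Optional


def parse_limit(text: str) -> Optional[int]:
    """Parse a cgroup v2 limit file value; "max" means no limit (None)."""
    text = text.strip()
    return None if text == "max" else int(text)


def effective_max(
    cgroup_dir: str,
    root: str = "/sys/fs/cgroup",
    knob: str = "memory.max",
) -> Optional[int]:
    """Return the smallest limit on the path from cgroup_dir up to root.

    Returns None if no level on the path sets a limit.
    """
    root = os.path.abspath(root)
    path = os.path.abspath(cgroup_dir)
    effective: Optional[int] = None
    while True:
        try:
            with open(os.path.join(path, knob)) as f:
                limit = parse_limit(f.read())
        except FileNotFoundError:
            # The root cgroup does not have a memory.max file.
            limit = None
        if limit is not None:
            effective = limit if effective is None else min(effective, limit)
        parent = os.path.dirname(path)
        # Stop at the cgroup root, or at "/" if cgroup_dir was not under root.
        if path == root or parent == path:
            break
        path = parent
    return effective
```

Note that this only gives the tightest limit on the path. Because of sharing
with siblings, it does not say how much memory the cgroup was actually given.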

Maybe I'm missing something, but somehow it wasn't a problem for many years.
Nothing really changed here.

So maybe someone can come up with a better explanation of the specific problem
we're trying to solve here?

Thanks!