linux-kernel - Re: [PATCH] memcg: add hierarchical effective limits for v2

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <j2qxkp3tjq7yenl3tkjisaxry7aqdwaxlpx7rn7mpkdi7fkf2c@xva7vgzspb2p>
Date: Mon, 10 Feb 2025 10:34:54 -0800
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: Michal Koutný <mkoutny@...e.com>
Cc: "T.J. Mercier" <tjmercier@...gle.com>, Tejun Heo <tj@...nel.org>, 
	Johannes Weiner <hannes@...xchg.org>, Michal Hocko <mhocko@...nel.org>, 
	Roman Gushchin <roman.gushchin@...ux.dev>, Muchun Song <muchun.song@...ux.dev>, linux-mm@...ck.org, 
	cgroups@...r.kernel.org, linux-kernel@...r.kernel.org, 
	Meta kernel team <kernel-team@...a.com>
Subject: Re: [PATCH] memcg: add hierarchical effective limits for v2

On Mon, Feb 10, 2025 at 05:24:17PM +0100, Michal Koutný wrote:
> Hello.
> 
> On Thu, Feb 06, 2025 at 11:09:05AM -0800, Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> > Oh I totally forgot about your series. In my use-case, it is not about
> > dynamically knowning how much they can expand and adjust themselves but
> > rather knowing statically upfront what resources they have been given.
> 
> From the memcg PoV, the effective value doesn't tell how much they were
> given (because of sharing).
> 
> > More concretely, these are workloads which used to completely occupy a
> > single machine, though within containers but without limits. These
> > workloads used to look at machine level metrics at startup on how much
> > resources are available.
> 
> I've been there but haven't found convincing mapping of global to memcg
> limits.
> 
> The issue is that such a value won't guarantee no OOM when below because
> it can be (generally) effectively shared.
> 
> (Alas, apps typically don't express their memory needs in units of
> PSI. So it boils down to a system wide monitor like systemd-oomd and
> cooperation with it.)
> 

I think you missed the static partitioning of resources use-case I
mentioned. The issue you are pointing exist for the system level metrics
as well i.e. a worklod looking at system metrics can't say how much they
are given but in my specific case, the workloads know they occupy the
full machine. Now we want to move such workloads to multi-tenant
environment but the resources are still statically partitioned and not
overcommitted, so effective limit will tell how much they are given.

> > Now these workloads are being moved to multi-tenant environment but
> > still the machine is partitioned statically between the workloads. So,
> > these workloads need to know upfront how much resources are allocated to
> > them upfront and the way the cgroup hierarchy is setup, that information
> > is a bit above the tree.
> 
> FTR, e.g. in systemd setups, this can be partially overcome by exposed
> EffectiveMemoryMax= (the service manager who configures the resources
> also can do the ancestry traversal).
> kubernetes has downward API where generic resource info is shared into
> containers and I recall that lxcfs could mangle procfs
> memory info wrt memory limits for legacy apps.
> 
> 
> As I think about it, the cgroupns (in)visibility should be resolved by
> assigning the proper limit to namespace's root group memory.max (read
> only for contained user) and the traversal...
> 

I think here your point is why not have userspace based solution. I
think it is possible but not convenient and adds an external dependency
in the workload.

> 
> On Thu, Feb 06, 2025 at 11:37:31AM -0800, "T.J. Mercier" <tjmercier@...gle.com> wrote:
> > but having a single file to read instead of walking up the
> > tree with multiple reads to calculate an effective limit would be
> > nice.
> 
> ...in kernel is nice but possible performance gain isn't worth hiding
> the shareability of the effective limit.
> 
> 
> So I wonder what is the current PoV of more MM people...

Yup, let's see more opinion on this.

Thanks Michal for your feedback.