Message-ID: <CAAVpQUDMj_1p6sVeo=bZ_u34HSX7V3WM6hYG3wHyyCACKrTKmQ@mail.gmail.com>
Date: Wed, 23 Jul 2025 11:06:14 -0700
From: Kuniyuki Iwashima <kuniyu@...gle.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: Eric Dumazet <edumazet@...gle.com>, Michal Koutný <mkoutny@...e.com>, 
	Tejun Heo <tj@...nel.org>, "David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>, 
	Neal Cardwell <ncardwell@...gle.com>, Paolo Abeni <pabeni@...hat.com>, 
	Willem de Bruijn <willemb@...gle.com>, Matthieu Baerts <matttbe@...nel.org>, 
	Mat Martineau <martineau@...nel.org>, Johannes Weiner <hannes@...xchg.org>, 
	Michal Hocko <mhocko@...nel.org>, Roman Gushchin <roman.gushchin@...ux.dev>, 
	Andrew Morton <akpm@...ux-foundation.org>, Simon Horman <horms@...nel.org>, 
	Geliang Tang <geliang@...nel.org>, Muchun Song <muchun.song@...ux.dev>, 
	Kuniyuki Iwashima <kuni1840@...il.com>, netdev@...r.kernel.org, mptcp@...ts.linux.dev, 
	cgroups@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from
 global protocol memory accounting.

On Wed, Jul 23, 2025 at 10:28 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
>
> Cc Tejun & Michal to get their opinion on memcg vs cgroup vs BPF
> options.
>
> On Tue, Jul 22, 2025 at 07:35:52PM -0700, Kuniyuki Iwashima wrote:
> [...]
> > >
> > > Running workloads in root cgroup is not normal and comes with a warning
> > > of no isolation provided.
> > >
> > > I looked at the patch again to understand the modes you are introducing.
> > > Initially, I thought the series introduced multiple modes, including an
> > > option to exclude network memory from memcg accounting. However, if I
> > > understand correctly, that is not the case—the opt-out applies only to
> > > the global TCP/UDP accounting. That’s a relief, and I apologize for the
> > > misunderstanding.
> > >
> > > If I’m correct, you need a way to exclude a workload from the global
> > > TCP/UDP accounting, and currently, memcg serves as a convenient
> > > abstraction for the workload. Please let me know if I misunderstood.
> >
> > Correct.
> >
> > Currently, memcg by itself cannot guarantee that memory allocation for
> > socket buffer does not fail even when memory.current < memory.max
> > due to the global protocol limits.
> >
> > It means we need to increase the global limits to
> >
> > (bytes of TCP socket buffer in each cgroup) * (number of cgroup)
> >
> > , which is hard to predict, and I guess that's the reason why you
> > or Wei set tcp_mem[] to UINT_MAX so that we can ignore the global
> > limit.
>
> No that was not the reason. The main reason behind max tcp_mem global
> limit was it was not needed

But the global limit did take effect, so you had to set tcp_mem
to unlimited.

> as memcg should account and limit the
> network memory.
> I think the reason you don't want tcp_mem global limit
> unlimited now is

memcg has been subject to the global limit from day 0.

And note that not every process is under memcg with memory.max
configured.
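
To illustrate the point, here is a minimal sketch of the two checks a
socket buffer charge has to pass.  It is not kernel code: struct
charge_ctx, sk_charge_allowed() and the "decoupled" flag are
hypothetical names, global_limit merely stands in for the hard limit
tcp_mem[2], and pressure states and forced allocations are ignored.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of the counters involved (in pages). */
struct charge_ctx {
	long global_pages;  /* pages charged to the protocol by all sockets */
	long global_limit;  /* stand-in for the hard limit, tcp_mem[2] */
	long memcg_pages;   /* stand-in for memory.current */
	long memcg_max;     /* stand-in for memory.max */
};

/*
 * Returns true if charging @pages would be allowed.
 * @decoupled models a memcg that opted out of the global accounting.
 */
static bool sk_charge_allowed(const struct charge_ctx *c, long pages,
			      bool decoupled)
{
	/* The memcg limit always applies. */
	if (c->memcg_pages + pages > c->memcg_max)
		return false;

	/* Without the opt-out, the global limit can still fail the charge. */
	if (!decoupled && c->global_pages + pages > c->global_limit)
		return false;

	return true;
}
```

With global_pages already at global_limit, a charge fails even when
memcg_pages is far below memcg_max, unless the memcg is decoupled.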


> you have internal feature to let workloads opt out of
> the memcg accounting of network memory which is causing isolation
> issues.
>
> >
> > But we should keep tcp_mem[] within a sane range in the first place.
> >
> > This series allows us to configure memcg limits only and let memcg
> > guarantee no failure until it fully consumes memory.max.
> >
> > The point is that memcg should not be affected by the global limits,
> > and this is orthogonal with the assumption that every workload should
> > be running under memcg.
> >
> >
> > >
> > > Now memcg is one way to represent the workload. Another more natural, at
> > > least to me, is the core cgroup. Basically cgroup.something interface.
> > > BPF is yet another option.
> > >
> > > To me cgroup seems preferable but let's see what other memcg & cgroup
> > > folks think. Also note that for cgroup and memcg the interface will need
> > > to be hierarchical.
> >
> > As the root cgroup doesn't have the knob, these combinations are
> > considered hierarchical:
> >
> > (parent, child) = (0, 0), (0, 1), (1, 1)
> >
> > and only the pattern below is not considered hierarchical
> >
> > (parent, child) = (1, 0)
> >
> > Let's say we lock the knob at the first socket creation like your
> > idea above.
> >
> > If a parent and its child's knobs are (0, 0) and the child creates a
> > socket, the child memcg is locked as 0.  When the parent enables
> > the knob, we must check all child cgroups as well.  Or do we lock
> > all the parents' knobs when a socket is created in a child cgroup
> > with knob=0?  In any case, we need a global lock.
> >
> > Well, I understand that the hierarchical semantics is preferable
> > for cgroup but I think it does not resolve any real issue and rather
> > churns the code unnecessarily.
>
> All this is implementation detail and I am asking about semantics. More
> specifically:
>
> 1. Will the root be non-isolated always?

Yes, because the root cgroup doesn't have memcg.
Also, the knob has CFTYPE_NOT_ON_ROOT.


> 2. If a cgroup is isolated, does it mean all its descendants are
>    isolated?

No, but this is because we MUST think about how we handle
the scenario above, where (parent, child) = (0, 0) becomes (1, 0).

We cannot think about the semantics without implementation
detail.  And if we allow such a scenario, the hierarchical semantics
is fake and has no meaning.
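
For reference, the (parent, child) patterns discussed above can be
written down as a small sketch.  This is an illustration only, not
code from the series: struct child_state, knobs_are_hierarchical() and
parent_can_enable() are hypothetical names, and it assumes a child's
knob is locked once the child creates its first socket.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-child state; knob = 1 means "decoupled". */
struct child_state {
	bool knob;    /* current value of the child's knob */
	bool locked;  /* true once the child has created a socket */
};

/* Only (parent, child) = (1, 0) is not considered hierarchical. */
static bool knobs_are_hierarchical(bool parent, bool child)
{
	return !parent || child;
}

/*
 * Enabling the parent's knob would have to walk every child: a child
 * locked at 0 can no longer follow, so the result would be (1, 0).
 * Assumption: children that are not locked yet could still be flipped
 * together with the parent.
 */
static bool parent_can_enable(const struct child_state *children, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (children[i].locked && !children[i].knob)
			return false;
	}
	return true;
}
```

The walk in parent_can_enable() races with socket creation in the
children, which is where the global lock would come in.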


> 3. Will there ever be a reasonable use-case where there is non-isolated
>    sub-tree under an isolated ancestor?

I don't think so, but again, we need to think about the scenario above;
otherwise, your ideal semantics is just broken.

Also, "no reasonable scenario" does not always mean "we must
prevent the scenario".

If there's nothing harmful, we can just let it be, especially if such
a restriction gives nothing and rather hurts performance with no
good reason.


>
> Please give some thought to the above (and related) questions.

Please think about the implementation detail and whether its trade-off
(just keeping semantics vs. code churn & perf regression) really
makes sense.


>
> I am still not convinced that memcg is the right home for this opt-out
> feature. I have CCed cgroup folks to get their opinion as well.
