Message-ID: <jc6z5d7d26zunaf6b4qtwegdoljz665jjcigb4glkb6hdy6ap2@2gn6s52s6vfw>
Date: Tue, 22 Jul 2025 08:52:25 -0700
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: Eric Dumazet <edumazet@...gle.com>
Cc: Kuniyuki Iwashima <kuniyu@...gle.com>,
"David S. Miller" <davem@...emloft.net>, Jakub Kicinski <kuba@...nel.org>,
Neal Cardwell <ncardwell@...gle.com>, Paolo Abeni <pabeni@...hat.com>,
Willem de Bruijn <willemb@...gle.com>, Matthieu Baerts <matttbe@...nel.org>,
Mat Martineau <martineau@...nel.org>, Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...nel.org>, Roman Gushchin <roman.gushchin@...ux.dev>,
Andrew Morton <akpm@...ux-foundation.org>, Simon Horman <horms@...nel.org>,
Geliang Tang <geliang@...nel.org>, Muchun Song <muchun.song@...ux.dev>,
Kuniyuki Iwashima <kuni1840@...il.com>, netdev@...r.kernel.org, mptcp@...ts.linux.dev,
cgroups@...r.kernel.org, linux-mm@...ck.org
Subject: Re: [PATCH v1 net-next 13/13] net-memcg: Allow decoupling memcg from
global protocol memory accounting.
On Tue, Jul 22, 2025 at 08:24:23AM -0700, Eric Dumazet wrote:
> On Tue, Jul 22, 2025 at 8:14 AM Shakeel Butt <shakeel.butt@...ux.dev> wrote:
> >
> > On Mon, Jul 21, 2025 at 08:35:32PM +0000, Kuniyuki Iwashima wrote:
> > > Some protocols (e.g., TCP, UDP) implement memory accounting for socket
> > > buffers and charge memory to per-protocol global counters pointed to by
> > > sk->sk_prot->memory_allocated.
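(For reference, a simplified sketch of what that per-protocol counter
access looks like; the in-tree helpers in include/net/sock.h also do
per-CPU batching, which is omitted here:)

  /* Simplified: pages are charged to a counter shared by every
   * socket of the protocol, e.g. tcp_memory_allocated for TCP.
   */
  static inline long sk_memory_allocated(const struct sock *sk)
  {
          return atomic_long_read(sk->sk_prot->memory_allocated);
  }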
> > >
> > > When running under a non-root cgroup, this memory is also charged to the
> > > memcg as sock in memory.stat.
> > >
> > > Even when memory usage is controlled by memcg, sockets using such protocols
> > > are still subject to global limits (e.g., /proc/sys/net/ipv4/tcp_mem).
> > >
> > > This makes it difficult to accurately estimate and configure appropriate
> > > global limits, especially in multi-tenant environments.
> > >
> > > If all workloads were guaranteed to be controlled under memcg, the issue
> > > could be worked around by setting tcp_mem[0..2] to UINT_MAX.
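(Illustrative only: tcp_mem takes three page counts, min/pressure/max,
so the workaround amounts to writing three huge values, e.g.:)

  echo "4294967295 4294967295 4294967295" > /proc/sys/net/ipv4/tcp_mem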
> > >
> > > In reality, this assumption does not always hold, and a single workload
> > > that opts out of memcg can consume memory up to the global limit,
> > > becoming a noisy neighbour.
> > >
> >
> > Sorry but the above is not reasonable. On a multi-tenant system no
> > workload should be able to opt out of memcg accounting if isolation is
> > needed. If a workload can opt out then there is no guarantee.
>
> Deployment issue?
>
> In a multi-tenant system you cannot suddenly force all workloads to
> be TCP memcg charged. This has caused many OOMs.
Let's discuss the above at the end.
>
> Also, the current situation of maintaining two limits (memcg one, plus
> global tcp_memory_allocated) is very inefficient.
Agree.
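(For context, the double charge being discussed is visible in
__sk_mem_raise_allocated() in net/core/sock.c; roughly, and heavily
simplified:)

  /* Heavily simplified from net/core/sock.c:__sk_mem_raise_allocated().
   * Each allocation is charged twice: once to the global per-protocol
   * counter and, when the socket belongs to a memcg, once to the memcg.
   */
  long allocated = sk_memory_allocated_add(sk, amt);      /* global */

  if (mem_cgroup_sockets_enabled && sk->sk_memcg &&
      !mem_cgroup_charge_skmem(sk->sk_memcg, amt, gfp))   /* memcg  */
          goto suppress_allocation;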
>
> If we trust memcg, then why have an expensive safety belt?
>
> With this series, we can finally use one or the other limit. This
> should have been done from day-0 really.
Same, I agree.
>
> >
> > In addition, please avoid adding a per-memcg knob. Why not have a
> > system-level setting for the decoupling? I would say start with a
> > build-time config setting or a boot parameter; then, if really needed,
> > we can discuss whether a system-level setting that can be toggled at
> > runtime is warranted, though there might be challenges there.
>
> Build time or boot parameter? I fail to see how it can be more convenient.
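(For concreteness, a hypothetical sketch of the boot-parameter variant;
the parameter name below is made up and nothing like it exists today:)

  /* Hypothetical: a global, boot-time decoupling switch wired up with
   * the standard __setup() helper. The name is illustrative only.
   */
  static bool net_memcg_decoupled __ro_after_init;

  static int __init net_memcg_decoupled_setup(char *str)
  {
          net_memcg_decoupled = true;
          return 1;       /* parameter consumed */
  }
  __setup("net_memcg_decoupled", net_memcg_decoupled_setup);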
I think we agree on decoupling the global and memcg accounting of
network memory. I am still not clear on the need for a per-memcg knob.
From your earlier comment, it seems like you want a mix of jobs with
memcg-limited network memory accounting and jobs with global network
memory accounting running concurrently on a system. Is that correct?
I expect this state, with jobs using different network accounting
configs running concurrently, to be temporary while the migration from
one to the other is happening. Please correct me if I am wrong.
My main concern with the memcg knob is that it is permanent and
requires hierarchical semantics. There is no need to add a permanent
interface for a temporary need, and I don't see clear hierarchical
semantics for this interface.
I am wondering if alternative approaches for per-workload settings have
been explored, starting with BPF.
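(A hypothetical sketch of that BPF direction: a cgroup sock_create
program already runs for every socket a workload creates, so per-workload
policy could hang off it. No BPF-visible "skip global accounting" flag
exists today; the real, writable mark field is used below only to show
the attachment point:)

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  SEC("cgroup/sock_create")
  int tag_workload_sockets(struct bpf_sock *sk)
  {
          /* Stand-in for a hypothetical opt-out flag; mark is a real
           * writable field for this program type.
           */
          sk->mark = 0x1;
          return 1;       /* allow the socket to be created */
  }

  char LICENSE[] SEC("license") = "GPL";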