Message-ID: <dpnv6luzeby3wni3jlcv2utgx4ozfp5zl3zfnhn2shv3q4iejz@sbex7f6azcpc>
Date: Fri, 7 Mar 2025 12:12:55 -0800
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: Yosry Ahmed <yosry.ahmed@...ux.dev>
Cc: Andrew Morton <akpm@...ux-foundation.org>,
JP Kobryn <inwardvessel@...il.com>, Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...nel.org>, Roman Gushchin <roman.gushchin@...ux.dev>,
Muchun Song <muchun.song@...ux.dev>, "David S . Miller" <davem@...emloft.net>,
Eric Dumazet <edumazet@...gle.com>, Jakub Kicinski <kuba@...nel.org>,
Paolo Abeni <pabeni@...hat.com>, netdev@...r.kernel.org, linux-mm@...ck.org,
cgroups@...r.kernel.org, linux-kernel@...r.kernel.org,
Meta kernel team <kernel-team@...a.com>
Subject: Re: [RFC PATCH] memcg: net: improve charging of incoming network
traffic
On Fri, Mar 07, 2025 at 07:41:59PM +0000, Yosry Ahmed wrote:
> On Thu, Mar 06, 2025 at 09:59:36PM -0800, Shakeel Butt wrote:
> > Memory cgroup accounting is expensive and to reduce the cost, the kernel
> > maintains a per-cpu charge cache for a single memcg. So, if a charge
> > request comes for a different memcg, the kernel will flush the old
> > memcg's charge cache, charge the new memcg a fixed amount (64 pages),
> > subtract the requested amount, and store the remainder in the per-cpu
> > charge cache for the new memcg.
> >
> > This mechanism is based on the assumption that the kernel, for locality,
> > keeps a process on a CPU for a long period of time, and that most of the
> > charge requests from that process will be served by that CPU's local
> > charge cache.
> >
> > However, this assumption breaks down for incoming network traffic on a
> > multi-tenant machine. We are in the process of running multiple
> > workloads on a single machine, and when such workloads are network
> > heavy, we see a very high network memory accounting cost. We have
> > observed multiple CPUs spending almost 100% of their time in
> > net_rx_action, with almost all of that time spent in memcg accounting
> > of the network traffic.
> >
> > More precisely, net_rx_action serves packets from multiple workloads
> > and therefore sees a mix of packets from these workloads. Switching the
> > per-cpu cache to a different memcg is very expensive, and we are
> > observing a lot of such memcg switches on the machine. Almost all of
> > the time is spent charging the new memcg and flushing the old memcg's
> > cache. So we clearly need a per-cpu cache that supports multiple memcgs
> > for this scenario.
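
(To make the idea concrete, here is a rough, self-contained sketch of
what a multi-slot per-cpu stock could look like. It is loosely modeled
on the kernel's consume_stock(), but the names, the number of slots and
the refill policy are hypothetical, and the per-cpu and locking details
are omitted; this only illustrates the direction, not the actual patch.)

#include <stdbool.h>
#include <stddef.h>

#define NR_STOCK_SLOTS	4	/* hypothetical number of cached memcgs */

struct mem_cgroup;		/* opaque stand-in for the kernel type */

struct memcg_stock {
	struct mem_cgroup *cached[NR_STOCK_SLOTS];
	unsigned int nr_pages[NR_STOCK_SLOTS];
};

/* Try to serve a charge request from one of the cached slots. */
static bool consume_stock(struct memcg_stock *stock,
			  struct mem_cgroup *memcg, unsigned int nr_pages)
{
	for (size_t i = 0; i < NR_STOCK_SLOTS; i++) {
		if (stock->cached[i] == memcg &&
		    stock->nr_pages[i] >= nr_pages) {
			stock->nr_pages[i] -= nr_pages;
			return true;	/* served locally, no page_counter update */
		}
	}
	/* Miss: the caller falls back to the slow path, which charges a
	 * batch of pages to the page counter and refills one slot
	 * instead of flushing the whole cache.
	 */
	return false;
}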
>
> We've internally faced a different situation on machines with a large
> number of CPUs where the mod_memcg_state(MEMCG_SOCK) call in
> mem_cgroup_[un]charge_skmem() causes latency due to high contention on
> the atomic update in memcg_rstat_updated().
Interesting. At Meta, we are not seeing the latency issue due to
memcg_rstat_updated(), but it is one of the most expensive functions in
our fleet and optimizing it is in our plans.
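
For anyone following along, the shape of that path is roughly the
following (heavily simplified and from memory, so details differ across
kernel versions; the per-cpu counter is modeled with a thread-local
variable here):

#include <stdatomic.h>

#define STATS_UPDATE_BATCH	64	/* per-cpu batch before touching shared state */

static _Thread_local unsigned int pending_updates;	/* stand-in for per-cpu data */
static atomic_int stats_flush_threshold;		/* shared across all CPUs */

static void memcg_rstat_updated_model(int val)
{
	/* Charges and uncharges both count toward the threshold because
	 * the absolute magnitude is accumulated, so they do not cancel
	 * out even when the net change in the stat is zero.
	 */
	pending_updates += val < 0 ? -val : val;
	if (pending_updates > STATS_UPDATE_BATCH) {
		/* This shared update is where the cross-CPU contention
		 * you mention shows up on large machines.
		 */
		atomic_fetch_add(&stats_flush_threshold,
				 pending_updates / STATS_UPDATE_BATCH);
		pending_updates = 0;
	}
}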
>
> In this case, networking performs a lot of charge/uncharge operations,
> but because we count the absolute magnitude of the updates in
> memcg_rstat_updated(), we reach the threshold quickly. In practice, a
> lot of these updates cancel each other out so the net change in the
> stats may not be that large.
>
> However, not using the absolute value of the updates could cause stat
> updates of irrelevant stats with opposite polarity to cancel out,
> potentially delaying stat updates.
>
> I wonder if we can leverage the batching introduced here to fix this
> problem as well. For example, if the charging in
> mem_cgroup_[un]charge_skmem() is satisfied from this cache, can we avoid
> mod_memcg_state() and only update the stats once at the end of batching?
>
> IIUC the current implementation only covers the RX path, so it will
> reduce the number of calls to mod_memcg_state(), but it won't prevent
> charge/uncharge operations from raising the update counter
> unnecessarily. I wonder if the scope of the batching could be increased
> so that both TX and RX use the same cache, and charge/uncharge
> operations cancel out completely in terms of stat updates.
>
> WDYT?
JP (CCed) is currently working on collecting data from our fleet to find
the hottest memcg stats, i.e. the ones with the most updates. I think the
early data show MEMCG_SOCK and MEMCG_KMEM are among the hot ones. JP has
a couple of ideas to improve the situation here which he will experiment
with and share in due time.
Regarding batching for TX and RX, my intention is to keep the charge
batching general purpose, but I think batching the MEMCG_SOCK updates for
networking behind a scoping API can be done and seems like a good idea. I
will do that in a followup.
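
Something along the following lines is what I have in mind for the
scoping API (purely illustrative; the names are hypothetical and the
per-cpu/preemption details are omitted). Charge/uncharge inside the
scope only adjust a local delta, and a single mod_memcg_state() call
happens when the scope ends, so RX and TX updates within the scope
cancel out:

#include <stdbool.h>

struct mem_cgroup;			/* opaque stand-in for the kernel type */
enum memcg_stat_item { MEMCG_SOCK };	/* stand-in for the real stat index */

/* Stand-in declaration loosely matching the in-tree helper, only so the
 * sketch is self-contained.
 */
void mod_memcg_state(struct mem_cgroup *memcg, enum memcg_stat_item idx,
		     int val);

struct sock_stat_scope {
	struct mem_cgroup *memcg;
	long nr_pages;		/* net MEMCG_SOCK delta inside the scope */
};

static void memcg_sock_state_begin(struct sock_stat_scope *scope,
				   struct mem_cgroup *memcg)
{
	scope->memcg = memcg;
	scope->nr_pages = 0;
}

/* Called from the (un)charge paths instead of updating the stat directly. */
static void memcg_sock_state_add(struct sock_stat_scope *scope, long nr_pages)
{
	scope->nr_pages += nr_pages;	/* charges and uncharges cancel here */
}

static void memcg_sock_state_end(struct sock_stat_scope *scope)
{
	if (scope->nr_pages)
		mod_memcg_state(scope->memcg, MEMCG_SOCK, (int)scope->nr_pages);
}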
Thanks for taking a look.