Message-ID: <CAPV86rrt0YT-npNSBJ_eHvAYdr_j1qkN7H+J4QLN8zsfi5TJ4w@mail.gmail.com>
Date: Wed, 5 Nov 2025 14:01:33 +0800
From: Leon Huang Fu <leon.huangfu@...pee.com>
To: Michal Hocko <mhocko@...e.com>
Cc: linux-mm@...ck.org, hannes@...xchg.org, roman.gushchin@...ux.dev,
shakeel.butt@...ux.dev, muchun.song@...ux.dev, akpm@...ux-foundation.org,
joel.granados@...nel.org, jack@...e.cz, laoar.shao@...il.com,
mclapinski@...gle.com, kyle.meyer@....com, corbet@....net,
lance.yang@...ux.dev, linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org,
cgroups@...r.kernel.org
Subject: Re: [PATCH mm-new] mm/memcontrol: Introduce sysctl vm.memcg_stats_flush_threshold

On Tue, Nov 4, 2025 at 5:21 PM Michal Hocko <mhocko@...e.com> wrote:
>
> On Tue 04-11-25 11:19:08, Leon Huang Fu wrote:
> > The current implementation uses a flush threshold calculated as
> > MEMCG_CHARGE_BATCH * num_online_cpus() for determining when to
> > aggregate per-CPU memory cgroup statistics. On systems with high core
> > counts, this threshold can become very large (e.g., 64 * 256 = 16,384
> > on a 256-core system), leading to stale statistics when userspace reads
> > memory.stat files.
> >
> > This is particularly problematic for monitoring and management tools
> > that rely on reasonably fresh statistics, as they may observe data that
> > is thousands of updates out of date.
> >
> > Introduce a new sysctl, vm.memcg_stats_flush_threshold, that allows
> > administrators to override the flush threshold specifically for
> > userspace reads of memory.stat. When set to 0 (default), the behavior
> > remains unchanged, using the automatic calculation. When set to a
> > non-zero value, userspace reads will use the custom threshold for more
> > frequent flushing.
>
> How are admins supposed to know how to tune this? Wouldn't it make more
> sense to allow explicit flushing on write to the file? That would allow
> admins to implement their preferred accuracy tuning by writing to the file
> when the precision is required.

Thank you for the feedback. Let me clarify the use case and design rationale.

The threshold approach is intended for scenarios where administrators want to
improve accuracy for existing monitoring tools on high core-count systems. On
such systems, the default threshold (MEMCG_CHARGE_BATCH * num_cpus) can reach
16K+ updates, causing monitoring dashboards to display stale data.
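
For example, on a 256-core machine the implicit threshold can be checked
with a quick back-of-the-envelope calculation (MEMCG_CHARGE_BATCH is 64,
per the numbers above):

  nproc                    # number of online CPUs, e.g. 256
  echo $((64 * $(nproc)))  # implicit flush threshold, e.g. 16384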

Regarding tunability: while choosing an exact threshold value requires some
understanding, the principle is simple. Lower values mean fresher stats but
higher overhead. Administrators can start conservatively (e.g., at 1/4 of
the default: num_cpus * 16) and adjust based on observed overhead.
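
As a concrete (purely illustrative) example on that same 256-core machine:

  # Start at roughly a quarter of the default (256 CPUs * 16 = 4096),
  # then adjust based on observed flush overhead:
  sysctl -w vm.memcg_stats_flush_threshold=4096

  # Or persist it across reboots (the file name is arbitrary):
  echo 'vm.memcg_stats_flush_threshold = 4096' > /etc/sysctl.d/99-memcg.conf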

Your suggestion about allowing writes to memory.stat to trigger explicit
flushing is interesting. Comparing the two approaches:

- Threshold (this patch):
  - Administrator sets once system-wide via sysctl
  - Affects all memory.stat reads automatically
  - Tradeoff: harder to tune, always-on overhead

- Write-to-flush (your suggestion):
  - Tools write to memory.stat before reading: echo 1 > memory.stat
  - Per-cgroup, on-demand control
  - Tradeoff: requires tool modifications, but more precise control

Actually, your approach may be more elegant: tools pay the flush cost only
when they need accuracy, rather than imposing a system-wide policy. The
write-to-flush pattern is also more discoverable and self-documenting.
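
To make that concrete, a monitoring tool would do something like the
following (proposed semantics, not yet implemented; the cgroup path is
illustrative):

  # Force a flush, then read fresh statistics; the flush cost is paid
  # only when accuracy is actually needed:
  echo 1 > /sys/fs/cgroup/myservice/memory.stat
  cat /sys/fs/cgroup/myservice/memory.stat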

Let me try your approach in the next revision.

Thanks,
Leon

>
> --
> Michal Hocko
> SUSE Labs