Message-ID: <20251104031908.77313-1-leon.huangfu@shopee.com>
Date: Tue, 4 Nov 2025 11:19:08 +0800
From: Leon Huang Fu <leon.huangfu@...pee.com>
To: linux-mm@...ck.org
Cc: hannes@...xchg.org,
mhocko@...nel.org,
roman.gushchin@...ux.dev,
shakeel.butt@...ux.dev,
muchun.song@...ux.dev,
akpm@...ux-foundation.org,
joel.granados@...nel.org,
jack@...e.cz,
laoar.shao@...il.com,
mclapinski@...gle.com,
kyle.meyer@....com,
corbet@....net,
lance.yang@...ux.dev,
leon.huangfu@...pee.com,
linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org,
cgroups@...r.kernel.org
Subject: [PATCH mm-new] mm/memcontrol: Introduce sysctl vm.memcg_stats_flush_threshold

The current implementation uses a flush threshold of
MEMCG_CHARGE_BATCH * num_online_cpus() to decide when to aggregate
per-CPU memory cgroup statistics. On systems with high core counts,
this threshold can become very large (e.g., 64 * 256 = 16,384 on a
256-core system), leading to stale statistics when userspace reads
memory.stat files.

This is particularly problematic for monitoring and management tools
that rely on reasonably fresh statistics, as they may observe data that
is thousands of updates out of date.

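For example, a monitoring agent on cgroup v2 typically polls counters
straight out of memory.stat (hypothetical cgroup path; any group is
read the same way):

    $ grep -E '^(anon|file|slab) ' /sys/fs/cgroup/mygroup/memory.stat

Each such read can observe values that lag the per-CPU updates by up
to the flush threshold.
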
Introduce a new sysctl, vm.memcg_stats_flush_threshold, that allows
administrators to override the flush threshold specifically for
userspace reads of memory.stat. When set to 0 (default), the behavior
remains unchanged, using the automatic calculation. When set to a
non-zero value, userspace reads will use the custom threshold for more
frequent flushing.

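As a sketch of how an administrator would use this (hypothetical
session; assumes a kernel with this patch applied):

    # sysctl vm.memcg_stats_flush_threshold
    vm.memcg_stats_flush_threshold = 0
    # sysctl -w vm.memcg_stats_flush_threshold=2048
    vm.memcg_stats_flush_threshold = 2048
    # echo 0 > /proc/sys/vm/memcg_stats_flush_threshold   # back to auto
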
Importantly, this change only affects userspace paths. Internal kernel
paths continue to use the default threshold (or ratelimited flushing)
to maintain optimal performance. This is achieved by:
- Introducing mem_cgroup_flush_stats_user() for userspace reads
- Keeping mem_cgroup_flush_stats() unchanged for kernel internal paths
- Updating memory.stat read paths to use mem_cgroup_flush_stats_user()

Also document the new sysctl in Documentation/admin-guide/sysctl/vm.rst,
covering its use cases, example settings for different system
configurations, and the distinction between userspace and kernel flush
behavior.

Signed-off-by: Leon Huang Fu <leon.huangfu@...pee.com>
---
Documentation/admin-guide/sysctl/vm.rst | 48 ++++++++++++++
include/linux/memcontrol.h | 1 +
mm/memcontrol-v1.c | 4 +-
mm/memcontrol.c | 86 +++++++++++++++++++++----
4 files changed, 124 insertions(+), 15 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index ace73480eb9d..f40c629413ea 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -46,6 +46,7 @@ Currently, these files are in /proc/sys/vm:
- lowmem_reserve_ratio
- max_map_count
- mem_profiling (only if CONFIG_MEM_ALLOC_PROFILING=y)
+- memcg_stats_flush_threshold
- memory_failure_early_kill
- memory_failure_recovery
- min_free_kbytes
@@ -515,6 +516,53 @@ memory allocations.
The default value depends on CONFIG_MEM_ALLOC_PROFILING_ENABLED_BY_DEFAULT.
+memcg_stats_flush_threshold
+============================
+
+Control the threshold for flushing memory cgroup statistics when reading
+memory.stat from userspace. Memory cgroup stats are updated frequently in
+per-CPU counters, but these updates need to be periodically aggregated
+(flushed) to provide accurate statistics.
+
+**Important**: This setting ONLY affects userspace reads of memory.stat files.
+Internal kernel paths continue to use the default threshold (or ratelimited
+flushing) to maintain optimal performance in latency-sensitive code paths.
+
+When set to 0 (default), userspace reads use the automatic threshold:
+MEMCG_CHARGE_BATCH * num_online_cpus()
+
+This means on systems with many CPU cores, the threshold can become very high
+(e.g., 64 * 256 = 16,384 updates on a 256-core system), potentially resulting
+in stale statistics when reading memory.stat.
+
+Setting this to a non-zero value overrides the automatic calculation for
+userspace reads only. Lower values result in fresher statistics when reading
+memory.stat but may increase overhead due to more frequent flushing.
+
+Examples:
+
+- On a 256-core system with default (0):
+ Userspace reads use threshold = 64 * 256 = 16,384 updates
+ Internal kernel paths use default thresholds (unaffected)
+
+- Setting to 2048:
+ Userspace reads use threshold = 2,048 updates (much fresher stats)
+ Internal kernel paths use default thresholds (performance maintained)
+
+- Setting to 1024:
+ Userspace reads use threshold = 1,024 updates (even fresher stats)
+ Internal kernel paths use default thresholds (performance maintained)
+
+Note: Memory cgroup statistics are also flushed automatically every 2 seconds
+regardless of this threshold.
+
+This setting is recommended for systems with high core counts, where the
+default threshold can leave statistics too stale for monitoring and
+management tools; internal kernel paths keep their default behavior.
+
+Default: 0 (auto-calculate based on CPU count)
+
+
memory_failure_early_kill
=========================
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 8d2e250535a8..208895e6cf14 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -955,6 +955,7 @@ unsigned long lruvec_page_state_local(struct lruvec *lruvec,
enum node_stat_item idx);
void mem_cgroup_flush_stats(struct mem_cgroup *memcg);
+void mem_cgroup_flush_stats_user(struct mem_cgroup *memcg);
void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg);
void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val);
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 6eed14bff742..3eeb20f6c5ad 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -1792,7 +1792,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v)
int nid;
struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
- mem_cgroup_flush_stats(memcg);
+ mem_cgroup_flush_stats_user(memcg);
for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) {
seq_printf(m, "%s=%lu", stat->name,
@@ -1873,7 +1873,7 @@ void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats));
- mem_cgroup_flush_stats(memcg);
+ mem_cgroup_flush_stats_user(memcg);
for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) {
unsigned long nr;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c34029e92bab..fffcf6518ae0 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -63,6 +63,7 @@
#include <linux/seq_buf.h>
#include <linux/sched/isolation.h>
#include <linux/kmemleak.h>
+#include <linux/sysctl.h>
#include "internal.h"
#include <net/sock.h>
#include <net/ip.h>
@@ -556,10 +557,40 @@ static u64 flush_last_time;
#define FLUSH_TIME (2UL*HZ)
-static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats)
+#define FLUSH_DEFAULT_THRESHOLD (MEMCG_CHARGE_BATCH * num_online_cpus())
+
+/*
+ * Threshold for number of stat updates before triggering a flush.
+ *
+ * Default: 0
+ * - When set to 0 (the default), the threshold is calculated as:
+ * FLUSH_DEFAULT_THRESHOLD
+ * (i.e. MEMCG_CHARGE_BATCH * num_online_cpus())
+ *
+ * Tunable:
+ * - This value can be overridden at runtime using the sysctl:
+ * /proc/sys/vm/memcg_stats_flush_threshold
+ * - Useful for systems with many CPU cores, where the default threshold may
+ * result in stale stats; a lower value leads to more frequent flushing.
+ */
+static int memcg_stats_flush_threshold __read_mostly;
+
+#ifdef CONFIG_SYSCTL
+static const struct ctl_table memcg_sysctl_table[] = {
+ {
+ .procname = "memcg_stats_flush_threshold",
+ .data = &memcg_stats_flush_threshold,
+ .maxlen = sizeof(memcg_stats_flush_threshold),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ },
+};
+#endif
+
+static bool memcg_vmstats_needs_flush(struct memcg_vmstats *vmstats, int threshold)
{
- return atomic_read(&vmstats->stats_updates) >
- MEMCG_CHARGE_BATCH * num_online_cpus();
+ return atomic_read(&vmstats->stats_updates) > threshold;
}
static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
@@ -581,7 +612,7 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
* flushable as well and also there is no need to increase
* stats_updates.
*/
- if (memcg_vmstats_needs_flush(statc->vmstats))
+ if (memcg_vmstats_needs_flush(statc->vmstats, FLUSH_DEFAULT_THRESHOLD))
break;
stats_updates = this_cpu_add_return(statc_pcpu->stats_updates,
@@ -594,9 +625,9 @@ static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val,
}
}
-static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
+static void __mem_cgroup_flush_stats_threshold(struct mem_cgroup *memcg, bool force, int threshold)
{
- bool needs_flush = memcg_vmstats_needs_flush(memcg->vmstats);
+ bool needs_flush = memcg_vmstats_needs_flush(memcg->vmstats, threshold);
trace_memcg_flush_stats(memcg, atomic_read(&memcg->vmstats->stats_updates),
force, needs_flush);
@@ -610,6 +641,20 @@ static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
css_rstat_flush(&memcg->css);
}
+static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
+{
+ __mem_cgroup_flush_stats_threshold(memcg, force, FLUSH_DEFAULT_THRESHOLD);
+}
+
+static void mem_cgroup_flush_stats_threshold(struct mem_cgroup *memcg, int threshold)
+{
+ if (mem_cgroup_disabled())
+ return;
+
+ memcg = memcg ? : root_mem_cgroup;
+ __mem_cgroup_flush_stats_threshold(memcg, false, threshold);
+}
+
/*
* mem_cgroup_flush_stats - flush the stats of a memory cgroup subtree
* @memcg: root of the subtree to flush
@@ -621,13 +666,24 @@ static void __mem_cgroup_flush_stats(struct mem_cgroup *memcg, bool force)
*/
void mem_cgroup_flush_stats(struct mem_cgroup *memcg)
{
- if (mem_cgroup_disabled())
- return;
+ mem_cgroup_flush_stats_threshold(memcg, FLUSH_DEFAULT_THRESHOLD);
+}
- if (!memcg)
- memcg = root_mem_cgroup;
+/*
+ * mem_cgroup_flush_stats_user - flush stats when reading memory.stat from userspace
+ * @memcg: root of the subtree to flush
+ *
+ * This function uses a potentially custom threshold set via sysctl
+ * (memcg_stats_flush_threshold). It should only be used for userspace reads
+ * of memory.stat where fresher stats are desired. Internal kernel paths
+ * should use mem_cgroup_flush_stats() to maintain performance.
+ */
+void mem_cgroup_flush_stats_user(struct mem_cgroup *memcg)
+{
+ int threshold = READ_ONCE(memcg_stats_flush_threshold);
- __mem_cgroup_flush_stats(memcg, false);
+ threshold = threshold ? : FLUSH_DEFAULT_THRESHOLD;
+ mem_cgroup_flush_stats_threshold(memcg, threshold);
}
void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg)
@@ -1474,7 +1530,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s)
*
* Current memory state:
*/
- mem_cgroup_flush_stats(memcg);
+ mem_cgroup_flush_stats_user(memcg);
for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
u64 size;
@@ -4544,7 +4600,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v)
int i;
struct mem_cgroup *memcg = mem_cgroup_from_seq(m);
- mem_cgroup_flush_stats(memcg);
+ mem_cgroup_flush_stats_user(memcg);
for (i = 0; i < ARRAY_SIZE(memory_stats); i++) {
int nid;
@@ -5176,6 +5232,10 @@ int __init mem_cgroup_init(void)
memcg_pn_cachep = KMEM_CACHE(mem_cgroup_per_node,
SLAB_PANIC | SLAB_HWCACHE_ALIGN);
+#ifdef CONFIG_SYSCTL
+ register_sysctl_init("vm", memcg_sysctl_table);
+#endif
+
return 0;
}
--
2.51.2