Message-ID: <20251105074917.94531-1-leon.huangfu@shopee.com>
Date: Wed, 5 Nov 2025 15:49:16 +0800
From: Leon Huang Fu <leon.huangfu@...pee.com>
To: linux-mm@...ck.org
Cc: hannes@...xchg.org,
mhocko@...nel.org,
roman.gushchin@...ux.dev,
shakeel.butt@...ux.dev,
muchun.song@...ux.dev,
akpm@...ux-foundation.org,
joel.granados@...nel.org,
jack@...e.cz,
laoar.shao@...il.com,
mclapinski@...gle.com,
kyle.meyer@....com,
corbet@....net,
lance.yang@...ux.dev,
leon.huangfu@...pee.com,
linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org,
cgroups@...r.kernel.org
Subject: [PATCH mm-new v2] mm/memcontrol: Flush stats on write to stat file
On high-core-count systems, memory cgroup statistics can become stale
due to per-CPU caching and deferred aggregation. Monitoring tools and
management applications sometimes need guaranteed up-to-date statistics
at specific points in time to make accurate decisions.
Add write handlers to both memory.stat and memory.numa_stat so that
userspace can explicitly force an immediate flush of memory statistics.
When "1" is written to either file, the handler calls
css_rstat_flush() on the cgroup's css, which unconditionally flushes
all pending statistics for the cgroup and its descendants.
The write operation validates the input and only accepts the value "1",
returning -EINVAL for any other input.
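The accept/reject behavior described above can be modeled in userspace.
This is only a sketch: model_stat_write() mirrors the val != 1 check in
the patch, while model_file_write() and the "flushed" flag are
hypothetical stand-ins. It assumes the written string is converted to a
u64 before the handler runs and approximates that conversion with
strtoull(), which is not identical to the kernel's parsing.

```c
#include <errno.h>
#include <stdlib.h>

/* Userspace model of the handler in this patch: only the value 1 is valid. */
static int model_stat_write(unsigned long long val, int *flushed)
{
	if (val != 1)
		return -EINVAL;

	*flushed = 1;	/* stands in for the kernel-side flush */
	return 0;
}

/*
 * Hypothetical helper: convert the written string to a number, allowing
 * one trailing newline, then hand the value to the model handler.
 */
static int model_file_write(const char *buf, int *flushed)
{
	unsigned long long val;
	char *end;

	errno = 0;
	val = strtoull(buf, &end, 0);
	if (errno || end == buf)
		return -EINVAL;
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;

	return model_stat_write(val, flushed);
}
```

In this model, model_file_write("1\n", &flushed) succeeds and sets the
flag, while "0", "2" and non-numeric input are rejected with -EINVAL.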
Usage example:

    # Force immediate flush before reading critical statistics
    echo 1 > /sys/fs/cgroup/mygroup/memory.stat
    cat /sys/fs/cgroup/mygroup/memory.stat
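A monitoring tool written in C could do the same thing as the shell
commands above. The following is a minimal sketch, assuming a kernel
with this patch applied. The helper names memcg_flush_stats() and
memcg_parse_stat() are hypothetical and not part of any existing
library; on kernels without the patch the write is expected to fail.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Write "1" to <cgroup_dir>/memory.stat to request a flush.
 * Returns 0 on success or a negative errno value.
 */
static int memcg_flush_stats(const char *cgroup_dir)
{
	char path[4096];
	ssize_t written;
	int fd, n, err = 0;

	n = snprintf(path, sizeof(path), "%s/memory.stat", cgroup_dir);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENAMETOOLONG;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -errno;

	written = write(fd, "1", 1);
	if (written < 0)
		err = -errno;
	else if (written != 1)
		err = -EIO;

	close(fd);
	return err;
}

/*
 * Look up "key value" in flat-keyed text such as the contents of
 * memory.stat. Returns 0 and stores the value in *out, or -ENOENT
 * if the key is not present.
 */
static int memcg_parse_stat(const char *text, const char *key,
			    unsigned long long *out)
{
	size_t klen = strlen(key);
	const char *line = text;

	while (line && *line) {
		if (!strncmp(line, key, klen) && line[klen] == ' ') {
			*out = strtoull(line + klen + 1, NULL, 10);
			return 0;
		}
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -ENOENT;
}
```

A caller would invoke memcg_flush_stats("/sys/fs/cgroup/mygroup"), read
memory.stat into a buffer, and pass that buffer to memcg_parse_stat()
with a key such as "anon".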
This provides several benefits:

1. On-demand accuracy: Tools can flush only when needed, avoiding
   continuous overhead
2. Targeted flushing: Allows flushing specific cgroups when precision
   is required for particular workloads
3. Integration flexibility: Monitoring scripts can decide when to pay
   the flush cost based on their specific accuracy requirements
The implementation is shared between cgroup v1 and v2 interfaces,
with memory_stat_write() providing the common validation and flush
logic. Both memory.stat and memory.numa_stat use the same write
handler since they both benefit from forcing accurate statistics.
The documentation is updated to reflect that these files are now
read-write instead of read-only, and to explain the write behavior.
Signed-off-by: Leon Huang Fu <leon.huangfu@...pee.com>
---
v1 -> v2:
- Flush stats when the file is written (per Michal).
- https://lore.kernel.org/linux-mm/20251104031908.77313-1-leon.huangfu@shopee.com/
Documentation/admin-guide/cgroup-v2.rst | 31 +++++++++++++++++--------
mm/memcontrol-v1.c | 2 ++
mm/memcontrol-v1.h | 1 +
mm/memcontrol.c | 13 +++++++++++
4 files changed, 37 insertions(+), 10 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 3345961c30ac..2a4a81d2cc2f 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1337,7 +1337,7 @@ PAGE_SIZE multiple when read back.
cgroup is within its effective low boundary, the cgroup's
memory won't be reclaimed unless there is no reclaimable
memory available in unprotected cgroups.
- Above the effective low boundary (or
+ Above the effective low boundary (or
effective min boundary if it is higher), pages are reclaimed
proportionally to the overage, reducing reclaim pressure for
smaller overages.
@@ -1525,11 +1525,17 @@ The following nested keys are defined.
generated on this file reflects only the local events.
memory.stat
- A read-only flat-keyed file which exists on non-root cgroups.
+ A read-write flat-keyed file which exists on non-root cgroups.
- This breaks down the cgroup's memory footprint into different
- types of memory, type-specific details, and other information
- on the state and past events of the memory management system.
+ Reading this file breaks down the cgroup's memory footprint into
+ different types of memory, type-specific details, and other
+ information on the state and past events of the memory management
+ system.
+
+ Writing the value "1" to this file forces an immediate flush of
+ memory statistics for this cgroup and its descendants, improving
+ the accuracy of subsequent reads. Any other value will result in
+ an error.
All memory amounts are in bytes.
@@ -1786,11 +1792,16 @@ The following nested keys are defined.
cgroup is mounted with the memory_hugetlb_accounting option).
memory.numa_stat
- A read-only nested-keyed file which exists on non-root cgroups.
+ A read-write nested-keyed file which exists on non-root cgroups.
+
+ Reading this file breaks down the cgroup's memory footprint into
+ different types of memory, type-specific details, and other
+ information per node on the state of the memory management system.
- This breaks down the cgroup's memory footprint into different
- types of memory, type-specific details, and other information
- per node on the state of the memory management system.
+ Writing the value "1" to this file forces an immediate flush of
+ memory statistics for this cgroup and its descendants, improving
+ the accuracy of subsequent reads. Any other value will result in
+ an error.
This is useful for providing visibility into the NUMA locality
information within an memcg since the pages are allowed to be
@@ -2173,7 +2184,7 @@ of the two is enforced.
cgroup writeback requires explicit support from the underlying
filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
-btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
+btrfs, f2fs, and xfs. On other filesystems, all writeback IOs are
attributed to the root cgroup.
There are inherent differences in memory and writeback management
diff --git a/mm/memcontrol-v1.c b/mm/memcontrol-v1.c
index 6eed14bff742..8cab6b52424b 100644
--- a/mm/memcontrol-v1.c
+++ b/mm/memcontrol-v1.c
@@ -2040,6 +2040,7 @@ struct cftype mem_cgroup_legacy_files[] = {
{
.name = "stat",
.seq_show = memory_stat_show,
+ .write_u64 = memory_stat_write,
},
{
.name = "force_empty",
@@ -2078,6 +2079,7 @@ struct cftype mem_cgroup_legacy_files[] = {
{
.name = "numa_stat",
.seq_show = memcg_numa_stat_show,
+ .write_u64 = memory_stat_write,
},
#endif
{
diff --git a/mm/memcontrol-v1.h b/mm/memcontrol-v1.h
index 6358464bb416..1c92d58330aa 100644
--- a/mm/memcontrol-v1.h
+++ b/mm/memcontrol-v1.h
@@ -29,6 +29,7 @@ void drain_all_stock(struct mem_cgroup *root_memcg);
unsigned long memcg_events(struct mem_cgroup *memcg, int event);
unsigned long memcg_page_state_output(struct mem_cgroup *memcg, int item);
int memory_stat_show(struct seq_file *m, void *v);
+int memory_stat_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val);
void mem_cgroup_id_get_many(struct mem_cgroup *memcg, unsigned int n);
struct mem_cgroup *mem_cgroup_id_get_online(struct mem_cgroup *memcg);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c34029e92bab..d6a5d872fbcb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4531,6 +4531,17 @@ int memory_stat_show(struct seq_file *m, void *v)
return 0;
}
+int memory_stat_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
+{
+ if (val != 1)
+ return -EINVAL;
+
+ if (css)
+ css_rstat_flush(css);
+
+ return 0;
+}
+
#ifdef CONFIG_NUMA
static inline unsigned long lruvec_page_state_output(struct lruvec *lruvec,
int item)
@@ -4666,11 +4677,13 @@ static struct cftype memory_files[] = {
{
.name = "stat",
.seq_show = memory_stat_show,
+ .write_u64 = memory_stat_write,
},
#ifdef CONFIG_NUMA
{
.name = "numa_stat",
.seq_show = memory_numa_stat_show,
+ .write_u64 = memory_stat_write,
},
#endif
{
--
2.51.2