Message-ID: <20251110062053.83754-1-leon.huangfu@shopee.com>
Date: Mon, 10 Nov 2025 14:20:53 +0800
From: Leon Huang Fu <leon.huangfu@...pee.com>
To: inwardvessel@...il.com
Cc: akpm@...ux-foundation.org,
cgroups@...r.kernel.org,
corbet@....net,
hannes@...xchg.org,
jack@...e.cz,
joel.granados@...nel.org,
kyle.meyer@....com,
lance.yang@...ux.dev,
laoar.shao@...il.com,
leon.huangfu@...pee.com,
linux-doc@...r.kernel.org,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org,
mclapinski@...gle.com,
mhocko@...nel.org,
muchun.song@...ux.dev,
roman.gushchin@...ux.dev,
shakeel.butt@...ux.dev
Subject: Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file

On Fri, Nov 7, 2025 at 1:02 AM JP Kobryn <inwardvessel@...il.com> wrote:
>
> On 11/4/25 11:49 PM, Leon Huang Fu wrote:
> > On high-core count systems, memory cgroup statistics can become stale
> > due to per-CPU caching and deferred aggregation. Monitoring tools and
> > management applications sometimes need guaranteed up-to-date statistics
> > at specific points in time to make accurate decisions.
> >
> > This patch adds write handlers to both memory.stat and memory.numa_stat
> > files to allow userspace to explicitly force an immediate flush of
> > memory statistics. When "1" is written to either file, it triggers
> > __mem_cgroup_flush_stats(memcg, true), which unconditionally flushes
> > all pending statistics for the cgroup and its descendants.
> >
> > The write operation validates the input and only accepts the value "1",
> > returning -EINVAL for any other input.
> >
> > Usage example:
> > # Force immediate flush before reading critical statistics
> > echo 1 > /sys/fs/cgroup/mygroup/memory.stat
> > cat /sys/fs/cgroup/mygroup/memory.stat
> >
> > This provides several benefits:
> >
> > 1. On-demand accuracy: Tools can flush only when needed, avoiding
> > continuous overhead
> >
> > 2. Targeted flushing: Allows flushing specific cgroups when precision
> > is required for particular workloads
>
> I'm curious about your use case. Since you mention required precision,
> are you planning on manually flushing before every read?
>
Yes, for our use case, manual flushing before critical reads is necessary.
We plan to run on high-core-count servers (224-256 cores), where the
per-CPU batching threshold (MEMCG_CHARGE_BATCH * num_online_cpus) allows
up to 16,384 events (on 256 cores) to accumulate before an automatic flush
is triggered. This means memory statistics are likely to be stale, often
beyond the tolerance acceptable for critical memory management decisions.

Our monitoring tools don't need to flush on every read - only when making
critical decisions like OOM adjustments, container placement, or resource
limit enforcement. The opt-in nature of this mechanism allows us to pay the
flush cost only when precision is truly required.
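For a concrete sense of the scale, here is a rough sketch of that threshold
arithmetic (assuming MEMCG_CHARGE_BATCH is still 64, as defined in
include/linux/memcontrol.h; the helper below is purely illustrative, not
kernel code):

/*
 * Illustration only: the automatic-flush threshold scales linearly with
 * the number of online CPUs. Assumes MEMCG_CHARGE_BATCH == 64.
 */
#define MEMCG_CHARGE_BATCH	64U

static unsigned long stale_events_bound(unsigned int online_cpus)
{
	/* 64 * 256 == 16384 pending updates on a 256-core machine */
	return MEMCG_CHARGE_BATCH * online_cpus;
}
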
> >
> > 3. Integration flexibility: Monitoring scripts can decide when to pay
> > the flush cost based on their specific accuracy requirements
>
> [...]
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index c34029e92bab..d6a5d872fbcb 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -4531,6 +4531,17 @@ int memory_stat_show(struct seq_file *m, void *v)
> > return 0;
> > }
> >
> > +int memory_stat_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
> > +{
> > + if (val != 1)
> > + return -EINVAL;
> > +
> > + if (css)
> > + css_rstat_flush(css);
>
> This is a kfunc. You can do this right now from a bpf program without
> any kernel changes.
>
While css_rstat_flush() is indeed available as a BPF kfunc, the practical
challenge is determining when to call it. The natural hook point would be
memory_stat_show() using fentry, but this runs into a BPF verifier
limitation: the function's 'struct seq_file *' argument doesn't provide a
trusted path to obtain the 'struct cgroup_subsys_state *css' pointer
required by css_rstat_flush().

I attempted to implement this via BPF (code below), but it fails
verification because deriving the css pointer through
seq->private->kn->parent->priv results in an untrusted scalar that the
verifier rejects for the kfunc call:

  R1 invalid mem access 'scalar'

The verifier error occurs because:

1. seq->private is rdonly_untrusted_mem
2. Dereferencing through kernfs_node internals produces untracked pointers
3. css_rstat_flush() requires a trusted css pointer per its kfunc definition

A direct userspace interface (memory.stat_refresh) avoids these verifier
limitations and provides a cleaner, more maintainable solution that doesn't
require BPF expertise or complex workarounds.

Thanks,
Leon
---
#include "vmlinux.h"
#include "bpf_helpers.h"
#include "bpf_tracing.h"
char _license[] SEC("license") = "GPL";
extern void css_rstat_flush(struct cgroup_subsys_state *css) __weak __ksym;
static inline struct cftype *of_cft(struct kernfs_open_file *of)
{
return of->kn->priv;
}
struct cgroup_subsys_state *of_css(struct kernfs_open_file *of)
{
struct cgroup *cgrp = of->kn->parent->priv;
struct cftype *cft = of_cft(of);
/*
* This is open and unprotected implementation of cgroup_css().
* seq_css() is only called from a kernfs file operation which has
* an active reference on the file. Because all the subsystem
* files are drained before a css is disassociated with a cgroup,
* the matching css from the cgroup's subsys table is guaranteed to
* be and stay valid until the enclosing operation is complete.
*/
if (cft->ss)
return cgrp->subsys[cft->ss->id];
else
return &cgrp->self;
}
static inline struct cgroup_subsys_state *seq_css(struct seq_file *seq)
{
return of_css(seq->private);
}
SEC("fentry/memory_stat_show")
int BPF_PROG(memory_stat_show, struct seq_file *seq, void *v)
{
struct cgroup_subsys_state *css = seq_css(seq);
if (css)
css_rstat_flush(css);
return 0;
}