Message-ID: <51f1f343-c29f-49b5-8016-bbda4bc778a2@gmail.com>
Date: Mon, 10 Nov 2025 11:24:05 -0800
From: JP Kobryn <inwardvessel@...il.com>
To: Leon Huang Fu <leon.huangfu@...pee.com>
Cc: akpm@...ux-foundation.org, cgroups@...r.kernel.org, corbet@....net,
hannes@...xchg.org, jack@...e.cz, joel.granados@...nel.org,
kyle.meyer@....com, lance.yang@...ux.dev, laoar.shao@...il.com,
linux-doc@...r.kernel.org, linux-kernel@...r.kernel.org, linux-mm@...ck.org,
mclapinski@...gle.com, mhocko@...nel.org, muchun.song@...ux.dev,
roman.gushchin@...ux.dev, shakeel.butt@...ux.dev
Subject: Re: [PATCH mm-new v2] mm/memcontrol: Flush stats when write stat file
On 11/9/25 10:20 PM, Leon Huang Fu wrote:
> On Fri, Nov 7, 2025 at 1:02 AM JP Kobryn <inwardvessel@...il.com> wrote:
>>
>> On 11/4/25 11:49 PM, Leon Huang Fu wrote:
>>> On high-core count systems, memory cgroup statistics can become stale
>>> due to per-CPU caching and deferred aggregation. Monitoring tools and
>>> management applications sometimes need guaranteed up-to-date statistics
>>> at specific points in time to make accurate decisions.
>>>
>>> This patch adds write handlers to both memory.stat and memory.numa_stat
>>> files to allow userspace to explicitly force an immediate flush of
>>> memory statistics. When "1" is written to either file, it triggers
>>> __mem_cgroup_flush_stats(memcg, true), which unconditionally flushes
>>> all pending statistics for the cgroup and its descendants.
>>>
>>> The write operation validates the input and only accepts the value "1",
>>> returning -EINVAL for any other input.
>>>
>>> Usage example:
>>> # Force immediate flush before reading critical statistics
>>> echo 1 > /sys/fs/cgroup/mygroup/memory.stat
>>> cat /sys/fs/cgroup/mygroup/memory.stat
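For a monitoring tool written in C, the same "write 1, then read" sequence can be sketched as below. This is a minimal sketch that assumes the write handler proposed in this patch is present; `memcg_request_flush` is a hypothetical helper name, and on a kernel without the patch the open or write is expected to fail.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask the kernel to flush memcg stats by writing "1"
 * to a cgroup's memory.stat file, as proposed by the patch under review.
 * Returns 0 on success, -1 on failure with errno set by open() or write().
 */
static int memcg_request_flush(const char *stat_path)
{
	int fd = open(stat_path, O_WRONLY);
	int saved_errno;
	ssize_t n;

	if (fd < 0)
		return -1;

	n = write(fd, "1", 1);
	saved_errno = errno;	/* close() may overwrite errno */
	close(fd);
	errno = saved_errno;

	return n == 1 ? 0 : -1;
}
```

A caller would invoke `memcg_request_flush("/sys/fs/cgroup/mygroup/memory.stat")` and, only if it returns 0, go on to read the file for the flushed values.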
>>>
>>> This provides several benefits:
>>>
>>> 1. On-demand accuracy: Tools can flush only when needed, avoiding
>>>    continuous overhead
>>>
>>> 2. Targeted flushing: Allows flushing specific cgroups when precision
>>>    is required for particular workloads
>>
>> I'm curious about your use case. Since you mention required precision,
>> are you planning on manually flushing before every read?
>>
>
> Yes, for our use case, manual flushing before critical reads is necessary.
> We're going to run on high-core count servers (224-256 cores), where the
> per-CPU batching threshold (MEMCG_CHARGE_BATCH * num_online_cpus) can
> accumulate up to 16,384 events (on 256 cores) before an automatic flush is
> triggered. This means memory statistics are likely to be stale, often exceeding
> acceptable tolerance for critical memory management decisions.
>
> Our monitoring tools don't need to flush on every read - only when making
> critical decisions like OOM adjustments, container placement, or resource
> limit enforcement. The opt-in nature of this mechanism allows us to pay the
> flush cost only when precision is truly required.
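The 16,384 figure quoted above follows directly from the threshold formula. A minimal C sketch of the arithmetic, assuming MEMCG_CHARGE_BATCH is 64 (an assumption that is consistent with the numbers quoted but not stated in this thread) and using a hypothetical helper name:

```c
/* Assumed value; matches 16,384 events on 256 CPUs as quoted above. */
#define MEMCG_CHARGE_BATCH 64U

/*
 * Hypothetical helper: number of pending stat updates that can accumulate
 * before an automatic flush, per MEMCG_CHARGE_BATCH * num_online_cpus.
 */
static unsigned int memcg_flush_threshold(unsigned int online_cpus)
{
	return MEMCG_CHARGE_BATCH * online_cpus;
}
```

Under that assumption, the 224-core end of the range mentioned above works out to 14,336 events.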
>
>>>
>>> 3. Integration flexibility: Monitoring scripts can decide when to pay
>>>    the flush cost based on their specific accuracy requirements
>>
>> [...]
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index c34029e92bab..d6a5d872fbcb 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -4531,6 +4531,17 @@ int memory_stat_show(struct seq_file *m, void *v)
>>>  	return 0;
>>>  }
>>>
>>> +int memory_stat_write(struct cgroup_subsys_state *css, struct cftype *cft, u64 val)
>>> +{
>>> +	if (val != 1)
>>> +		return -EINVAL;
>>> +
>>> +	if (css)
>>> +		css_rstat_flush(css);
>>
>> This is a kfunc. You can do this right now from a bpf program without
>> any kernel changes.
>>
>
> While css_rstat_flush() is indeed available as a BPF kfunc, the practical
> challenge is determining when to call it. The natural hook point would be
> memory_stat_show() using fentry, but this runs into a BPF verifier
> limitation: the function's 'struct seq_file *' argument doesn't provide a
> trusted path to obtain the 'struct cgroup_subsys_state *css' pointer
> required by css_rstat_flush().
Ok, I see this would only work on the css for base stats.
SEC("iter.s/cgroup")
int cgroup_memcg_query(struct bpf_iter__cgroup *ctx)
{
	struct cgroup *cgrp = ctx->cgroup;
	struct cgroup_subsys_state *css;

	if (!cgrp)
		return 1;

	/* example of flushing css for base cpu stats
	 * css = container_of(cgrp, struct cgroup_subsys_state, cgroup);
	 * if (!css)
	 *	return 1;
	 * css_rstat_flush(css);
	 */

	/* get css for memcg stats */
	css = cgrp->subsys[memory_cgrp_id];
	if (!css)
		return 1;
	css_rstat_flush(css); /* <- confirm untrusted pointer arg error */
	...
>
> I attempted to implement this via BPF (code below), but it fails
> verification because deriving the css pointer through
> seq->private->kn->parent->priv results in an untrusted scalar that the
> verifier rejects for the kfunc call:
>
> R1 invalid mem access 'scalar'
>
> The verifier error occurs because:
> 1. seq->private is rdonly_untrusted_mem
> 2. Dereferencing through kernfs_node internals produces untracked pointers
> 3. css_rstat_flush() requires a trusted css pointer per its kfunc definition
>
> A direct userspace interface (memory.stat_refresh) avoids these verifier
> limitations and provides a cleaner, more maintainable solution that doesn't
> require BPF expertise or complex workarounds.
This is subjective. Now that I know more about your use case, and given
the critical decisions you mention, you should have a look at the work
being done on BPF OOM [0][1]. I think you would benefit from this
series. Specifically, for your case it provides the ability to flush a
memcg on demand and also fetch its stats.
[0] https://lore.kernel.org/all/20251027231727.472628-1-roman.gushchin@linux.dev/
[1] https://lore.kernel.org/all/20251027232206.473085-2-roman.gushchin@linux.dev/