Message-ID: <uxpsukgoj5y4ex2sj57ujxxcnu7siez2hslf7ftoy6liifv6v5@jzehpby6h2ps>
Date: Wed, 15 Oct 2025 13:46:04 -0700
From: Shakeel Butt <shakeel.butt@...ux.dev>
To: JP Kobryn <inwardvessel@...il.com>
Cc: andrii@...nel.org, ast@...nel.org, mkoutny@...e.com,
yosryahmed@...gle.com, hannes@...xchg.org, tj@...nel.org, akpm@...ux-foundation.org,
linux-kernel@...r.kernel.org, cgroups@...r.kernel.org, linux-mm@...ck.org, bpf@...r.kernel.org,
kernel-team@...a.com, mhocko@...nel.org, roman.gushchin@...ux.dev,
muchun.song@...ux.dev
Subject: Re: [PATCH v2 0/2] memcg: reading memcg stats more efficiently
Cc memcg maintainers.
On Wed, Oct 15, 2025 at 12:08:11PM -0700, JP Kobryn wrote:
> When reading cgroup memory.stat files there is significant kernel overhead
> in the formatting and encoding of numeric data into a string buffer. Beyond
> that, the given user mode program must decode this data and possibly
> perform filtering to obtain the desired stats. This process can be
> expensive for programs that periodically sample this data over a large
> enough fleet.
>
> As an alternative to reading memory.stat, introduce new kfuncs that allow
> fetching specific memcg stats from within cgroup iterator based bpf
> programs. This approach allows for numeric values to be transferred
> directly from the kernel to user mode via the mapped memory of the bpf
> program's elf data section. Reading stats this way effectively eliminates
> the numeric conversion work needed to be performed in both kernel and user
> mode. It also eliminates the need for filtering in a user mode program.
> That is, where reading memory.stat returns all stats, this new approach
> allows returning only select stats.
>
> An experiment was set up to compare the performance of a program using these
> new kfuncs vs a program that uses the traditional method of reading
> memory.stat. On the experimental side, a libbpf based program was written
> which sets up a link to the bpf program once in advance and then reuses
> this link to create and read from a bpf iterator program for 1M iterations.
I am getting a bit confused by the terminology. You mention a libbpf
program, a bpf program, and a link. Can you describe each of them? Think of
explaining this to someone with no bpf background.
(BTW Yonghong already explained these details to me, but I wanted the
commit message to be self-explanatory.)
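To make sure we are talking about the same things, my (possibly wrong) mental model of the setup, in libbpf terms, is below. The API names are real libbpf functions, but this is a sketch, not a complete program, and the program name "query" is assumed from the perf data:

```c
/* Sketch, not a complete program: error handling and the bpf program
 * itself are omitted. */
struct bpf_object *obj = bpf_object__open_file("memcgstat.bpf.o", NULL);
bpf_object__load(obj);

/* "bpf program": the iterator program inside the object, run by the kernel */
struct bpf_program *prog = bpf_object__find_program_by_name(obj, "query");

/* "link": attaches the program to the cgroup iterator; created once in
 * advance and then reused */
struct bpf_link *link = bpf_program__attach_iter(prog, NULL);

for (int i = 0; i < 1000000; i++) {
	/* each read triggers one iterator pass in the kernel */
	int iter_fd = bpf_iter_create(bpf_link__fd(link));
	char buf[16];
	while (read(iter_fd, buf, sizeof(buf)) > 0)
		;
	close(iter_fd);
	/* stats are then read from the mapped data section, not from buf */
}
```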
> Meanwhile on the control side, a program was written to open the root
> memory.stat file
How much activity was on the system? I imagine none because I don't see
flushing in the perf profile. This experiment focuses on the
non-flushing part of the memcg stats, which is fine.
> and repeatedly read 1M times from the associated file
> descriptor (while seeking back to zero before each subsequent read). Note
> that the program does not bother to decode or filter any data in user mode.
> The reason for this is because the experimental program completely removes
> the need for this work.
Hmm, in your experiment, does the control program do the decode and/or
filter or not? The last sentence in the above paragraph is confusing. Yes,
the experiment program does not need to do the parsing or decoding in
userspace, but the control program does need to do that. If your control
program is not doing it, then you are under-selling your work.
>
> The results showed a significant perf benefit on the experimental side,
> outperforming the control side by a margin of 80% elapsed time in kernel
> mode. The kernel overhead of numeric conversion on the control side is
> eliminated on the experimental side since the values are read directly
> through mapped memory of the bpf program. The experiment data is shown
> here:
>
> control: elapsed time
> real 0m13.062s
> user 0m0.147s
> sys 0m12.876s
>
> experiment: elapsed time
> real 0m2.717s
> user 0m0.175s
> sys 0m2.451s
These numbers are really awesome.
>
> control: perf data
> 22.23% a.out [kernel.kallsyms] [k] vsnprintf
> 18.83% a.out [kernel.kallsyms] [k] format_decode
> 12.05% a.out [kernel.kallsyms] [k] string
> 11.56% a.out [kernel.kallsyms] [k] number
> 7.71% a.out [kernel.kallsyms] [k] strlen
> 4.80% a.out [kernel.kallsyms] [k] memcpy_orig
> 4.67% a.out [kernel.kallsyms] [k] memory_stat_format
> 4.63% a.out [kernel.kallsyms] [k] seq_buf_printf
> 2.22% a.out [kernel.kallsyms] [k] widen_string
> 1.65% a.out [kernel.kallsyms] [k] put_dec_trunc8
> 0.95% a.out [kernel.kallsyms] [k] put_dec_full8
> 0.69% a.out [kernel.kallsyms] [k] put_dec
> 0.69% a.out [kernel.kallsyms] [k] memcpy
>
> experiment: perf data
> 10.04% memcgstat bpf_prog_.._query [k] bpf_prog_527781c811d5b45c_query
> 7.85% memcgstat [kernel.kallsyms] [k] memcg_node_stat_fetch
> 4.03% memcgstat [kernel.kallsyms] [k] __memcg_slab_post_alloc_hook
> 3.47% memcgstat [kernel.kallsyms] [k] _raw_spin_lock
> 2.58% memcgstat [kernel.kallsyms] [k] memcg_vm_event_fetch
> 2.58% memcgstat [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack
> 2.32% memcgstat [kernel.kallsyms] [k] kmem_cache_free
> 2.19% memcgstat [kernel.kallsyms] [k] __memcg_slab_free_hook
> 2.13% memcgstat [kernel.kallsyms] [k] mutex_lock
> 2.12% memcgstat [kernel.kallsyms] [k] get_page_from_freelist
>
> Aside from the perf gain, the kfunc/bpf approach provides flexibility in
> how memcg data can be delivered to a user mode program. As seen in the
> second patch which contains the selftests, it is possible to use a struct
> with select memory stat fields. But it is completely up to the programmer
> on how to lay out the data.
I remember you plan to convert a couple of open source programs to use this
new feature, I think below [1] and oomd [2]. Adding that information
would make your case even stronger. cAdvisor [3] is another open source
tool that can benefit from this work.
[1] https://github.com/facebookincubator/below
[2] https://github.com/facebookincubator/oomd
[3] https://github.com/google/cadvisor