Message-ID: <20251015190813.80163-1-inwardvessel@gmail.com>
Date: Wed, 15 Oct 2025 12:08:11 -0700
From: JP Kobryn <inwardvessel@...il.com>
To: shakeel.butt@...ux.dev,
andrii@...nel.org,
ast@...nel.org,
mkoutny@...e.com,
yosryahmed@...gle.com,
hannes@...xchg.org,
tj@...nel.org,
akpm@...ux-foundation.org
Cc: linux-kernel@...r.kernel.org,
cgroups@...r.kernel.org,
linux-mm@...ck.org,
bpf@...r.kernel.org,
kernel-team@...a.com
Subject: [PATCH v2 0/2] memcg: reading memcg stats more efficiently
When reading cgroup memory.stat files there is significant kernel overhead
in the formatting and encoding of numeric data into a string buffer. Beyond
that, the given user mode program must decode this data and possibly
perform filtering to obtain the desired stats. This process can be
expensive for programs that periodically sample this data over a large
enough fleet.
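
To make the user mode cost concrete, here is a minimal sketch of the decode
and filter work a sampler has to do per stat when consuming memory.stat as
text. The helper name stat_lookup() and the sample values are hypothetical
and not part of this series; the sketch only assumes the flat "name value"
per-line layout of memory.stat.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: look up one "name value" line in a memory.stat-style
 * text buffer. Returns 0 and stores the decoded value on success, or -1 if
 * the key is not present.
 */
int stat_lookup(const char *buf, const char *key, unsigned long long *val)
{
	const char *line;
	size_t klen;

	assert(buf && key && val);
	klen = strlen(key);

	for (line = buf; *line; ) {
		const char *eol = strchr(line, '\n');
		size_t len = eol ? (size_t)(eol - line) : strlen(line);

		/* Filtering step: match "key " at the start of the line. */
		if (len > klen && !strncmp(line, key, klen) && line[klen] == ' ') {
			/* Conversion step: decode the text back into an integer. */
			*val = strtoull(line + klen + 1, NULL, 10);
			return 0;
		}
		/* Advance past this line and its newline, if any. */
		line += len + (eol ? 1 : 0);
	}
	return -1;
}
```

Every sampled stat pays for a line scan, a string compare and a
text-to-integer conversion, on top of the formatting already done in the
kernel.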
As an alternative to reading memory.stat, introduce new kfuncs that allow
fetching specific memcg stats from within cgroup iterator based bpf
programs. This approach allows for numeric values to be transferred
directly from the kernel to user mode via the mapped memory of the bpf
program's ELF data section. Reading stats this way eliminates the numeric
conversion work otherwise performed in both kernel and user mode. It also
eliminates the need for filtering in a user mode program: where reading
memory.stat returns all stats, this approach returns only the selected
stats.
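
As a rough user space analogy of the consumer side (not the bpf program or
the kfuncs themselves), the sketch below maps a shared region and reads
typed fields straight out of it. The anonymous mapping stands in for the
mapped data section, and struct memcg_sample, sample_map(), sample_fill(),
sample_total() and sample_unmap() are all hypothetical names chosen for
illustration.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* for MAP_ANONYMOUS on glibc in strict modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical layout: only the stats the consumer cares about. */
struct memcg_sample {
	unsigned long anon;
	unsigned long file;
	unsigned long pgfault;
};

/* Map a shared region standing in for the program's mapped data section. */
struct memcg_sample *sample_map(void)
{
	void *p = mmap(NULL, sizeof(struct memcg_sample),
		       PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* Stand-in for the producer (the bpf program in the real design). */
void sample_fill(struct memcg_sample *s, unsigned long anon,
		 unsigned long file, unsigned long pgfault)
{
	assert(s);
	s->anon = anon;
	s->file = file;
	s->pgfault = pgfault;
}

/* Consumer: fields are already integers, so nothing is parsed or filtered. */
unsigned long sample_total(const struct memcg_sample *s)
{
	assert(s);
	return s->anon + s->file;
}

void sample_unmap(struct memcg_sample *s)
{
	munmap(s, sizeof(*s));
}
```

The consumer side reduces to plain loads from the mapping, which is the
property the kfunc/bpf approach relies on.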
An experiment was set up to compare the performance of a program using these
new kfuncs vs a program that uses the traditional method of reading
memory.stat. On the experimental side, a libbpf based program was written
which sets up a link to the bpf program once in advance and then reuses
this link to create and read from a bpf iterator program for 1M iterations.
Meanwhile on the control side, a program was written to open the root
memory.stat file and repeatedly read 1M times from the associated file
descriptor (while seeking back to zero before each subsequent read). Note
that the control program does not decode or filter any data in user mode,
because the experimental program removes the need for this work entirely.
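
The control side loop can be sketched roughly as follows. This is not the
actual benchmark source; read_repeatedly() is a hypothetical name, and
draining the file with a read() loop on each pass is an assumption about
how a full read was performed.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Read the file at `path` from offset zero `iters` times through a single
 * file descriptor. Returns the total number of bytes read, or -1 on error.
 */
long long read_repeatedly(const char *path, long iters)
{
	char buf[65536];
	long long total = 0;
	int fd;

	assert(path);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	for (long i = 0; i < iters; i++) {
		ssize_t n;

		/* Seek back to zero so the same fd can be read again. */
		if (lseek(fd, 0, SEEK_SET) < 0) {
			close(fd);
			return -1;
		}
		/* Drain the file; the data is neither decoded nor filtered. */
		while ((n = read(fd, buf, sizeof(buf))) > 0)
			total += n;
		if (n < 0) {
			close(fd);
			return -1;
		}
	}
	close(fd);
	return total;
}
```

Under these assumptions, the control run would correspond to calling
read_repeatedly() on the root memory.stat path with iters set to 1000000.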
The results showed a significant perf benefit on the experimental side,
which spent roughly 80% less time in kernel mode than the control side
(sys 2.451s vs 12.876s). The kernel overhead of numeric conversion on the
control side is eliminated on the experimental side since the values are
read directly through mapped memory of the bpf program. The experiment data
is shown here:

control: elapsed time
  real    0m13.062s
  user    0m0.147s
  sys     0m12.876s

experiment: elapsed time
  real    0m2.717s
  user    0m0.175s
  sys     0m2.451s
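
For reference, the relative improvement can be derived from the elapsed
times with simple arithmetic. The helper name pct_reduction() below is
hypothetical and only illustrates the calculation.

```c
#include <assert.h>

/* Percentage reduction going from `before` to `after` (both in seconds). */
double pct_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}
```

Applied to the figures above, pct_reduction(12.876, 2.451) gives roughly 81%
for sys time and pct_reduction(13.062, 2.717) gives roughly 79% for real
time.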

control: perf data
  22.23%  a.out  [kernel.kallsyms]  [k] vsnprintf
  18.83%  a.out  [kernel.kallsyms]  [k] format_decode
  12.05%  a.out  [kernel.kallsyms]  [k] string
  11.56%  a.out  [kernel.kallsyms]  [k] number
   7.71%  a.out  [kernel.kallsyms]  [k] strlen
   4.80%  a.out  [kernel.kallsyms]  [k] memcpy_orig
   4.67%  a.out  [kernel.kallsyms]  [k] memory_stat_format
   4.63%  a.out  [kernel.kallsyms]  [k] seq_buf_printf
   2.22%  a.out  [kernel.kallsyms]  [k] widen_string
   1.65%  a.out  [kernel.kallsyms]  [k] put_dec_trunc8
   0.95%  a.out  [kernel.kallsyms]  [k] put_dec_full8
   0.69%  a.out  [kernel.kallsyms]  [k] put_dec
   0.69%  a.out  [kernel.kallsyms]  [k] memcpy

experiment: perf data
  10.04%  memcgstat  bpf_prog_.._query  [k] bpf_prog_527781c811d5b45c_query
   7.85%  memcgstat  [kernel.kallsyms]  [k] memcg_node_stat_fetch
   4.03%  memcgstat  [kernel.kallsyms]  [k] __memcg_slab_post_alloc_hook
   3.47%  memcgstat  [kernel.kallsyms]  [k] _raw_spin_lock
   2.58%  memcgstat  [kernel.kallsyms]  [k] memcg_vm_event_fetch
   2.58%  memcgstat  [kernel.kallsyms]  [k] entry_SYSRETQ_unsafe_stack
   2.32%  memcgstat  [kernel.kallsyms]  [k] kmem_cache_free
   2.19%  memcgstat  [kernel.kallsyms]  [k] __memcg_slab_free_hook
   2.13%  memcgstat  [kernel.kallsyms]  [k] mutex_lock
   2.12%  memcgstat  [kernel.kallsyms]  [k] get_page_from_freelist

Aside from the perf gain, the kfunc/bpf approach provides flexibility in
how memcg data can be delivered to a user mode program. As seen in the
second patch, which contains the selftests, it is possible to use a struct
with select memory stat fields, but how the data is laid out is entirely up
to the programmer.
JP Kobryn (2):
memcg: introduce kfuncs for fetching memcg stats
memcg: selftests for memcg stat kfuncs
mm/memcontrol.c | 67 ++++
.../testing/selftests/bpf/cgroup_iter_memcg.h | 18 ++
.../bpf/prog_tests/cgroup_iter_memcg.c | 294 ++++++++++++++++++
.../selftests/bpf/progs/cgroup_iter_memcg.c | 61 ++++
4 files changed, 440 insertions(+)
create mode 100644 tools/testing/selftests/bpf/cgroup_iter_memcg.h
create mode 100644 tools/testing/selftests/bpf/prog_tests/cgroup_iter_memcg.c
create mode 100644 tools/testing/selftests/bpf/progs/cgroup_iter_memcg.c
--
2.47.3