Message-ID: <e102f50a-efa5-49b9-927a-506b7353bac0@gmail.com>
Date: Wed, 15 Oct 2025 17:21:46 -0700
From: JP Kobryn <inwardvessel@...il.com>
To: Shakeel Butt <shakeel.butt@...ux.dev>
Cc: andrii@...nel.org, ast@...nel.org, mkoutny@...e.com,
yosryahmed@...gle.com, hannes@...xchg.org, tj@...nel.org,
akpm@...ux-foundation.org, linux-kernel@...r.kernel.org,
cgroups@...r.kernel.org, linux-mm@...ck.org, bpf@...r.kernel.org,
kernel-team@...a.com, mhocko@...nel.org, roman.gushchin@...ux.dev,
muchun.song@...ux.dev
Subject: Re: [PATCH v2 0/2] memcg: reading memcg stats more efficiently
On 10/15/25 1:46 PM, Shakeel Butt wrote:
> Cc memcg maintainers.
>
> On Wed, Oct 15, 2025 at 12:08:11PM -0700, JP Kobryn wrote:
>> When reading cgroup memory.stat files, there is significant kernel
>> overhead in formatting and encoding numeric data into a string buffer.
>> Beyond that, the user mode program must decode this data and possibly
>> filter it to obtain the desired stats. This process can be expensive
>> for programs that periodically sample this data across a large fleet.
>>
>> As an alternative to reading memory.stat, introduce new kfuncs that
>> allow fetching specific memcg stats from within cgroup iterator based
>> bpf programs. This approach allows numeric values to be transferred
>> directly from the kernel to user mode via the mapped memory of the bpf
>> program's ELF data section. Reading stats this way eliminates the
>> numeric conversion work otherwise performed in both kernel and user
>> mode. It also eliminates the need for filtering in a user mode
>> program: where reading memory.stat returns all stats, this new
>> approach allows returning only select stats.
>>
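>> For illustration, the bpf side looks roughly like the sketch below.
>> The kfunc name is taken from the perf profile further down; its
>> signature, the struct, and the stat indices are simplified
>> placeholders here, so see the patches for the exact interface:
>>
>> #include "vmlinux.h"
>> #include <bpf/bpf_helpers.h>
>>
>> char _license[] SEC("license") = "GPL";
>>
>> struct memcg_stats {
>>     u64 anon;
>>     u64 file;
>> };
>>
>> /* zero-initialized global: lands in .bss, mapped into user mode */
>> struct memcg_stats stats;
>>
>> /* kfunc from this series; signature assumed for illustration */
>> extern u64 memcg_node_stat_fetch(struct cgroup *cgrp, int idx) __ksym;
>>
>> SEC("iter/cgroup")
>> int query(struct bpf_iter__cgroup *ctx)
>> {
>>     struct cgroup *cgrp = ctx->cgroup;
>>
>>     /* a NULL cgroup signals the end of iteration */
>>     if (!cgrp)
>>         return 0;
>>
>>     stats.anon = memcg_node_stat_fetch(cgrp, NR_ANON_MAPPED);
>>     stats.file = memcg_node_stat_fetch(cgrp, NR_FILE_PAGES);
>>     return 0;
>> }
>>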
>> An experiment was set up to compare the performance of a program using
>> these new kfuncs vs a program that uses the traditional method of
>> reading memory.stat. On the experimental side, a libbpf based program
>> was written which sets up a link to the bpf program once in advance
>> and then reuses this link to create and read from a bpf iterator for
>> 1M iterations.
>
> I am getting a bit confused by the terminology. You mentioned a libbpf
> program, a bpf program, and a link. Can you describe each of them?
> Think of explaining this to someone with no bpf background.
>
> (BTW Yonghong already explained these details to me but I wanted the
> commit message to be self-explanatory).
No problem. I'll try to expand on those terms in v3.
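In short: the "bpf program" is the code that is loaded into and run by
the kernel (here, an iterator program invoked per cgroup); "libbpf" is
the user space library used to load and attach it; and a "link" is the
kernel object representing that attachment, which we create once and
reuse. Roughly, the user mode side looks like this (error handling
trimmed; the skeleton and struct names below are placeholders, not
necessarily what the selftest uses):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include "memcgstat.skel.h" /* skeleton generated by bpftool */

int main(void)
{
    union bpf_iter_link_info linfo = {};
    LIBBPF_OPTS(bpf_iter_attach_opts, opts);
    struct memcgstat_bpf *skel;
    struct bpf_link *link;
    unsigned long long total = 0;
    int cgrp_fd, iter_fd;
    char c;

    skel = memcgstat_bpf__open_and_load();

    cgrp_fd = open("/sys/fs/cgroup", O_RDONLY);
    linfo.cgroup.cgroup_fd = cgrp_fd;
    linfo.cgroup.order = BPF_CGROUP_ITER_SELF_ONLY;
    opts.link_info = &linfo;
    opts.link_info_len = sizeof(linfo);

    /* attach once; the link is reused for every sample below */
    link = bpf_program__attach_iter(skel->progs.query, &opts);

    for (int i = 0; i < 1000000; i++) {
        /* each read of the iterator fd runs the bpf program */
        iter_fd = bpf_iter_create(bpf_link__fd(link));
        while (read(iter_fd, &c, sizeof(c)) > 0)
            ;
        close(iter_fd);

        /* values arrive in binary via the mapped data section */
        total += skel->bss->stats.anon;
    }

    printf("%llu\n", total);
    bpf_link__destroy(link);
    memcgstat_bpf__destroy(skel);
    return 0;
}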
>
>> Meanwhile on the control side, a program was written to open the root
>> memory.stat file
>
> How much activity was on the system? I imagine none, because I don't
> see flushing in the perf profile. This experiment focuses on the
> non-flushing part of the memcg stats, which is fine.
Right, at the time there was no custom workload running alongside the
tests.
>
>> and repeatedly read 1M times from the associated file
>> descriptor (while seeking back to zero before each subsequent read).
>> Note that the program does not bother to decode or filter any data in
>> user mode. This is because the experimental program completely removes
>> the need for this work.
>
> Hmm, in your experiment, is the control program doing the decode
> and/or filter or not? The last sentence in the above para is
> confusing. Yes, the experiment program does not need to do the parsing
> or decoding in userspace, but the control program needs to do that. If
> your control program is not doing it, then you are under-selling your
> work.
The control does not perform decoding. But it's a good point. Let me add
decoding to the control side in v3.
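For reference, the v3 control side would look roughly like this
(a sketch only; buffer sizing and error handling elided):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[16384], name[64];
    unsigned long long val, total = 0;
    int fd = open("/sys/fs/cgroup/memory.stat", O_RDONLY);

    for (int i = 0; i < 1000000; i++) {
        lseek(fd, 0, SEEK_SET);
        ssize_t len = read(fd, buf, sizeof(buf) - 1);
        buf[len] = '\0';

        /* decode: each line is "<name> <value>" */
        for (char *line = strtok(buf, "\n"); line;
             line = strtok(NULL, "\n")) {
            if (sscanf(line, "%63s %llu", name, &val) == 2 &&
                !strcmp(name, "anon"))
                total += val; /* filter: keep only "anon" */
        }
    }

    printf("%llu\n", total);
    close(fd);
    return 0;
}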
>
>>
>> The results showed a significant perf benefit on the experimental
>> side, which spent roughly 80% less elapsed time in kernel mode than
>> the control side. The kernel overhead of numeric conversion on the
>> control side is eliminated on the experimental side, since the values
>> are read directly through the mapped memory of the bpf program. The
>> experiment data is shown here:
>>
>> control: elapsed time
>> real 0m13.062s
>> user 0m0.147s
>> sys 0m12.876s
>>
>> experiment: elapsed time
>> real 0m2.717s
>> user 0m0.175s
>> sys 0m2.451s
>
> These numbers are really awesome.
:)
>
>>
>> control: perf data
>> 22.23% a.out [kernel.kallsyms] [k] vsnprintf
>> 18.83% a.out [kernel.kallsyms] [k] format_decode
>> 12.05% a.out [kernel.kallsyms] [k] string
>> 11.56% a.out [kernel.kallsyms] [k] number
>> 7.71% a.out [kernel.kallsyms] [k] strlen
>> 4.80% a.out [kernel.kallsyms] [k] memcpy_orig
>> 4.67% a.out [kernel.kallsyms] [k] memory_stat_format
>> 4.63% a.out [kernel.kallsyms] [k] seq_buf_printf
>> 2.22% a.out [kernel.kallsyms] [k] widen_string
>> 1.65% a.out [kernel.kallsyms] [k] put_dec_trunc8
>> 0.95% a.out [kernel.kallsyms] [k] put_dec_full8
>> 0.69% a.out [kernel.kallsyms] [k] put_dec
>> 0.69% a.out [kernel.kallsyms] [k] memcpy
>>
>> experiment: perf data
>> 10.04% memcgstat bpf_prog_.._query [k] bpf_prog_527781c811d5b45c_query
>> 7.85% memcgstat [kernel.kallsyms] [k] memcg_node_stat_fetch
>> 4.03% memcgstat [kernel.kallsyms] [k] __memcg_slab_post_alloc_hook
>> 3.47% memcgstat [kernel.kallsyms] [k] _raw_spin_lock
>> 2.58% memcgstat [kernel.kallsyms] [k] memcg_vm_event_fetch
>> 2.58% memcgstat [kernel.kallsyms] [k] entry_SYSRETQ_unsafe_stack
>> 2.32% memcgstat [kernel.kallsyms] [k] kmem_cache_free
>> 2.19% memcgstat [kernel.kallsyms] [k] __memcg_slab_free_hook
>> 2.13% memcgstat [kernel.kallsyms] [k] mutex_lock
>> 2.12% memcgstat [kernel.kallsyms] [k] get_page_from_freelist
>>
>> Aside from the perf gain, the kfunc/bpf approach provides flexibility
>> in how memcg data can be delivered to a user mode program. As seen in
>> the second patch, which contains the selftests, it is possible to use
>> a struct with select memory stat fields. But it is entirely up to the
>> programmer how to lay out the data.
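>>
>> For example, a flat array indexed by ids of the program's own
>> choosing would work just as well as a struct (hypothetical layout):
>>
>> enum { STAT_ANON, STAT_FILE, NR_SAMPLED_STATS };
>> u64 stat_vals[NR_SAMPLED_STATS]; /* in .bss, one slot per stat */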
>
> I remember you plan to convert a couple of open source programs to use
> this new feature; I think below [1] and oomd [2]. Adding that
> information would make your case even stronger. cAdvisor [3] is
> another open source tool which could benefit from this work.
That is accurate, thanks. Will include in v3.
>
> [1] https://github.com/facebookincubator/below
> [2] https://github.com/facebookincubator/oomd
> [3] https://github.com/google/cadvisor
>