Message-ID: <cover.1769157382.git.zhuhui@kylinos.cn>
Date: Fri, 23 Jan 2026 16:55:18 +0800
From: Hui Zhu <hui.zhu@...ux.dev>
To: Andrew Morton <akpm@...ux-foundation.org>,
Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...nel.org>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Shakeel Butt <shakeel.butt@...ux.dev>,
Muchun Song <muchun.song@...ux.dev>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <martin.lau@...ux.dev>,
Eduard Zingerman <eddyz87@...il.com>,
Song Liu <song@...nel.org>,
Yonghong Song <yonghong.song@...ux.dev>,
John Fastabend <john.fastabend@...il.com>,
KP Singh <kpsingh@...nel.org>,
Stanislav Fomichev <sdf@...ichev.me>,
Hao Luo <haoluo@...gle.com>,
Jiri Olsa <jolsa@...nel.org>,
Shuah Khan <shuah@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Miguel Ojeda <ojeda@...nel.org>,
Nathan Chancellor <nathan@...nel.org>,
Kees Cook <kees@...nel.org>,
Tejun Heo <tj@...nel.org>,
Jeff Xu <jeffxu@...omium.org>,
mkoutny@...e.com,
Jan Hendrik Farr <kernel@...rr.cc>,
Christian Brauner <brauner@...nel.org>,
Randy Dunlap <rdunlap@...radead.org>,
Brian Gerst <brgerst@...il.com>,
Masahiro Yamada <masahiroy@...nel.org>,
davem@...emloft.net,
Jakub Kicinski <kuba@...nel.org>,
Jesper Dangaard Brouer <hawk@...nel.org>,
JP Kobryn <inwardvessel@...il.com>,
Willem de Bruijn <willemb@...gle.com>,
Jason Xing <kerneljasonxing@...il.com>,
Paul Chaignon <paul.chaignon@...il.com>,
Anton Protopopov <a.s.protopopov@...il.com>,
Amery Hung <ameryhung@...il.com>,
Chen Ridong <chenridong@...weicloud.com>,
Lance Yang <lance.yang@...ux.dev>,
Jiayuan Chen <jiayuan.chen@...ux.dev>,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org,
cgroups@...r.kernel.org,
bpf@...r.kernel.org,
netdev@...r.kernel.org,
linux-kselftest@...r.kernel.org
Cc: Hui Zhu <zhuhui@...inos.cn>
Subject: [RFC PATCH bpf-next v3 00/12] mm: memcontrol: Add BPF hooks for memory controller
From: Hui Zhu <zhuhui@...inos.cn>
eBPF infrastructure provides rich visibility into system performance
metrics through various tracepoints and statistics.
This patch series introduces BPF struct_ops for the memory controller,
so that an eBPF program can help the system tune the memory controller
based on those metrics, improving the utilization of system memory
resources while ensuring memory limits are respected.
The following example illustrates how memcg eBPF can improve memory
utilization in some scenarios.
The example runs on an x86_64 QEMU guest (10 CPUs, 4 GB RAM), using a
file in tmpfs on the host as the swap device to reduce I/O impact.
root@...ntu:~# cat /proc/sys/vm/swappiness
60
This is the high-priority memcg.
root@...ntu:~# mkdir /sys/fs/cgroup/high
This is the low-priority memcg.
root@...ntu:~# mkdir /sys/fs/cgroup/low
root@...ntu:~# free
total used free shared buff/cache available
Mem: 4007276 392320 3684940 908 101476 3614956
Swap: 10485756 0 10485756
First, the following test uses memory.low to reduce the likelihood that
memory charged to the high-priority memcg is reclaimed.
root@...ntu:~# echo $((3 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/high/memory.low
root@...ntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep \
--vm-bytes $((3 * 1024 * 1024 * 1024)) \
--vm-method all --seed 2025 --metrics -t 60 \
& cgexec -g memory:high stress-ng --vm 4 --vm-keep \
--vm-bytes $((3 * 1024 * 1024 * 1024)) \
--vm-method all --seed 2025 --metrics -t 60
[1] 1176
stress-ng: info: [1177] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1176] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1177] dispatching hogs: 4 vm
stress-ng: info: [1176] dispatching hogs: 4 vm
stress-ng: metrc: [1177] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1177] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1177] vm 27047770 60.07 217.79 8.87 450289.91 119330.63 94.34 886936
stress-ng: info: [1177] skipped: 0
stress-ng: info: [1177] passed: 4: vm (4)
stress-ng: info: [1177] failed: 0
stress-ng: info: [1177] metrics untrustworthy: 0
stress-ng: info: [1177] successful run completed in 1 min, 0.07 secs
stress-ng: metrc: [1176] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1176] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1176] vm 679754 60.12 11.82 72.78 11307.18 8034.42 35.18 469884
stress-ng: info: [1176] skipped: 0
stress-ng: info: [1176] passed: 4: vm (4)
stress-ng: info: [1176] failed: 0
stress-ng: info: [1176] metrics untrustworthy: 0
stress-ng: info: [1176] successful run completed in 1 min, 0.13 secs
[1]+ Done cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes $((3 * 1024 * 1024 * 1024)) --vm-method all --seed 2025 --metrics -t 60
The following test keeps the same memory.low protection for the
high-priority memcg.
In this scenario, a Python script inside the high-priority memcg
simulates a low-load task: it allocates memory and then sleeps.
The Python script itself is not affected by memory reclamation (it is
idle after allocating its memory), but the performance of stress-ng in
the low-priority memcg is still degraded by the memory.low setting, even
though the protected memory is not being used.
root@...ntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep \
--vm-bytes $((3 * 1024 * 1024 * 1024)) \
--vm-method all --seed 2025 --metrics -t 60 \
& cgexec -g memory:high python3 -c \
"import time; a = bytearray(3*1024*1024*1024); time.sleep(62)"
[1] 1196
stress-ng: info: [1196] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1196] dispatching hogs: 4 vm
stress-ng: metrc: [1196] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1196] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1196] vm 886893 60.10 17.76 56.61 14756.92 11925.69 30.94 788676
stress-ng: info: [1196] skipped: 0
stress-ng: info: [1196] passed: 4: vm (4)
stress-ng: info: [1196] failed: 0
stress-ng: info: [1196] metrics untrustworthy: 0
stress-ng: info: [1196] successful run completed in 1 min, 0.10 secs
[1]+ Done cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes $((3 * 1024 * 1024 * 1024)) --vm-method all --seed 2025 --metrics -t 60
root@...ntu:~# echo 0 > /sys/fs/cgroup/high/memory.low
Now, switch to using the memcg eBPF program for memory priority
control.
memcg is a test program added to samples/bpf by this patch series; it
loads memcg.bpf.c into the kernel.
memcg.bpf.c monitors PGFAULT events in the high-priority memory cgroup.
When the number of events within one second exceeds a predefined
threshold, the eBPF hooks for that memory cgroup enforce their control
for the next second.
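For illustration, a minimal sketch of the BPF side of that idea follows.
This is not the memcg.bpf.c shipped in this series: the below_min hook
signature, the cgroup-id lookup via CO-RE and the control flag driven
from user space are all assumptions.

/* Minimal sketch only, not the memcg.bpf.c from this series.  The
 * below_min signature, the cgroup-id lookup and the user-space-driven
 * control flag are assumptions. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

/* Set by the loader while the PGFAULT rate of the high-priority group
 * exceeds the threshold; cleared one second later. */
int protect_high;
/* cgroup id of the high-priority group, filled in by the loader. */
__u64 high_cgrp_id;

SEC("struct_ops/below_min")
bool BPF_PROG(sketch_below_min, struct mem_cgroup *memcg)
{
        __u64 id = BPF_CORE_READ(memcg, css.cgroup, kn, id);

        /* Claim memory.min-style protection only for the high-priority
         * cgroup, and only while the fault-rate threshold is exceeded. */
        return protect_high && id == high_cgrp_id;
}

SEC(".struct_ops.link")
struct memcg_bpf_ops sketch_ops = {
        .below_min = (void *)sketch_below_min,
};

char LICENSE[] SEC("license") = "GPL";
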
The following command configures the high-priority memory cgroup to be
reported as below its memory.min protection during memory reclamation
whenever it triggers more than one PGFAULT event per second.
root@...ntu:~# ./memcg --low_path=/sys/fs/cgroup/low \
--high_path=/sys/fs/cgroup/high \
--threshold=1 --use_below_min
Successfully attached!
root@...ntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep \
--vm-bytes $((3 * 1024 * 1024 * 1024)) \
--vm-method all --seed 2025 --metrics -t 60 \
& cgexec -g memory:high stress-ng --vm 4 --vm-keep \
--vm-bytes $((3 * 1024 * 1024 * 1024)) \
--vm-method all --seed 2025 --metrics -t 60
[1] 1220
stress-ng: info: [1220] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1221] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1220] dispatching hogs: 4 vm
stress-ng: info: [1221] dispatching hogs: 4 vm
stress-ng: metrc: [1221] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1221] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1221] vm 24295240 60.08 221.36 7.64 404392.49 106095.60 95.29 886684
stress-ng: info: [1221] skipped: 0
stress-ng: info: [1221] passed: 4: vm (4)
stress-ng: info: [1221] failed: 0
stress-ng: info: [1221] metrics untrustworthy: 0
stress-ng: info: [1221] successful run completed in 1 min, 0.11 secs
stress-ng: metrc: [1220] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1220] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1220] vm 685732 60.13 11.69 75.98 11403.88 7822.30 36.45 496496
stress-ng: info: [1220] skipped: 0
stress-ng: info: [1220] passed: 4: vm (4)
stress-ng: info: [1220] failed: 0
stress-ng: info: [1220] metrics untrustworthy: 0
stress-ng: info: [1220] successful run completed in 1 min, 0.14 secs
[1]+ Done cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes $((3 * 1024 * 1024 * 1024)) --vm-method all --seed 2025 --metrics -t 60
This test demonstrates that, because the Python process within the
high-priority memory cgroup sleeps after allocating its memory, no
further page fault events occur and the below_min override is never
activated.
As a result, the stress-ng processes in the low-priority memory cgroup
achieve normal memory performance.
root@...ntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep \
--vm-bytes $((3 * 1024 * 1024 * 1024)) \
--vm-method all --seed 2025 --metrics -t 60 \
& cgexec -g memory:high python3 -c \
"import time; a = bytearray(3*1024*1024*1024); time.sleep(62)"
[1] 1238
stress-ng: info: [1238] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1238] dispatching hogs: 4 vm
stress-ng: metrc: [1238] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1238] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1238] vm 33107485 60.08 205.41 13.19 551082.91 151448.44 90.97 886064
stress-ng: info: [1238] skipped: 0
stress-ng: info: [1238] passed: 4: vm (4)
stress-ng: info: [1238] failed: 0
stress-ng: info: [1238] metrics untrustworthy: 0
stress-ng: info: [1238] successful run completed in 1 min, 0.09 secs
In this patch series, I've incorporated a portion of Roman's patches
from [1] so that the entire series compiles cleanly on bpf-next.
I made some modifications to bpf_struct_ops_link_create
in "bpf: Pass flags in bpf_link_create for struct_ops" and
"libbpf: Support passing user-defined flags for struct_ops" to allow
the flags parameter to be passed into the kernel.
With this change, patch "mm/bpf: Add BPF_F_ALLOW_OVERRIDE support for
memcg_bpf_ops" enables BPF_F_ALLOW_OVERRIDE support for memcg_bpf_ops.
Patch "mm: memcontrol: Add BPF struct_ops for memory controller"
introduces BPF struct_ops support to the memory controller, enabling
custom and dynamic control over memory pressure. This is achieved
through a new struct_ops type, `memcg_bpf_ops`.
The `memcg_bpf_ops` struct provides the following hooks (a hedged sketch
of the ops table follows this list):
- `get_high_delay_ms`: Returns a custom throttling delay in
milliseconds for a cgroup that has breached its `memory.high`
limit. This is the primary mechanism for BPF-driven throttling.
- `below_low`: Overrides the `memory.low` protection check. If this
hook returns true, the cgroup is considered to be protected by its
`memory.low` setting, regardless of its actual usage.
- `below_min`: Similar to `below_low`, this overrides the `memory.min`
protection check.
- `handle_cgroup_online`/`offline`: Callbacks invoked when a cgroup
with an attached program comes online or goes offline, allowing for
state management.
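To make the shape of the interface concrete, a sketch of the ops table
is shown below; the argument lists are assumptions, and the
authoritative definition lives in the memcontrol.h changes of this
series.

/* Sketch only; argument lists are assumed. */
struct memcg_bpf_ops {
        /* Extra throttling delay (in ms) for a cgroup over memory.high. */
        unsigned int (*get_high_delay_ms)(struct mem_cgroup *memcg);
        /* If true, treat the cgroup as protected by memory.low. */
        bool (*below_low)(struct mem_cgroup *memcg);
        /* If true, treat the cgroup as protected by memory.min. */
        bool (*below_min)(struct mem_cgroup *memcg);
        /* Lifecycle notifications for cgroups covered by this ops. */
        void (*handle_cgroup_online)(struct mem_cgroup *memcg);
        void (*handle_cgroup_offline)(struct mem_cgroup *memcg);
};
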
Patch "samples/bpf: Add memcg priority control example" introduces
the programs memcg.c and memcg.bpf.c that were used in the previous
examples.
Changelog:
v3:
According to the comments of Michal Koutný and Chen Ridong, updated the
hooks to get_high_delay_ms, below_low, below_min, handle_cgroup_online
and handle_cgroup_offline.
According to the comments of Michal Koutný, added BPF_F_ALLOW_OVERRIDE
support to memcg_bpf_ops.
v2:
According to the comments of Tejun Heo, rebased on Roman Gushchin's BPF
OOM patch series [1] and added hierarchical delegation support.
According to the comments of Roman Gushchin and Michal Hocko, designed
concrete use-case scenarios and provided test results.
[1] https://lore.kernel.org/lkml/20251027231727.472628-1-roman.gushchin@linux.dev/
Hui Zhu (7):
bpf: Pass flags in bpf_link_create for struct_ops
libbpf: Support passing user-defined flags for struct_ops
mm: memcontrol: Add BPF struct_ops for memory controller
selftests/bpf: Add tests for memcg_bpf_ops
mm/bpf: Add BPF_F_ALLOW_OVERRIDE support for memcg_bpf_ops
selftests/bpf: Add test for memcg_bpf_ops hierarchies
samples/bpf: Add memcg priority control example
Roman Gushchin (5):
bpf: move bpf_struct_ops_link into bpf.h
bpf: initial support for attaching struct ops to cgroups
bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG
libbpf: introduce bpf_map__attach_struct_ops_opts()
MAINTAINERS | 4 +
include/linux/bpf.h | 8 +
include/linux/memcontrol.h | 111 +++-
kernel/bpf/bpf_struct_ops.c | 20 +-
kernel/bpf/verifier.c | 5 +
mm/bpf_memcontrol.c | 274 +++++++-
mm/memcontrol.c | 34 +-
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 9 +-
samples/bpf/memcg.bpf.c | 129 ++++
samples/bpf/memcg.c | 327 ++++++++++
tools/include/uapi/linux/bpf.h | 2 +-
tools/lib/bpf/bpf.c | 8 +
tools/lib/bpf/libbpf.c | 19 +-
tools/lib/bpf/libbpf.h | 14 +
tools/lib/bpf/libbpf.map | 2 +-
.../selftests/bpf/prog_tests/memcg_ops.c | 606 ++++++++++++++++++
tools/testing/selftests/bpf/progs/memcg_ops.c | 129 ++++
18 files changed, 1674 insertions(+), 28 deletions(-)
create mode 100644 samples/bpf/memcg.bpf.c
create mode 100644 samples/bpf/memcg.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops.c
--
2.43.0