Message-ID: <cover.1769157382.git.zhuhui@kylinos.cn>
Date: Fri, 23 Jan 2026 16:55:18 +0800
From: Hui Zhu <hui.zhu@...ux.dev>
To: Andrew Morton <akpm@...ux-foundation.org>,
Johannes Weiner <hannes@...xchg.org>,
Michal Hocko <mhocko@...nel.org>,
Roman Gushchin <roman.gushchin@...ux.dev>,
Shakeel Butt <shakeel.butt@...ux.dev>,
Muchun Song <muchun.song@...ux.dev>,
Alexei Starovoitov <ast@...nel.org>,
Daniel Borkmann <daniel@...earbox.net>,
Andrii Nakryiko <andrii@...nel.org>,
Martin KaFai Lau <martin.lau@...ux.dev>,
Eduard Zingerman <eddyz87@...il.com>,
Song Liu <song@...nel.org>,
Yonghong Song <yonghong.song@...ux.dev>,
John Fastabend <john.fastabend@...il.com>,
KP Singh <kpsingh@...nel.org>,
Stanislav Fomichev <sdf@...ichev.me>,
Hao Luo <haoluo@...gle.com>,
Jiri Olsa <jolsa@...nel.org>,
Shuah Khan <shuah@...nel.org>,
Peter Zijlstra <peterz@...radead.org>,
Miguel Ojeda <ojeda@...nel.org>,
Nathan Chancellor <nathan@...nel.org>,
Kees Cook <kees@...nel.org>,
Tejun Heo <tj@...nel.org>,
Jeff Xu <jeffxu@...omium.org>,
mkoutny@...e.com,
Jan Hendrik Farr <kernel@...rr.cc>,
Christian Brauner <brauner@...nel.org>,
Randy Dunlap <rdunlap@...radead.org>,
Brian Gerst <brgerst@...il.com>,
Masahiro Yamada <masahiroy@...nel.org>,
davem@...emloft.net,
Jakub Kicinski <kuba@...nel.org>,
Jesper Dangaard Brouer <hawk@...nel.org>,
JP Kobryn <inwardvessel@...il.com>,
Willem de Bruijn <willemb@...gle.com>,
Jason Xing <kerneljasonxing@...il.com>,
Paul Chaignon <paul.chaignon@...il.com>,
Anton Protopopov <a.s.protopopov@...il.com>,
Amery Hung <ameryhung@...il.com>,
Chen Ridong <chenridong@...weicloud.com>,
Lance Yang <lance.yang@...ux.dev>,
Jiayuan Chen <jiayuan.chen@...ux.dev>,
linux-kernel@...r.kernel.org,
linux-mm@...ck.org,
cgroups@...r.kernel.org,
bpf@...r.kernel.org,
netdev@...r.kernel.org,
linux-kselftest@...r.kernel.org
Cc: Hui Zhu <zhuhui@...inos.cn>
Subject: [RFC PATCH bpf-next v3 00/12] mm: memcontrol: Add BPF hooks for memory controller
From: Hui Zhu <zhuhui@...inos.cn>
eBPF infrastructure provides rich visibility into system performance
metrics through various tracepoints and statistics.
This patch series introduces BPF struct_ops for the memory controller,
so that an eBPF program can help the system tune the memory controller
based on those metrics, improving the utilization of system memory
resources while ensuring memory limits are respected.
The following example illustrates how memcg eBPF can improve memory
utilization in some scenarios.
The example runs on an x86_64 QEMU guest (10 CPUs, 4 GB RAM), using a
file in tmpfs on the host as the swap device to reduce I/O impact.
root@...ntu:~# cat /proc/sys/vm/swappiness
60
This is the high-priority memcg.
root@...ntu:~# mkdir /sys/fs/cgroup/high
This is the low-priority memcg.
root@...ntu:~# mkdir /sys/fs/cgroup/low
root@...ntu:~# free
total used free shared buff/cache available
Mem: 4007276 392320 3684940 908 101476 3614956
Swap: 10485756 0 10485756
First, the following test uses memory.low to reduce the likelihood that
memory charged to the high-priority memcg is reclaimed.
root@...ntu:~# echo $((3 * 1024 * 1024 * 1024)) > /sys/fs/cgroup/high/memory.low
root@...ntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep \
--vm-bytes $((3 * 1024 * 1024 * 1024)) \
--vm-method all --seed 2025 --metrics -t 60 \
& cgexec -g memory:high stress-ng --vm 4 --vm-keep \
--vm-bytes $((3 * 1024 * 1024 * 1024)) \
--vm-method all --seed 2025 --metrics -t 60
[1] 1176
stress-ng: info: [1177] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1176] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1177] dispatching hogs: 4 vm
stress-ng: info: [1176] dispatching hogs: 4 vm
stress-ng: metrc: [1177] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1177] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1177] vm 27047770 60.07 217.79 8.87 450289.91 119330.63 94.34 886936
stress-ng: info: [1177] skipped: 0
stress-ng: info: [1177] passed: 4: vm (4)
stress-ng: info: [1177] failed: 0
stress-ng: info: [1177] metrics untrustworthy: 0
stress-ng: info: [1177] successful run completed in 1 min, 0.07 secs
stress-ng: metrc: [1176] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1176] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1176] vm 679754 60.12 11.82 72.78 11307.18 8034.42 35.18 469884
stress-ng: info: [1176] skipped: 0
stress-ng: info: [1176] passed: 4: vm (4)
stress-ng: info: [1176] failed: 0
stress-ng: info: [1176] metrics untrustworthy: 0
stress-ng: info: [1176] successful run completed in 1 min, 0.13 secs
[1]+ Done cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes $((3 * 1024 * 1024 * 1024)) --vm-method all --seed 2025 --metrics -t 60
The following test keeps the same memory.low protection for the
high-priority memcg.
In this scenario, a Python script inside the high-priority memcg
simulates a low-load task: it allocates memory and then sleeps.
The Python script itself is not affected by memory reclamation (it is
idle after allocating its memory), but the performance of stress-ng in
the low-priority memcg is still degraded by the memory.low setting, even
though the protected memory is not being used.
root@...ntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep \
--vm-bytes $((3 * 1024 * 1024 * 1024)) \
--vm-method all --seed 2025 --metrics -t 60 \
& cgexec -g memory:high python3 -c \
"import time; a = bytearray(3*1024*1024*1024); time.sleep(62)"
[1] 1196
stress-ng: info: [1196] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1196] dispatching hogs: 4 vm
stress-ng: metrc: [1196] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1196] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1196] vm 886893 60.10 17.76 56.61 14756.92 11925.69 30.94 788676
stress-ng: info: [1196] skipped: 0
stress-ng: info: [1196] passed: 4: vm (4)
stress-ng: info: [1196] failed: 0
stress-ng: info: [1196] metrics untrustworthy: 0
stress-ng: info: [1196] successful run completed in 1 min, 0.10 secs
[1]+ Done cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes $((3 * 1024 * 1024 * 1024)) --vm-method all --seed 2025 --metrics -t 60
root@...ntu:~# echo 0 > /sys/fs/cgroup/high/memory.low
Now, switch to using the memcg eBPF program for memory priority
control.
memcg is a test program added to samples/bpf by this patch series; it
loads memcg.bpf.c into the kernel.
memcg.bpf.c monitors PGFAULT events in the high-priority memory cgroup.
When the number of events within one second exceeds a predefined
threshold, the eBPF hooks for that memory cgroup enforce their control
for the next second.
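For illustration, a minimal sketch of the BPF side of that idea follows.
This is not the memcg.bpf.c shipped in this series: the below_min hook
signature, the cgroup-id lookup via CO-RE and the control flag driven
from user space are all assumptions.

/* Minimal sketch only, not the memcg.bpf.c from this series.  The
 * below_min signature, the cgroup-id lookup and the user-space-driven
 * control flag are assumptions. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

/* Set by the loader while the PGFAULT rate of the high-priority group
 * exceeds the threshold; cleared one second later. */
int protect_high;
/* cgroup id of the high-priority group, filled in by the loader. */
__u64 high_cgrp_id;

SEC("struct_ops/below_min")
bool BPF_PROG(sketch_below_min, struct mem_cgroup *memcg)
{
        __u64 id = BPF_CORE_READ(memcg, css.cgroup, kn, id);

        /* Claim memory.min-style protection only for the high-priority
         * cgroup, and only while the fault-rate threshold is exceeded. */
        return protect_high && id == high_cgrp_id;
}

SEC(".struct_ops.link")
struct memcg_bpf_ops sketch_ops = {
        .below_min = (void *)sketch_below_min,
};

char LICENSE[] SEC("license") = "GPL";
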
The following command configures the high-priority memory cgroup to be
reported as below its memory.min protection during memory reclamation
whenever it triggers more than one PGFAULT event per second.
root@...ntu:~# ./memcg --low_path=/sys/fs/cgroup/low \
--high_path=/sys/fs/cgroup/high \
--threshold=1 --use_below_min
Successfully attached!
root@...ntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep \
--vm-bytes $((3 * 1024 * 1024 * 1024)) \
--vm-method all --seed 2025 --metrics -t 60 \
& cgexec -g memory:high stress-ng --vm 4 --vm-keep \
--vm-bytes $((3 * 1024 * 1024 * 1024)) \
--vm-method all --seed 2025 --metrics -t 60
[1] 1220
stress-ng: info: [1220] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1221] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1220] dispatching hogs: 4 vm
stress-ng: info: [1221] dispatching hogs: 4 vm
stress-ng: metrc: [1221] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1221] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1221] vm 24295240 60.08 221.36 7.64 404392.49 106095.60 95.29 886684
stress-ng: info: [1221] skipped: 0
stress-ng: info: [1221] passed: 4: vm (4)
stress-ng: info: [1221] failed: 0
stress-ng: info: [1221] metrics untrustworthy: 0
stress-ng: info: [1221] successful run completed in 1 min, 0.11 secs
stress-ng: metrc: [1220] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1220] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1220] vm 685732 60.13 11.69 75.98 11403.88 7822.30 36.45 496496
stress-ng: info: [1220] skipped: 0
stress-ng: info: [1220] passed: 4: vm (4)
stress-ng: info: [1220] failed: 0
stress-ng: info: [1220] metrics untrustworthy: 0
stress-ng: info: [1220] successful run completed in 1 min, 0.14 secs
[1]+ Done cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes $((3 * 1024 * 1024 * 1024)) --vm-method all --seed 2025 --metrics -t 60
This test demonstrates that, because the Python process within the
high-priority memory cgroup sleeps after allocating its memory, no
further page fault events occur and the below_min override is never
activated.
As a result, the stress-ng processes in the low-priority memory cgroup
achieve normal memory performance.
root@...ntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep \
--vm-bytes $((3 * 1024 * 1024 * 1024)) \
--vm-method all --seed 2025 --metrics -t 60 \
& cgexec -g memory:high python3 -c \
"import time; a = bytearray(3*1024*1024*1024); time.sleep(62)"
[1] 1238
stress-ng: info: [1238] setting to a 1 min, 0 secs run per stressor
stress-ng: info: [1238] dispatching hogs: 4 vm
stress-ng: metrc: [1238] stressor bogo ops real time usr time sys time bogo ops/s bogo ops/s CPU used per RSS Max
stress-ng: metrc: [1238] (secs) (secs) (secs) (real time) (usr+sys time) instance (%) (KB)
stress-ng: metrc: [1238] vm 33107485 60.08 205.41 13.19 551082.91 151448.44 90.97 886064
stress-ng: info: [1238] skipped: 0
stress-ng: info: [1238] passed: 4: vm (4)
stress-ng: info: [1238] failed: 0
stress-ng: info: [1238] metrics untrustworthy: 0
stress-ng: info: [1238] successful run completed in 1 min, 0.09 secs
In this patch series, I've incorporated a portion of Roman's patches
from [1] so that the entire series compiles cleanly on bpf-next.
I made some modifications to bpf_struct_ops_link_create
in "bpf: Pass flags in bpf_link_create for struct_ops" and
"libbpf: Support passing user-defined flags for struct_ops" to allow
the flags parameter to be passed into the kernel.
With this change, patch "mm/bpf: Add BPF_F_ALLOW_OVERRIDE support for
memcg_bpf_ops" enables BPF_F_ALLOW_OVERRIDE support for memcg_bpf_ops.
Patch "mm: memcontrol: Add BPF struct_ops for memory controller"
introduces BPF struct_ops support to the memory controller, enabling
custom and dynamic control over memory pressure. This is achieved
through a new struct_ops type, `memcg_bpf_ops`.
The `memcg_bpf_ops` struct provides the following hooks (a hedged sketch
of the ops table follows this list):
- `get_high_delay_ms`: Returns a custom throttling delay in
milliseconds for a cgroup that has breached its `memory.high`
limit. This is the primary mechanism for BPF-driven throttling.
- `below_low`: Overrides the `memory.low` protection check. If this
hook returns true, the cgroup is considered to be protected by its
`memory.low` setting, regardless of its actual usage.
- `below_min`: Similar to `below_low`, this overrides the `memory.min`
protection check.
- `handle_cgroup_online`/`offline`: Callbacks invoked when a cgroup
with an attached program comes online or goes offline, allowing for
state management.
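To make the shape of the interface concrete, a sketch of the ops table
is shown below; the argument lists are assumptions, and the
authoritative definition lives in the memcontrol.h changes of this
series.

/* Sketch only; argument lists are assumed. */
struct memcg_bpf_ops {
        /* Extra throttling delay (in ms) for a cgroup over memory.high. */
        unsigned int (*get_high_delay_ms)(struct mem_cgroup *memcg);
        /* If true, treat the cgroup as protected by memory.low. */
        bool (*below_low)(struct mem_cgroup *memcg);
        /* If true, treat the cgroup as protected by memory.min. */
        bool (*below_min)(struct mem_cgroup *memcg);
        /* Lifecycle notifications for cgroups covered by this ops. */
        void (*handle_cgroup_online)(struct mem_cgroup *memcg);
        void (*handle_cgroup_offline)(struct mem_cgroup *memcg);
};
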
Patch "samples/bpf: Add memcg priority control example" introduces
the programs memcg.c and memcg.bpf.c that were used in the previous
examples.
Changelog:
v3:
According to the comments of Michal Koutný and Chen Ridong, updated the
hooks to get_high_delay_ms, below_low, below_min, handle_cgroup_online
and handle_cgroup_offline.
According to the comments of Michal Koutný, added BPF_F_ALLOW_OVERRIDE
support to memcg_bpf_ops.
v2:
According to the comments of Tejun Heo, rebased on Roman Gushchin's BPF
OOM patch series [1] and added hierarchical delegation support.
According to the comments of Roman Gushchin and Michal Hocko, designed
concrete use-case scenarios and provided test results.
[1] https://lore.kernel.org/lkml/20251027231727.472628-1-roman.gushchin@linux.dev/
Hui Zhu (7):
bpf: Pass flags in bpf_link_create for struct_ops
libbpf: Support passing user-defined flags for struct_ops
mm: memcontrol: Add BPF struct_ops for memory controller
selftests/bpf: Add tests for memcg_bpf_ops
mm/bpf: Add BPF_F_ALLOW_OVERRIDE support for memcg_bpf_ops
selftests/bpf: Add test for memcg_bpf_ops hierarchies
samples/bpf: Add memcg priority control example
Roman Gushchin (5):
bpf: move bpf_struct_ops_link into bpf.h
bpf: initial support for attaching struct ops to cgroups
bpf: mark struct oom_control's memcg field as TRUSTED_OR_NULL
mm: define mem_cgroup_get_from_ino() outside of CONFIG_SHRINKER_DEBUG
libbpf: introduce bpf_map__attach_struct_ops_opts()
MAINTAINERS | 4 +
include/linux/bpf.h | 8 +
include/linux/memcontrol.h | 111 +++-
kernel/bpf/bpf_struct_ops.c | 20 +-
kernel/bpf/verifier.c | 5 +
mm/bpf_memcontrol.c | 274 +++++++-
mm/memcontrol.c | 34 +-
samples/bpf/.gitignore | 1 +
samples/bpf/Makefile | 9 +-
samples/bpf/memcg.bpf.c | 129 ++++
samples/bpf/memcg.c | 327 ++++++++++
tools/include/uapi/linux/bpf.h | 2 +-
tools/lib/bpf/bpf.c | 8 +
tools/lib/bpf/libbpf.c | 19 +-
tools/lib/bpf/libbpf.h | 14 +
tools/lib/bpf/libbpf.map | 2 +-
.../selftests/bpf/prog_tests/memcg_ops.c | 606 ++++++++++++++++++
tools/testing/selftests/bpf/progs/memcg_ops.c | 129 ++++
18 files changed, 1674 insertions(+), 28 deletions(-)
create mode 100644 samples/bpf/memcg.bpf.c
create mode 100644 samples/bpf/memcg.c
create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops.c
--
2.43.0