Message-ID: <cover.1767012332.git.zhuhui@kylinos.cn>
Date: Tue, 30 Dec 2025 11:01:58 +0800
From: Hui Zhu <hui.zhu@...ux.dev>
To: Andrew Morton <akpm@...ux-foundation.org>,
	Johannes Weiner <hannes@...xchg.org>,
	Michal Hocko <mhocko@...nel.org>,
	Roman Gushchin <roman.gushchin@...ux.dev>,
	Shakeel Butt <shakeel.butt@...ux.dev>,
	Muchun Song <muchun.song@...ux.dev>,
	Alexei Starovoitov <ast@...nel.org>,
	Daniel Borkmann <daniel@...earbox.net>,
	Andrii Nakryiko <andrii@...nel.org>,
	Martin KaFai Lau <martin.lau@...ux.dev>,
	Eduard Zingerman <eddyz87@...il.com>,
	Song Liu <song@...nel.org>,
	Yonghong Song <yonghong.song@...ux.dev>,
	John Fastabend <john.fastabend@...il.com>,
	KP Singh <kpsingh@...nel.org>,
	Stanislav Fomichev <sdf@...ichev.me>,
	Hao Luo <haoluo@...gle.com>,
	Jiri Olsa <jolsa@...nel.org>,
	Shuah Khan <shuah@...nel.org>,
	Peter Zijlstra <peterz@...radead.org>,
	Miguel Ojeda <ojeda@...nel.org>,
	Nathan Chancellor <nathan@...nel.org>,
	Kees Cook <kees@...nel.org>,
	Tejun Heo <tj@...nel.org>,
	Jeff Xu <jeffxu@...omium.org>,
	mkoutny@...e.com,
	Jan Hendrik Farr <kernel@...rr.cc>,
	Christian Brauner <brauner@...nel.org>,
	Randy Dunlap <rdunlap@...radead.org>,
	Brian Gerst <brgerst@...il.com>,
	Masahiro Yamada <masahiroy@...nel.org>,
	davem@...emloft.net,
	Jakub Kicinski <kuba@...nel.org>,
	Jesper Dangaard Brouer <hawk@...nel.org>,
	linux-kernel@...r.kernel.org,
	linux-mm@...ck.org,
	cgroups@...r.kernel.org,
	bpf@...r.kernel.org,
	linux-kselftest@...r.kernel.org
Cc: Hui Zhu <zhuhui@...inos.cn>
Subject: [RFC PATCH v2 0/3] Memory Controller eBPF support

From: Hui Zhu <zhuhui@...inos.cn>

This series adds BPF struct_ops support to the memory controller,
enabling dynamic control over memory pressure through the
memcg_nr_pages_over_high mechanism. This allows administrators to
throttle the memory usage of low-priority cgroups according to
custom policies implemented in BPF programs.

Background and Motivation

The memory controller provides the memory.high limit to throttle
cgroups whose usage grows past it. However, the current
implementation applies the same throttling policy to every cgroup,
without considering priority or workload characteristics.

This series introduces a BPF hook that allows reporting
additional "pages over high" for specific cgroups, effectively
increasing memory pressure and throttling for lower-priority
workloads when higher-priority cgroups need resources.

Use Case: Priority-Based Memory Management

Consider a system running both latency-sensitive services and
batch processing workloads. When the high-priority service
experiences memory pressure (detected via page scan events),
the BPF program can artificially inflate the "over high" count
for low-priority cgroups, causing them to be throttled more
aggressively and freeing up memory for the critical workload.
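As a rough illustration of such a policy, consider the BPF sketch
below. It is not the sample from PATCH 3/3: the hook signature, the
pressure flag, and the 1024-page penalty are all assumptions made
for illustration.

#include "vmlinux.h"	/* must carry memcg_bpf_ops from a patched kernel */
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char _license[] SEC("license") = "GPL";

__u64 low_prio_cgid;		/* filled in by the loader */
bool high_prio_under_pressure;	/* set by a tracepoint program, see below */

SEC("struct_ops/memcg_nr_pages_over_high")
unsigned long BPF_PROG(over_high, struct mem_cgroup *memcg)
{
	/* Penalize only the low-priority cgroup, and only while the
	 * high-priority cgroup is actually under reclaim. */
	if (!high_prio_under_pressure)
		return 0;
	if (memcg->css.cgroup->kn->id != low_prio_cgid)
		return 0;
	return 1024;	/* report 1024 extra pages over high */
}

SEC(".struct_ops.link")
struct memcg_bpf_ops prio_ops = {
	.memcg_nr_pages_over_high = (void *)over_high,
};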

Implementation

This series builds upon Roman Gushchin's BPF OOM patch series in [1].

The implementation adds:
1. A memcg_bpf_ops struct_ops type with memcg_nr_pages_over_high
   hook
2. Integration into memory pressure calculation paths
3. Cgroup hierarchy management (inheritance during online/offline)
4. SRCU protection for safe concurrent access (read side sketched below)
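To make point 4 concrete, the read side could look roughly like the
sketch below. The SRCU struct and field names are assumptions; the
actual integration lives in mm/bpf_memcontrol.c.

static unsigned long memcg_bpf_nr_pages_over_high(struct mem_cgroup *memcg)
{
	struct memcg_bpf_ops *ops;
	unsigned long extra = 0;
	int idx;

	/* SRCU protects the ops struct against a concurrent detach
	 * while still allowing lockless attach/detach. */
	idx = srcu_read_lock(&memcg_bpf_srcu);
	ops = srcu_dereference(memcg->bpf_ops, &memcg_bpf_srcu);
	if (ops && ops->memcg_nr_pages_over_high)
		extra = ops->memcg_nr_pages_over_high(memcg);
	srcu_read_unlock(&memcg_bpf_srcu, idx);

	return extra;
}

On detach, the writer side would clear the pointer and call
synchronize_srcu() before the ops can go away.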

Why Not PSI?

As discussed in [2], this implementation does not use PSI as the
trigger. Instead, the sample code monitors PGSCAN events via
tracepoints, which gives more direct feedback on memory pressure.
One possible hookup is sketched below.
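Extending the BPF sketch above, one plausible hookup is shown here.
The tracepoint choice and the cgroup-id test are assumptions, not
necessarily what samples/bpf/memcg.bpf.c does.

__u64 high_prio_cgid;	/* filled in by the loader */

SEC("tp_btf/mm_vmscan_memcg_reclaim_begin")
int BPF_PROG(on_memcg_reclaim, int order, gfp_t gfp_flags)
{
	/* memcg reclaim runs in the context of the allocating task,
	 * so the current cgroup identifies who is under pressure. */
	if (bpf_get_current_cgroup_id() == high_prio_cgid)
		high_prio_under_pressure = true;
	return 0;
}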

Example Results

Tested in an x86_64 QEMU guest (10 CPUs, 4 GB RAM, swap backed by a
cache=none disk):
root@...ntu:~# cat /proc/sys/vm/swappiness
60
root@...ntu:~# mkdir /sys/fs/cgroup/high
root@...ntu:~# mkdir /sys/fs/cgroup/low
root@...ntu:~# ./memcg /sys/fs/cgroup/low /sys/fs/cgroup/high 100 1024
Successfully attached!
root@...ntu:~# cgexec -g memory:low stress-ng --vm 4 --vm-keep --vm-bytes 80% \
--vm-method all --seed 2025 --metrics -t 60 \
& cgexec -g memory:high stress-ng --vm 4 --vm-keep --vm-bytes 80% \
--vm-method all --seed 2025 --metrics -t 60
[1] 1075
stress-ng: info:  [1075] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [1076] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [1075] dispatching hogs: 4 vm
stress-ng: info:  [1076] dispatching hogs: 4 vm
stress-ng: metrc: [1076] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [1076]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [1076] vm             21033377     60.47    158.04      3.66    347825.55      130076.67        66.85        834836
stress-ng: info:  [1076] skipped: 0
stress-ng: info:  [1076] passed: 4: vm (4)
stress-ng: info:  [1076] failed: 0
stress-ng: info:  [1076] metrics untrustworthy: 0
stress-ng: info:  [1076] successful run completed in 1 min, 0.72 secs
root@...ntu:~# stress-ng: metrc: [1075] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [1075]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [1075] vm                11568     65.05      0.00      0.21       177.83       56123.74         0.08          3200
stress-ng: info:  [1075] skipped: 0
stress-ng: info:  [1075] passed: 4: vm (4)
stress-ng: info:  [1075] failed: 0
stress-ng: info:  [1075] metrics untrustworthy: 0
stress-ng: info:  [1075] successful run completed in 1 min, 5.06 secs

Results show the low-priority cgroup (/sys/fs/cgroup/low) was
significantly throttled:
- High-priority cgroup: 21,033,377 bogo ops at 347,825 ops/s
- Low-priority cgroup: 11,568 bogo ops at 177 ops/s

The stress-ng instance in the low-priority cgroup completed roughly
99.9% fewer memory operations than the one in the high-priority
cgroup, demonstrating effective priority enforcement through
BPF-controlled memory pressure.

Patch Overview

PATCH 1/3: Core kernel implementation
  - Adds memcg_bpf_ops struct_ops support
  - Implements cgroup lifecycle management
  - Integrates hook into pressure calculation

PATCH 2/3: Selftest suite
  - Validates attach/detach behavior
  - Tests hierarchy inheritance
  - Verifies throttling effectiveness

PATCH 3/3: Sample programs
  - Demonstrates PGSCAN-based triggering
  - Shows priority-based throttling
  - Provides a reference implementation (loader flow sketched below)
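For reference, a libbpf skeleton loader for such a program would
typically follow the open/load/attach flow below. The skeleton and
map names are assumptions, not the actual samples/bpf/memcg.c.

#include <stdio.h>
#include <unistd.h>
#include <bpf/libbpf.h>
#include "memcg.skel.h"		/* hypothetical generated skeleton */

int main(void)
{
	struct memcg_bpf *skel;
	struct bpf_link *link;

	skel = memcg_bpf__open();
	if (!skel)
		return 1;

	/* The cgroup IDs would be resolved from the command-line
	 * paths (e.g. via name_to_handle_at()); zeros here are
	 * placeholders. */
	skel->bss->low_prio_cgid = 0;
	skel->bss->high_prio_cgid = 0;

	if (memcg_bpf__load(skel))
		goto out;

	/* Registering the struct_ops map is the "attach" step. */
	link = bpf_map__attach_struct_ops(skel->maps.prio_ops);
	if (!link)
		goto out;

	printf("Successfully attached!\n");
	pause();	/* keep the link (and the policy) alive */
out:
	memcg_bpf__destroy(skel);
	return 0;
}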

Changelog:
v2:
- Following Tejun Heo's comments, rebased onto Roman Gushchin's BPF
  OOM patch series [1] and added hierarchical delegation support.
- Following comments from Roman Gushchin and Michal Hocko, designed a
  concrete use-case scenario and provided test results.

[1] https://lore.kernel.org/lkml/20251027231727.472628-1-roman.gushchin@linux.dev/
[2] https://lore.kernel.org/lkml/1d9a162605a3f32ac215430131f7745488deaa34@linux.dev/

Hui Zhu (3):
  mm: memcontrol: Add BPF struct_ops for memory pressure control
  selftests/bpf: Add tests for memcg_bpf_ops
  samples/bpf: Add memcg priority control example

 MAINTAINERS                                   |   5 +
 include/linux/memcontrol.h                    |   2 +
 mm/bpf_memcontrol.c                           | 241 ++++++++++++-
 mm/bpf_memcontrol.h                           |  73 ++++
 mm/memcontrol.c                               |  27 +-
 samples/bpf/.gitignore                        |   1 +
 samples/bpf/Makefile                          |   9 +-
 samples/bpf/memcg.bpf.c                       |  95 +++++
 samples/bpf/memcg.c                           | 204 +++++++++++
 .../selftests/bpf/prog_tests/memcg_ops.c      | 340 ++++++++++++++++++
 .../selftests/bpf/progs/memcg_ops_over_high.c |  95 +++++
 11 files changed, 1082 insertions(+), 10 deletions(-)
 create mode 100644 mm/bpf_memcontrol.h
 create mode 100644 samples/bpf/memcg.bpf.c
 create mode 100644 samples/bpf/memcg.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/memcg_ops.c
 create mode 100644 tools/testing/selftests/bpf/progs/memcg_ops_over_high.c

-- 
2.43.0

