lists.openwall.net   lists  /  announce  owl-users  owl-dev  john-users  john-dev  passwdqc-users  yescrypt  popa3d-users  /  oss-security  kernel-hardening  musl  sabotage  tlsify  passwords  /  crypt-dev  xvendor  /  Bugtraq  Full-Disclosure  linux-kernel  linux-netdev  linux-ext4  linux-hardening  linux-cve-announce  PHC 
Open Source and information security mailing list archives
 
Hash Suite: Windows password security audit tool. GUI, reports in PDF.
[<prev] [next>] [thread-next>] [day] [month] [year] [list]
Message-ID: <20250920005931.2753828-1-tj@kernel.org>
Date: Fri, 19 Sep 2025 14:58:23 -1000
From: Tejun Heo <tj@...nel.org>
To: void@...ifault.com,
	arighi@...dia.com,
	multics69@...il.com
Cc: linux-kernel@...r.kernel.org,
	sched-ext@...ts.linux.dev,
	memxor@...il.com,
	bpf@...r.kernel.org
Subject: [PATCHSET RFC] sched_ext: Implement cgroup sub-scheduler support

This patchset implements cgroup sub-scheduler support for sched_ext,
enabling multiple schedulers to operate hierarchically within the cgroup
tree. This capability supports multi-tenant server environments and other
scenarios where systems must be partitioned to serve distinct workloads,
each requiring specialized scheduling policies.

Traditional approaches rely on hard partitioning via cpuset, but this
approach lacks the dynamism required by modern workloads. Users typically
care less about specific CPU assignments and more about optimizations
available on larger machines: opportunistic over-commit, improved latency
for critical workloads while preserving bandwidth fairness, control
mechanisms beyond simple CPU time (such as memory bandwidth isolation),
and intelligent placement to optimize cache locality.

The cgroup sub-scheduler approach enables schedulers to attach anywhere
in the cgroup hierarchy, with parent schedulers dynamically controlling
CPU allocation to their children. This design provides BPF-driven
flexibility while eliminating the constraints of hard partitioning.

This early-stage implementation demonstrates the fundamental building
blocks for hierarchical scheduler operation. While the enqueue path and
other components require further development, this patchset establishes
the core mechanisms for nested scheduler operation and is developed
enough to showcase all essential components.

The framework supports scheduling hierarchies up to SCX_SUB_MAX_DEPTH
levels (currently set to 4). Enable and disable operations selectively
bypass only tasks within the affected subtree, minimizing system-wide
disruptions while providing reasonable isolation for child scheduler
failures.

To see how this looks from the BPF scheduler perspective, examine the
scx_qmap.bpf.c changes in the final patch, which demonstrate simple
nested dispatch implementation.

Patch Organization:

Standalone fixes (01-07):
 01 sched_ext: Use rhashtable_lookup() instead of rhashtable_lookup_fast()
 02 sched_ext: Improve SCX_KF_DISPATCH comment
 03 sched_ext: Fix stray scx_root usage in task_can_run_on()
 04 sched_ext: Use bitfields for boolean warning flags
 05 sched_ext: Add SCX_EFLAG_INITIALIZED to indicate successful initialization
 06 sched_ext: Make qmap dump operation non-destructive
 07 tools/sched_ext: scx_qmap: Make debug output quieter by default

Preparation patches (08-22):
 08 sched_ext: Separate out scx_kick_cpu() and add sch to its name
 09 sched_ext: Add the sch parameter to __bstr_format()
 10 sched_ext: Add the sch parameter to ext_idle helpers
 11 sched_ext: Drop kf_cpu_valid
 12 sched_ext: Add the sch parameter to scx_dsq_insert_preamble()
 13 sched_ext: Drop scx_kf_exit() and scx_kf_error()
 14 sched_ext: Misc updates around scx_sched instance pointer handling
 15 sched_ext: Keep dying tasks on a separate list
 16 sched_ext: Implement cgroup subtree iteration for scx_task_iter
 17 sched_ext: Add kargs to scx_fork()
 18 sched/core: Swap the order between sched_post_fork() and wake_up_new_task()
 19 cgroup: Expose some cgroup helpers
 20 sched_ext: Update p->scx.disallow warning in scx_init_task()
 21 sched_ext: Minor reorganization of enable/disable paths
 22 sched_ext: Factor out scx_claim_exit() from scx_disable()

Core sub-scheduler implementation (23-40):
 23 sched_ext: Introduce cgroup sub-sched support
 24 HACK_NOT_FOR_UPSTREAM: BPF: Implement prog grouping hack
 25 sched_ext: Introduce scx_task_sched()/_rcu()
 26 sched_ext: Introduce scx_prog_sched()
 27 sched_ext: Ignore insertions of not owned tasks into sub-sched DSQs
 28 sched_ext: scx_dsq_move() should validate the task belongs to the dsq
 29 sched_ext: Refactor task init/exit helpers
 30 sched_ext: Make scx_prio_less() handle multiple schedulers
 31 sched_ext: Move bypass_depth into scx_sched
 32 sched_ext: Make bypass mode sub-sched aware
 33 sched_ext: Factor out scx_dispatch_sched()
 34 sched_ext: When calling ops.dispatch(), prev must be on the correct sched
 35 sched_ext: Dispatch from all scx_sched instances
 36 sched_ext: Move scx_dsp_ctx and scx_dsp_max_batch into scx_sched
 37 sched_ext: Make watchdog sub-sched aware
 38 sched_ext: Convert scx_dump_state() spinlock to raw spinlock
 39 sched_ext: Support dumping multiple schedulers and add scheduler identification
 40 sched_ext: Implement cgroup sub-sched enabling and disabling

Nested dispatch implementation (41-46):
 41 HACK_NOT_FOR_UPSTREAM: sched_ext: Work around @aux__prog prototype mismatch
 42 sched_ext: Wrap global DSQs in per-node structure
 43 sched_ext: Add bypass DSQ for sub-schedulers
 44 sched_ext: Factor out scx_link_sched() and scx_unlink_sched()
 45 sched_ext: Add rhashtable lookup for sub-schedulers
 46 sched_ext: Add basic building blocks for nested sub-scheduler dispatching

Implementation Notes:
- Patches 01-07: Independent fixes to be separated after merge window
- Patches 08-22: Infrastructure preparation (mostly sched_ext, one cgroup change)
- Patch 23: Skeletal sub-scheduler support (create/destroy only)
- Patches 24,41: Temporary BPF hacks requiring proper upstream solution
- Patches 25-39: Task migration and multi-scheduler operation mechanisms
- Patch 40: Full sub-scheduler enable/disable with ops.dispatch() support
  (enqueue path not yet implemented)
- Patches 42-46: Nested dispatch infrastructure and scx_bpf_dispatch_sched()

The patches are available in the git repository:
git://git.kernel.org/pub/scm/linux/kernel/git/tj/sched_ext.git scx-sub-sched

 include/linux/bpf.h                      |    5
 include/linux/cgroup-defs.h              |    4
 include/linux/cgroup.h                   |   65
 include/linux/sched.h                    |    2
 include/linux/sched/ext.h                |   21
 init/Kconfig                             |    4
 kernel/bpf/syscall.c                     |   23
 kernel/cgroup/cgroup-internal.h          |    6
 kernel/cgroup/cgroup.c                   |   55
 kernel/exit.c                            |    1
 kernel/fork.c                            |    6
 kernel/sched/core.c                      |    2
 kernel/sched/ext.c                       | 2362 ++++++++++++++++++++++++-------
 kernel/sched/ext.h                       |    4
 kernel/sched/ext_idle.c                  |  197 ++
 kernel/sched/ext_internal.h              |  223 ++
 kernel/sched/sched.h                     |    7
 tools/sched_ext/include/scx/common.bpf.h |   90 -
 tools/sched_ext/include/scx/compat.bpf.h |    7
 tools/sched_ext/scx_qmap.bpf.c           |  146 +
 tools/sched_ext/scx_qmap.c               |   36
 21 files changed, 2556 insertions(+), 710 deletions(-)

Powered by blists - more mailing lists

Powered by Openwall GNU/*/Linux Powered by OpenVZ