Message-ID: <c631759f-6e71-4c27-9a56-fc3159793e81@igalia.com>
Date:   Mon, 13 Nov 2023 22:34:23 +0900
From:   Changwoo Min <changwoo@...lia.com>
To:     Tejun Heo <tj@...nel.org>, torvalds@...ux-foundation.org,
        mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        vincent.guittot@...aro.org, dietmar.eggemann@....com,
        rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de,
        bristot@...hat.com, vschneid@...hat.com, ast@...nel.org,
        daniel@...earbox.net, andrii@...nel.org, martin.lau@...nel.org,
        joshdon@...gle.com, brho@...gle.com, pjt@...gle.com,
        derkling@...gle.com, haoluo@...gle.com, dvernet@...a.com,
        dschatzberg@...a.com, dskarlat@...cmu.edu, riel@...riel.com,
        himadrics@...ia.fr, memxor@...il.com
Cc:     linux-kernel@...r.kernel.org, bpf@...r.kernel.org,
        kernel-team@...a.com, Andrea Righi <andrea.righi@...onical.com>,
        kernel-dev@...lia.com
Subject: Re: [PATCH 12/36] sched_ext: Implement BPF extensible scheduler class

Currently, scx_ops_enable_state_str is defined only when
CONFIG_SCHED_DEBUG is enabled. However, print_scx_info() uses
scx_ops_enable_state_str regardless of whether CONFIG_SCHED_DEBUG is
enabled or not. So when CONFIG_SCHED_DEBUG is not enabled, the current
code generates the following compilation error:

kernel/sched/ext.c: In function ‘print_scx_info’:
kernel/sched/ext.c:3720:24: error: ‘scx_ops_enable_state_str’ undeclared

So the #ifdef CONFIG_SCHED_DEBUG guard should be moved to after the
definition of scx_ops_enable_state_str:

--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -3406,7 +3406,6 @@ static int scx_ops_enable(struct sched_ext_ops *ops)
  	return ret;
  }

-#ifdef CONFIG_SCHED_DEBUG
  static const char *scx_ops_enable_state_str[] = {
  	[SCX_OPS_PREPPING]	= "prepping",
  	[SCX_OPS_ENABLING]	= "enabling",
@@ -3415,6 +3414,7 @@ static const char *scx_ops_enable_state_str[] = {
  	[SCX_OPS_DISABLED]	= "disabled",
  };

+#ifdef CONFIG_SCHED_DEBUG
  static int scx_debug_show(struct seq_file *m, void *v)
  {
  	mutex_lock(&scx_ops_enable_mutex);
-- 

On 23. 11. 11. 11:47, Tejun Heo wrote:
> Implement a new scheduler class sched_ext (SCX), which allows scheduling
> policies to be implemented as BPF programs to achieve the following:
> 
> 1. Ease of experimentation and exploration: Enabling rapid iteration of new
>     scheduling policies.
> 
> 2. Customization: Building application-specific schedulers which implement
>     policies that are not applicable to general-purpose schedulers.
> 
> 3. Rapid scheduler deployments: Non-disruptive swap outs of scheduling
>     policies in production environments.
> 
> sched_ext leverages BPF’s struct_ops feature to define a structure which
> exports function callbacks and flags to BPF programs that wish to implement
> scheduling policies. The struct_ops structure exported by sched_ext is
> struct sched_ext_ops, and is conceptually similar to struct sched_class. The
> role of sched_ext is to map the complex sched_class callbacks to the
> simpler and more ergonomic struct sched_ext_ops callbacks.
> 
> For more detailed discussion on the motivations and overview, please refer
> to the cover letter.
> 
> Later patches will also add several example schedulers and documentation.
> 
> This patch implements the minimum core framework to enable implementation of
> BPF schedulers. Subsequent patches will gradually add functionalities
> including safety guarantee mechanisms, nohz and cgroup support.
> 
> include/linux/sched/ext.h defines struct sched_ext_ops. With the comment on
>    top, each operation should be self-explanatory. The following are worth
> noting:
> 
> * Both "sched_ext" and its shorthand "scx" are used. If the identifier
>    already has "sched" in it, "ext" is used; otherwise, "scx".
> 
> * In sched_ext_ops, only .name is mandatory. Every operation is optional and
>    if omitted a simple but functional default behavior is provided.
> 
> * A new policy constant SCHED_EXT is added and a task can select sched_ext
>    by invoking sched_setscheduler(2) with the new policy constant. However,
>    if the BPF scheduler is not loaded, SCHED_EXT is the same as SCHED_NORMAL
>    and the task is scheduled by CFS. When the BPF scheduler is loaded, all
>    tasks which have the SCHED_EXT policy are switched to sched_ext.
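
(Illustration only, not part of the patch: a minimal userspace sketch of
opting the calling task into the new policy via sched_setscheduler(2). The
fallback #define of SCHED_EXT to 7 mirrors the uapi addition in this patch
and is only needed with older headers; everything else is plain libc.)

#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

#ifndef SCHED_EXT
#define SCHED_EXT 7	/* value added to include/uapi/linux/sched.h below */
#endif

int main(void)
{
	struct sched_param param = { .sched_priority = 0 };

	/* pid 0 == calling task. With no BPF scheduler loaded this behaves
	 * like SCHED_NORMAL and the task keeps being scheduled by CFS. */
	if (sched_setscheduler(0, SCHED_EXT, &param)) {
		fprintf(stderr, "sched_setscheduler(SCHED_EXT): %s\n",
			strerror(errno));
		return 1;
	}
	return 0;
}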
> 
> * To bridge the workflow imbalance between the scheduler core and
>    sched_ext_ops callbacks, sched_ext uses simple FIFOs called dispatch
>    queues (dsq's). By default, there is one global dsq (SCX_DSQ_GLOBAL), and
>    one local per-CPU dsq (SCX_DSQ_LOCAL). SCX_DSQ_GLOBAL is provided for
>    convenience and need not be used by a scheduler that doesn't require it.
>    SCX_DSQ_LOCAL is the per-CPU FIFO that sched_ext pulls from when putting
>    the next task on the CPU. The BPF scheduler can manage an arbitrary number
>    of dsq's using scx_bpf_create_dsq() and scx_bpf_destroy_dsq().
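
(Illustration only, not part of the patch: a sketch of about the smallest
possible BPF scheduler built on the callbacks described above, enqueueing
every task on SCX_DSQ_GLOBAL and refilling the local DSQ from it in
dispatch(). The BPF_STRUCT_OPS() macro, the scx_common.bpf.h header and the
exact kfunc prototypes are assumed to match the example schedulers added by
later patches in this series.)

#include "scx_common.bpf.h"	/* assumed helper header from the examples */

void BPF_STRUCT_OPS(minimal_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Hand every runnable task to the built-in global FIFO with the
	 * default slice. */
	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}

void BPF_STRUCT_OPS(minimal_dispatch, s32 cpu, struct task_struct *prev)
{
	/* The local DSQ ran dry; pull the next task from the global FIFO. */
	scx_bpf_consume(SCX_DSQ_GLOBAL);
}

SEC(".struct_ops")
struct sched_ext_ops minimal_ops = {
	.enqueue	= (void *)minimal_enqueue,
	.dispatch	= (void *)minimal_dispatch,
	.name		= "minimal",
};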
> 
> * sched_ext guarantees system integrity no matter what the BPF scheduler
>    does. To enable this, each task's ownership is tracked through
>    p->scx.ops_state and all tasks are put on scx_tasks list. The disable path
>    can always recover and revert all tasks back to CFS. See p->scx.ops_state
>    and scx_tasks.
> 
> * A task is not tied to its rq while enqueued. This decouples CPU selection
>    from queueing and allows sharing a scheduling queue across an arbitrary
>    subset of CPUs. This adds some complexities as a task may need to be
>    bounced between rq's right before it starts executing. See
>    dispatch_to_local_dsq() and move_task_to_local_dsq().
> 
> * One complication that arises from the above weak association between task
>    and rq is that synchronizing with dequeue() gets complicated as dequeue()
>    may happen anytime while the task is enqueued and the dispatch path might
>    need to release the rq lock to transfer the task. Solving this requires a
>    bit of complexity. See the logic around p->scx.sticky_cpu and
>    p->scx.ops_qseq.
> 
> * Both enable and disable paths are a bit complicated. The enable path
>    switches all tasks without blocking to avoid issues which can arise from
>    partially switched states (e.g. the switching task itself being starved).
>    The disable path can't trust the BPF scheduler at all, so it also has to
>    guarantee forward progress without blocking. See scx_ops_enable() and
>    scx_ops_disable_workfn().
> 
> * When sched_ext is disabled, static_branches are used to shut down the
>    entry points from hot paths.
> 
> v5: * To accommodate 32bit configs, p->scx.ops_state is now atomic_long_t
>        instead of atomic64_t and scx_dsp_buf_ent.qseq which uses
>        load_acquire/store_release is now unsigned long instead of u64.
> 
>      * Fix the bug where bpf_scx_btf_struct_access() was allowing write
>        access to arbitrary fields.
> 
>      * Distinguish kfuncs which can be called from any sched_ext ops and from
>        anywhere. e.g. scx_bpf_pick_idle_cpu() can now be called only from
>        sched_ext ops.
> 
>      * Rename "type" to "kind" in scx_exit_info to make it easier to use in
>        languages in which "type" is a reserved keyword.
> 
>      * Since cff9b2332ab7 ("kernel/sched: Modify initial boot task idle
>        setup"), PF_IDLE is not set on idle tasks which haven't been online
>        yet, which made scx_task_iter_next_filtered() include those idle tasks
>        in iterations, leading to oopses. Update scx_task_iter_next_filtered()
>        to directly test p->sched_class against idle_sched_class instead of
>        using is_idle_task() which tests PF_IDLE.
> 
>      * Other updates to match upstream changes such as adding const to
>        set_cpumask() param and renaming check_preempt_curr() to
>        wakeup_preempt().
> 
> v4: * SCHED_CHANGE_BLOCK replaced with the previous
>        sched_deq_and_put_task()/sched_enq_and_set_task() pair. This is
>        because upstream is adopting a different generic cleanup mechanism.
>        Once that lands, the code will be adapted accordingly.
> 
>      * task_on_scx() used to test whether a task should be switched into SCX,
>        which is confusing. Renamed to task_should_scx(). task_on_scx() now
>        tests whether a task is currently on SCX.
> 
>      * scx_has_idle_cpus is barely used anymore and replaced with direct
>        check on the idle cpumask.
> 
>      * SCX_PICK_IDLE_CORE added and scx_pick_idle_cpu() improved to prefer
>        fully idle cores.
> 
>      * ops.enable() now sees up-to-date p->scx.weight value.
> 
>      * ttwu_queue path is disabled for tasks on SCX to avoid confusing BPF
>        schedulers expecting ->select_cpu() call.
> 
>      * Use cpu_smt_mask() instead of topology_sibling_cpumask() like the rest
>        of the scheduler.
> 
> v3: * ops.set_weight() added to allow BPF schedulers to track weight changes
>        without polling p->scx.weight.
> 
>      * move_task_to_local_dsq() was losing SCX-specific enq_flags when
>        enqueueing the task on the target dsq because it goes through
>        activate_task() which loses the upper 32bit of the flags. Carry the
>        flags through rq->scx.extra_enq_flags.
> 
>      * scx_bpf_dispatch(), scx_bpf_pick_idle_cpu(), scx_bpf_task_running()
>        and scx_bpf_task_cpu() now use the new KF_RCU instead of
>        KF_TRUSTED_ARGS to make it easier for BPF schedulers to call them.
> 
>      * The kfunc helper access control mechanism implemented through
>        sched_ext_entity.kf_mask is improved. Now SCX_CALL_OP*() is always
>        used when invoking scx_ops operations.
> 
> v2: * balance_scx_on_up() is dropped. Instead, on UP, balance_scx() is
>        called from put_prev_task_scx() and pick_next_task_scx() as necessary.
>        To determine whether balance_scx() should be called from
>        put_prev_task_scx(), SCX_TASK_DEQD_FOR_SLEEP flag is added. See the
>        comment in put_prev_task_scx() for details.
> 
>      * sched_deq_and_put_task() / sched_enq_and_set_task() sequences replaced
>        with SCHED_CHANGE_BLOCK().
> 
>      * Unused all_dsqs list removed. This was a left-over from previous
>        iterations.
> 
>      * p->scx.kf_mask is added to track and enforce which kfunc helpers are
>        allowed. Also, init/exit sequences are updated to make some kfuncs
>        always safe to call regardless of the current BPF scheduler state.
>        Combined, this should make all the kfuncs safe.
> 
>      * BPF now supports sleepable struct_ops operations. Hacky workaround
>        removed and operations and kfunc helpers are tagged appropriately.
> 
>      * BPF now supports bitmask / cpumask helpers. scx_bpf_get_idle_cpumask()
>        and friends are added so that BPF schedulers can use the idle masks
>        with the generic helpers. This replaces the hacky kfunc helpers added
>        by a separate patch in V1.
> 
>      * CONFIG_SCHED_CLASS_EXT can no longer be enabled if SCHED_CORE is
>        enabled. This restriction will be removed by a later patch which adds
>        core-sched support.
> 
>      * Add MAINTAINERS entries and other misc changes.
> 
> Signed-off-by: Tejun Heo <tj@...nel.org>
> Co-authored-by: David Vernet <dvernet@...a.com>
> Acked-by: Josh Don <joshdon@...gle.com>
> Acked-by: Hao Luo <haoluo@...gle.com>
> Acked-by: Barret Rhoden <brho@...gle.com>
> Cc: Andrea Righi <andrea.righi@...onical.com>
> ---
>   MAINTAINERS                       |    3 +
>   include/asm-generic/vmlinux.lds.h |    1 +
>   include/linux/sched.h             |    5 +
>   include/linux/sched/ext.h         |  401 +++-
>   include/uapi/linux/sched.h        |    1 +
>   init/init_task.c                  |   10 +
>   kernel/Kconfig.preempt            |   22 +-
>   kernel/bpf/bpf_struct_ops_types.h |    4 +
>   kernel/sched/build_policy.c       |    4 +
>   kernel/sched/core.c               |   70 +
>   kernel/sched/debug.c              |    6 +
>   kernel/sched/ext.c                | 3158 +++++++++++++++++++++++++++++
>   kernel/sched/ext.h                |  118 +-
>   kernel/sched/sched.h              |   16 +
>   14 files changed, 3815 insertions(+), 4 deletions(-)
>   create mode 100644 kernel/sched/ext.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 14e1194faa4b..defe8e7e4c8f 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -19188,6 +19188,8 @@ R:	Ben Segall <bsegall@...gle.com> (CONFIG_CFS_BANDWIDTH)
>   R:	Mel Gorman <mgorman@...e.de> (CONFIG_NUMA_BALANCING)
>   R:	Daniel Bristot de Oliveira <bristot@...hat.com> (SCHED_DEADLINE)
>   R:	Valentin Schneider <vschneid@...hat.com> (TOPOLOGY)
> +R:	Tejun Heo <tj@...nel.org> (SCHED_EXT)
> +R:	David Vernet <void@...ifault.com> (SCHED_EXT)
>   L:	linux-kernel@...r.kernel.org
>   S:	Maintained
>   T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git sched/core
> @@ -19196,6 +19198,7 @@ F:	include/linux/sched.h
>   F:	include/linux/wait.h
>   F:	include/uapi/linux/sched.h
>   F:	kernel/sched/
> +F:	tools/sched_ext/
>   
>   SCSI LIBSAS SUBSYSTEM
>   R:	John Garry <john.g.garry@...cle.com>
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index 67d8dd2f1bde..575322902ef9 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -131,6 +131,7 @@
>   	*(__dl_sched_class)			\
>   	*(__rt_sched_class)			\
>   	*(__fair_sched_class)			\
> +	*(__ext_sched_class)			\
>   	*(__idle_sched_class)			\
>   	__sched_class_lowest = .;
>   
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 12ec109ce8c9..e921883fbe34 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -70,6 +70,8 @@ struct task_delay_info;
>   struct task_group;
>   struct user_event_mm;
>   
> +#include <linux/sched/ext.h>
> +
>   /*
>    * Task state bitmask. NOTE! These bits are also
>    * encoded in fs/proc/array.c: get_task_state().
> @@ -795,6 +797,9 @@ struct task_struct {
>   	struct sched_entity		se;
>   	struct sched_rt_entity		rt;
>   	struct sched_dl_entity		dl;
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +	struct sched_ext_entity		scx;
> +#endif
>   	const struct sched_class	*sched_class;
>   
>   #ifdef CONFIG_SCHED_CORE
> diff --git a/include/linux/sched/ext.h b/include/linux/sched/ext.h
> index a05dfcf533b0..b6462d953ec6 100644
> --- a/include/linux/sched/ext.h
> +++ b/include/linux/sched/ext.h
> @@ -1,9 +1,408 @@
>   /* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
> + * Copyright (c) 2022 Tejun Heo <tj@...nel.org>
> + * Copyright (c) 2022 David Vernet <dvernet@...a.com>
> + */
>   #ifndef _LINUX_SCHED_EXT_H
>   #define _LINUX_SCHED_EXT_H
>   
>   #ifdef CONFIG_SCHED_CLASS_EXT
> -#error "NOT IMPLEMENTED YET"
> +
> +#include <linux/rhashtable.h>
> +#include <linux/llist.h>
> +
> +enum scx_consts {
> +	SCX_OPS_NAME_LEN	= 128,
> +	SCX_EXIT_REASON_LEN	= 128,
> +	SCX_EXIT_BT_LEN		= 64,
> +	SCX_EXIT_MSG_LEN	= 1024,
> +
> +	SCX_SLICE_DFL		= 20 * NSEC_PER_MSEC,
> +};
> +
> +/*
> + * DSQ (dispatch queue) IDs are 64bit of the format:
> + *
> + *   Bits: [63] [62 ..  0]
> + *         [ B] [   ID   ]
> + *
> + *    B: 1 for IDs for built-in DSQs, 0 for ops-created user DSQs
> + *   ID: 63 bit ID
> + *
> + * Built-in IDs:
> + *
> + *   Bits: [63] [62] [61..32] [31 ..  0]
> + *         [ 1] [ L] [   R  ] [    V   ]
> + *
> + *    1: 1 for built-in DSQs.
> + *    L: 1 for LOCAL_ON DSQ IDs, 0 for others
> + *    V: For LOCAL_ON DSQ IDs, a CPU number. For others, a pre-defined value.
> + */
> +enum scx_dsq_id_flags {
> +	SCX_DSQ_FLAG_BUILTIN	= 1LLU << 63,
> +	SCX_DSQ_FLAG_LOCAL_ON	= 1LLU << 62,
> +
> +	SCX_DSQ_INVALID		= SCX_DSQ_FLAG_BUILTIN | 0,
> +	SCX_DSQ_GLOBAL		= SCX_DSQ_FLAG_BUILTIN | 1,
> +	SCX_DSQ_LOCAL		= SCX_DSQ_FLAG_BUILTIN | 2,
> +	SCX_DSQ_LOCAL_ON	= SCX_DSQ_FLAG_BUILTIN | SCX_DSQ_FLAG_LOCAL_ON,
> +	SCX_DSQ_LOCAL_CPU_MASK	= 0xffffffffLLU,
> +};
> +
> +enum scx_exit_kind {
> +	SCX_EXIT_NONE,
> +	SCX_EXIT_DONE,
> +
> +	SCX_EXIT_UNREG = 64,	/* BPF unregistration */
> +
> +	SCX_EXIT_ERROR = 1024,	/* runtime error, error msg contains details */
> +	SCX_EXIT_ERROR_BPF,	/* ERROR but triggered through scx_bpf_error() */
> +};
> +
> +/*
> + * scx_exit_info is passed to ops.exit() to describe why the BPF scheduler is
> + * being disabled.
> + */
> +struct scx_exit_info {
> +	/* %SCX_EXIT_* - broad category of the exit reason */
> +	enum scx_exit_kind	kind;
> +	/* textual representation of the above */
> +	char			reason[SCX_EXIT_REASON_LEN];
> +	/* number of entries in the backtrace */
> +	u32			bt_len;
> +	/* backtrace if exiting due to an error */
> +	unsigned long		bt[SCX_EXIT_BT_LEN];
> +	/* extra message */
> +	char			msg[SCX_EXIT_MSG_LEN];
> +};
> +
> +/* sched_ext_ops.flags */
> +enum scx_ops_flags {
> +	/*
> +	 * Keep built-in idle tracking even if ops.update_idle() is implemented.
> +	 */
> +	SCX_OPS_KEEP_BUILTIN_IDLE = 1LLU << 0,
> +
> +	/*
> +	 * By default, if there are no other tasks to run on the CPU, ext core
> +	 * keeps running the current task even after its slice expires. If this
> +	 * flag is specified, such tasks are passed to ops.enqueue() with
> +	 * %SCX_ENQ_LAST. See the comment above %SCX_ENQ_LAST for more info.
> +	 */
> +	SCX_OPS_ENQ_LAST	= 1LLU << 1,
> +
> +	/*
> +	 * An exiting task may schedule after PF_EXITING is set. In such cases,
> +	 * bpf_task_from_pid() may not be able to find the task and if the BPF
> +	 * scheduler depends on pid lookup for dispatching, the task will be
> +	 * lost leading to various issues including RCU grace period stalls.
> +	 *
> +	 * To mask this problem, by default, unhashed tasks are automatically
> +	 * dispatched to the local DSQ on enqueue. If the BPF scheduler doesn't
> +	 * depend on pid lookups and wants to handle these tasks directly, the
> +	 * following flag can be used.
> +	 */
> +	SCX_OPS_ENQ_EXITING	= 1LLU << 2,
> +
> +	SCX_OPS_ALL_FLAGS	= SCX_OPS_KEEP_BUILTIN_IDLE |
> +				  SCX_OPS_ENQ_LAST |
> +				  SCX_OPS_ENQ_EXITING,
> +};
> +
> +/* argument container for ops.enable() and friends */
> +struct scx_enable_args {
> +	/* empty for now */
> +};
> +
> +/**
> + * struct sched_ext_ops - Operation table for BPF scheduler implementation
> + *
> + * Userland can implement an arbitrary scheduling policy by implementing and
> + * loading operations in this table.
> + */
> +struct sched_ext_ops {
> +	/**
> +	 * select_cpu - Pick the target CPU for a task which is being woken up
> +	 * @p: task being woken up
> +	 * @prev_cpu: the cpu @p was on before sleeping
> +	 * @wake_flags: SCX_WAKE_*
> +	 *
> +	 * Decision made here isn't final. @p may be moved to any CPU while it
> +	 * is getting dispatched for execution later. However, as @p is not on
> +	 * the rq at this point, getting the eventual execution CPU right here
> +	 * saves a small bit of overhead down the line.
> +	 *
> +	 * If an idle CPU is returned, the CPU is kicked and will try to
> +	 * dispatch. While an explicit custom mechanism can be added,
> +	 * select_cpu() serves as the default way to wake up idle CPUs.
> +	 */
> +	s32 (*select_cpu)(struct task_struct *p, s32 prev_cpu, u64 wake_flags);
> +
> +	/**
> +	 * enqueue - Enqueue a task on the BPF scheduler
> +	 * @p: task being enqueued
> +	 * @enq_flags: %SCX_ENQ_*
> +	 *
> +	 * @p is ready to run. Dispatch directly by calling scx_bpf_dispatch()
> +	 * or enqueue on the BPF scheduler. If not directly dispatched, the bpf
> +	 * scheduler owns @p and if it fails to dispatch @p, the task will
> +	 * stall.
> +	 */
> +	void (*enqueue)(struct task_struct *p, u64 enq_flags);
> +
> +	/**
> +	 * dequeue - Remove a task from the BPF scheduler
> +	 * @p: task being dequeued
> +	 * @deq_flags: %SCX_DEQ_*
> +	 *
> +	 * Remove @p from the BPF scheduler. This is usually called to isolate
> +	 * the task while updating its scheduling properties (e.g. priority).
> +	 *
> +	 * The ext core keeps track of whether the BPF side owns a given task or
> +	 * not and can gracefully ignore spurious dispatches from BPF side,
> +	 * which makes it safe to not implement this method. However, depending
> +	 * on the scheduling logic, this can lead to confusing behaviors - e.g.
> +	 * scheduling position not being updated across a priority change.
> +	 */
> +	void (*dequeue)(struct task_struct *p, u64 deq_flags);
> +
> +	/**
> +	 * dispatch - Dispatch tasks from the BPF scheduler and/or consume DSQs
> +	 * @cpu: CPU to dispatch tasks for
> +	 * @prev: previous task being switched out
> +	 *
> +	 * Called when a CPU's local dsq is empty. The operation should dispatch
> +	 * one or more tasks from the BPF scheduler into the DSQs using
> +	 * scx_bpf_dispatch() and/or consume user DSQs into the local DSQ using
> +	 * scx_bpf_consume().
> +	 *
> +	 * The maximum number of times scx_bpf_dispatch() can be called without
> +	 * an intervening scx_bpf_consume() is specified by
> +	 * ops.dispatch_max_batch. See the comments on top of the two functions
> +	 * for more details.
> +	 *
> +	 * When not %NULL, @prev is an SCX task with its slice depleted. If
> +	 * @prev is still runnable, as indicated by %SCX_TASK_QUEUED being set in
> +	 * @prev->scx.flags, it is not enqueued yet and will be enqueued after
> +	 * ops.dispatch() returns. To keep executing @prev, return without
> +	 * dispatching or consuming any tasks. Also see %SCX_OPS_ENQ_LAST.
> +	 */
> +	void (*dispatch)(s32 cpu, struct task_struct *prev);
> +
> +	/**
> +	 * yield - Yield CPU
> +	 * @from: yielding task
> +	 * @to: optional yield target task
> +	 *
> +	 * If @to is NULL, @from is yielding the CPU to other runnable tasks.
> +	 * The BPF scheduler should ensure that other available tasks are
> +	 * dispatched before the yielding task. Return value is ignored in this
> +	 * case.
> +	 *
> +	 * If @to is not-NULL, @from wants to yield the CPU to @to. If the bpf
> +	 * scheduler can implement the request, return %true; otherwise, %false.
> +	 */
> +	bool (*yield)(struct task_struct *from, struct task_struct *to);
> +
> +	/**
> +	 * set_weight - Set task weight
> +	 * @p: task to set weight for
> +	 * @weight: new weight [1..10000]
> +	 *
> +	 * Update @p's weight to @weight.
> +	 */
> +	void (*set_weight)(struct task_struct *p, u32 weight);
> +
> +	/**
> +	 * set_cpumask - Set CPU affinity
> +	 * @p: task to set CPU affinity for
> +	 * @cpumask: cpumask of cpus that @p can run on
> +	 *
> +	 * Update @p's CPU affinity to @cpumask.
> +	 */
> +	void (*set_cpumask)(struct task_struct *p,
> +			    const struct cpumask *cpumask);
> +
> +	/**
> +	 * update_idle - Update the idle state of a CPU
> +	 * @cpu: CPU to update the idle state for
> +	 * @idle: whether entering or exiting the idle state
> +	 *
> +	 * This operation is called when @rq's CPU enters or leaves the idle
> +	 * state. By default, implementing this operation disables the built-in
> +	 * idle CPU tracking and the following helpers become unavailable:
> +	 *
> +	 * - scx_bpf_select_cpu_dfl()
> +	 * - scx_bpf_test_and_clear_cpu_idle()
> +	 * - scx_bpf_pick_idle_cpu()
> +	 *
> +	 * The user also must implement ops.select_cpu() as the default
> +	 * implementation relies on scx_bpf_select_cpu_dfl().
> +	 *
> +	 * Specify the %SCX_OPS_KEEP_BUILTIN_IDLE flag to keep the built-in idle
> +	 * tracking.
> +	 */
> +	void (*update_idle)(s32 cpu, bool idle);
> +
> +	/**
> +	 * prep_enable - Prepare to enable BPF scheduling for a task
> +	 * @p: task to prepare BPF scheduling for
> +	 * @args: enable arguments, see the struct definition
> +	 *
> +	 * Either we're loading a BPF scheduler or a new task is being forked.
> +	 * Prepare BPF scheduling for @p. This operation may block and can be
> +	 * used for allocations.
> +	 *
> +	 * Return 0 for success, -errno for failure. An error return while
> +	 * loading will abort loading of the BPF scheduler. During a fork, will
> +	 * abort the specific fork.
> +	 */
> +	s32 (*prep_enable)(struct task_struct *p, struct scx_enable_args *args);
> +
> +	/**
> +	 * enable - Enable BPF scheduling for a task
> +	 * @p: task to enable BPF scheduling for
> +	 * @args: enable arguments, see the struct definition
> +	 *
> +	 * Enable @p for BPF scheduling. @p will start running soon.
> +	 */
> +	void (*enable)(struct task_struct *p, struct scx_enable_args *args);
> +
> +	/**
> +	 * cancel_enable - Cancel prep_enable()
> +	 * @p: task being canceled
> +	 * @args: enable arguments, see the struct definition
> +	 *
> +	 * @p was prep_enable()'d but failed before reaching enable(). Undo the
> +	 * preparation.
> +	 */
> +	void (*cancel_enable)(struct task_struct *p,
> +			      struct scx_enable_args *args);
> +
> +	/**
> +	 * disable - Disable BPF scheduling for a task
> +	 * @p: task to disable BPF scheduling for
> +	 *
> +	 * @p is exiting, leaving SCX or the BPF scheduler is being unloaded.
> +	 * Disable BPF scheduling for @p.
> +	 */
> +	void (*disable)(struct task_struct *p);
> +
> +	/*
> +	 * All online ops must come before ops.init().
> +	 */
> +
> +	/**
> +	 * init - Initialize the BPF scheduler
> +	 */
> +	s32 (*init)(void);
> +
> +	/**
> +	 * exit - Clean up after the BPF scheduler
> +	 * @info: Exit info
> +	 */
> +	void (*exit)(struct scx_exit_info *info);
> +
> +	/**
> +	 * dispatch_max_batch - Max nr of tasks that dispatch() can dispatch
> +	 */
> +	u32 dispatch_max_batch;
> +
> +	/**
> +	 * flags - %SCX_OPS_* flags
> +	 */
> +	u64 flags;
> +
> +	/**
> +	 * name - BPF scheduler's name
> +	 *
> +	 * Must be a non-zero valid BPF object name including only isalnum(),
> +	 * '_' and '.' chars. Shows up in kernel.sched_ext_ops sysctl while the
> +	 * BPF scheduler is enabled.
> +	 */
> +	char name[SCX_OPS_NAME_LEN];
> +};
> +
> +/*
> + * Dispatch queue (dsq) is a simple FIFO which is used to buffer between the
> + * scheduler core and the BPF scheduler. See the documentation for more details.
> + */
> +struct scx_dispatch_q {
> +	raw_spinlock_t		lock;
> +	struct list_head	fifo;	/* processed in dispatching order */
> +	u32			nr;
> +	u64			id;
> +	struct rhash_head	hash_node;
> +	struct llist_node	free_node;
> +	struct rcu_head		rcu;
> +};
> +
> +/* scx_entity.flags */
> +enum scx_ent_flags {
> +	SCX_TASK_QUEUED		= 1 << 0, /* on ext runqueue */
> +	SCX_TASK_BAL_KEEP	= 1 << 1, /* balance decided to keep current */
> +	SCX_TASK_ENQ_LOCAL	= 1 << 2, /* used by scx_select_cpu_dfl() to set SCX_ENQ_LOCAL */
> +
> +	SCX_TASK_OPS_PREPPED	= 1 << 8, /* prepared for BPF scheduler enable */
> +	SCX_TASK_OPS_ENABLED	= 1 << 9, /* task has BPF scheduler enabled */
> +
> +	SCX_TASK_DEQD_FOR_SLEEP	= 1 << 17, /* last dequeue was for SLEEP */
> +
> +	SCX_TASK_CURSOR		= 1 << 31, /* iteration cursor, not a task */
> +};
> +
> +/*
> + * Mask bits for scx_entity.kf_mask. Not all kfuncs can be called from
> + * everywhere and the following bits track which kfunc sets are currently
> + * allowed for %current. This simple per-task tracking works because SCX ops
> + * nest in a limited way. BPF will likely implement a way to allow and disallow
> + * kfuncs depending on the calling context which will replace this manual
> + * mechanism. See scx_kf_allow().
> + */
> +enum scx_kf_mask {
> +	SCX_KF_UNLOCKED		= 0,	  /* not sleepable, not rq locked */
> +	/* all non-sleepables may be nested inside INIT and SLEEPABLE */
> +	SCX_KF_INIT		= 1 << 0, /* running ops.init() */
> +	SCX_KF_SLEEPABLE	= 1 << 1, /* other sleepable init operations */
> +	/* ops.dequeue (in REST) may be nested inside DISPATCH */
> +	SCX_KF_DISPATCH		= 1 << 3, /* ops.dispatch() */
> +	SCX_KF_ENQUEUE		= 1 << 4, /* ops.enqueue() */
> +	SCX_KF_REST		= 1 << 5, /* other rq-locked operations */
> +
> +	__SCX_KF_RQ_LOCKED	= SCX_KF_DISPATCH | SCX_KF_ENQUEUE | SCX_KF_REST,
> +};
> +
> +/*
> + * The following is embedded in task_struct and contains all fields necessary
> + * for a task to be scheduled by SCX.
> + */
> +struct sched_ext_entity {
> +	struct scx_dispatch_q	*dsq;
> +	struct list_head	dsq_node;
> +	u32			flags;		/* protected by rq lock */
> +	u32			weight;
> +	s32			sticky_cpu;
> +	s32			holding_cpu;
> +	u32			kf_mask;	/* see scx_kf_mask above */
> +	atomic_long_t		ops_state;
> +
> +	/* BPF scheduler modifiable fields */
> +
> +	/*
> +	 * Runtime budget in nsecs. This is usually set through
> +	 * scx_bpf_dispatch() but can also be modified directly by the BPF
> +	 * scheduler. Automatically decreased by SCX as the task executes. On
> +	 * depletion, a scheduling event is triggered.
> +	 */
> +	u64			slice;
> +
> +	/* cold fields */
> +	struct list_head	tasks_node;
> +};
> +
> +void sched_ext_free(struct task_struct *p);
> +
>   #else	/* !CONFIG_SCHED_CLASS_EXT */
>   
>   static inline void sched_ext_free(struct task_struct *p) {}
> diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
> index 3bac0a8ceab2..359a14cc76a4 100644
> --- a/include/uapi/linux/sched.h
> +++ b/include/uapi/linux/sched.h
> @@ -118,6 +118,7 @@ struct clone_args {
>   /* SCHED_ISO: reserved but not implemented yet */
>   #define SCHED_IDLE		5
>   #define SCHED_DEADLINE		6
> +#define SCHED_EXT		7
>   
>   /* Can be ORed in to make sure the process is reverted back to SCHED_NORMAL on fork */
>   #define SCHED_RESET_ON_FORK     0x40000000
> diff --git a/init/init_task.c b/init/init_task.c
> index f703116e0523..7eaf8b429f82 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -6,6 +6,7 @@
>   #include <linux/sched/sysctl.h>
>   #include <linux/sched/rt.h>
>   #include <linux/sched/task.h>
> +#include <linux/sched/ext.h>
>   #include <linux/init.h>
>   #include <linux/fs.h>
>   #include <linux/mm.h>
> @@ -102,6 +103,15 @@ struct task_struct init_task
>   #endif
>   #ifdef CONFIG_CGROUP_SCHED
>   	.sched_task_group = &root_task_group,
> +#endif
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +	.scx		= {
> +		.dsq_node	= LIST_HEAD_INIT(init_task.scx.dsq_node),
> +		.sticky_cpu	= -1,
> +		.holding_cpu	= -1,
> +		.ops_state	= ATOMIC_INIT(0),
> +		.slice		= SCX_SLICE_DFL,
> +	},
>   #endif
>   	.ptraced	= LIST_HEAD_INIT(init_task.ptraced),
>   	.ptrace_entry	= LIST_HEAD_INIT(init_task.ptrace_entry),
> diff --git a/kernel/Kconfig.preempt b/kernel/Kconfig.preempt
> index c2f1fd95a821..0afcda19bc50 100644
> --- a/kernel/Kconfig.preempt
> +++ b/kernel/Kconfig.preempt
> @@ -133,4 +133,24 @@ config SCHED_CORE
>   	  which is the likely usage by Linux distributions, there should
>   	  be no measurable impact on performance.
>   
> -
> +config SCHED_CLASS_EXT
> +	bool "Extensible Scheduling Class"
> +	depends on BPF_SYSCALL && BPF_JIT && !SCHED_CORE
> +	help
> +	  This option enables a new scheduler class sched_ext (SCX), which
> +	  allows scheduling policies to be implemented as BPF programs to
> +	  achieve the following:
> +
> +	  - Ease of experimentation and exploration: Enabling rapid
> +	    iteration of new scheduling policies.
> +	  - Customization: Building application-specific schedulers which
> +	    implement policies that are not applicable to general-purpose
> +	    schedulers.
> +	  - Rapid scheduler deployments: Non-disruptive swap outs of
> +	    scheduling policies in production environments.
> +
> +	  sched_ext leverages BPF’s struct_ops feature to define a structure
> +	  which exports function callbacks and flags to BPF programs that
> +	  wish to implement scheduling policies. The struct_ops structure
> +	  exported by sched_ext is struct sched_ext_ops, and is conceptually
> +	  similar to struct sched_class.
> diff --git a/kernel/bpf/bpf_struct_ops_types.h b/kernel/bpf/bpf_struct_ops_types.h
> index 5678a9ddf817..3618769d853d 100644
> --- a/kernel/bpf/bpf_struct_ops_types.h
> +++ b/kernel/bpf/bpf_struct_ops_types.h
> @@ -9,4 +9,8 @@ BPF_STRUCT_OPS_TYPE(bpf_dummy_ops)
>   #include <net/tcp.h>
>   BPF_STRUCT_OPS_TYPE(tcp_congestion_ops)
>   #endif
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +#include <linux/sched/ext.h>
> +BPF_STRUCT_OPS_TYPE(sched_ext_ops)
> +#endif
>   #endif
> diff --git a/kernel/sched/build_policy.c b/kernel/sched/build_policy.c
> index d9dc9ab3773f..4c658b21f603 100644
> --- a/kernel/sched/build_policy.c
> +++ b/kernel/sched/build_policy.c
> @@ -28,6 +28,7 @@
>   #include <linux/suspend.h>
>   #include <linux/tsacct_kern.h>
>   #include <linux/vtime.h>
> +#include <linux/percpu-rwsem.h>
>   
>   #include <uapi/linux/sched/types.h>
>   
> @@ -52,3 +53,6 @@
>   #include "cputime.c"
>   #include "deadline.c"
>   
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +# include "ext.c"
> +#endif
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 22ce11c3a115..21307eb284c2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -3961,6 +3961,15 @@ bool cpus_share_resources(int this_cpu, int that_cpu)
>   
>   static inline bool ttwu_queue_cond(struct task_struct *p, int cpu)
>   {
> +	/*
> +	 * The BPF scheduler may depend on select_task_rq() being invoked during
> +	 * wakeups. In addition, @p may end up executing on a different CPU
> +	 * regardless of what happens in the wakeup path making the ttwu_queue
> +	 * optimization less meaningful. Skip if on SCX.
> +	 */
> +	if (task_on_scx(p))
> +		return false;
> +
>   	/*
>   	 * Do not complicate things with the async wake_list while the CPU is
>   	 * in hotplug state.
> @@ -4531,6 +4540,18 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
>   	p->rt.on_rq		= 0;
>   	p->rt.on_list		= 0;
>   
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +	p->scx.dsq		= NULL;
> +	INIT_LIST_HEAD(&p->scx.dsq_node);
> +	p->scx.flags		= 0;
> +	p->scx.weight		= 0;
> +	p->scx.sticky_cpu	= -1;
> +	p->scx.holding_cpu	= -1;
> +	p->scx.kf_mask		= 0;
> +	atomic64_set(&p->scx.ops_state, 0);
> +	p->scx.slice		= SCX_SLICE_DFL;
> +#endif
> +
>   #ifdef CONFIG_PREEMPT_NOTIFIERS
>   	INIT_HLIST_HEAD(&p->preempt_notifiers);
>   #endif
> @@ -4779,6 +4800,10 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
>   		goto out_cancel;
>   	} else if (rt_prio(p->prio)) {
>   		p->sched_class = &rt_sched_class;
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +	} else if (task_should_scx(p)) {
> +		p->sched_class = &ext_sched_class;
> +#endif
>   	} else {
>   		p->sched_class = &fair_sched_class;
>   	}
> @@ -7059,6 +7084,10 @@ void __setscheduler_prio(struct task_struct *p, int prio)
>   		p->sched_class = &dl_sched_class;
>   	else if (rt_prio(prio))
>   		p->sched_class = &rt_sched_class;
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +	else if (task_should_scx(p))
> +		p->sched_class = &ext_sched_class;
> +#endif
>   	else
>   		p->sched_class = &fair_sched_class;
>   
> @@ -9055,6 +9084,7 @@ SYSCALL_DEFINE1(sched_get_priority_max, int, policy)
>   	case SCHED_NORMAL:
>   	case SCHED_BATCH:
>   	case SCHED_IDLE:
> +	case SCHED_EXT:
>   		ret = 0;
>   		break;
>   	}
> @@ -9082,6 +9112,7 @@ SYSCALL_DEFINE1(sched_get_priority_min, int, policy)
>   	case SCHED_NORMAL:
>   	case SCHED_BATCH:
>   	case SCHED_IDLE:
> +	case SCHED_EXT:
>   		ret = 0;
>   	}
>   	return ret;
> @@ -9918,6 +9949,10 @@ void __init sched_init(void)
>   	BUG_ON(!sched_class_above(&dl_sched_class, &rt_sched_class));
>   	BUG_ON(!sched_class_above(&rt_sched_class, &fair_sched_class));
>   	BUG_ON(!sched_class_above(&fair_sched_class, &idle_sched_class));
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +	BUG_ON(!sched_class_above(&fair_sched_class, &ext_sched_class));
> +	BUG_ON(!sched_class_above(&ext_sched_class, &idle_sched_class));
> +#endif
>   
>   	wait_bit_init();
>   
> @@ -12047,3 +12082,38 @@ void sched_mm_cid_fork(struct task_struct *t)
>   	t->mm_cid_active = 1;
>   }
>   #endif
> +
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +void sched_deq_and_put_task(struct task_struct *p, int queue_flags,
> +			    struct sched_enq_and_set_ctx *ctx)
> +{
> +	struct rq *rq = task_rq(p);
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	*ctx = (struct sched_enq_and_set_ctx){
> +		.p = p,
> +		.queue_flags = queue_flags,
> +		.queued = task_on_rq_queued(p),
> +		.running = task_current(rq, p),
> +	};
> +
> +	update_rq_clock(rq);
> +	if (ctx->queued)
> +		dequeue_task(rq, p, queue_flags | DEQUEUE_NOCLOCK);
> +	if (ctx->running)
> +		put_prev_task(rq, p);
> +}
> +
> +void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx)
> +{
> +	struct rq *rq = task_rq(ctx->p);
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	if (ctx->queued)
> +		enqueue_task(rq, ctx->p, ctx->queue_flags | ENQUEUE_NOCLOCK);
> +	if (ctx->running)
> +		set_next_task(rq, ctx->p);
> +}
> +#endif	/* CONFIG_SCHED_CLASS_EXT */
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 4580a450700e..6587a45ffe96 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -374,6 +374,9 @@ static __init int sched_init_debug(void)
>   
>   	debugfs_create_file("debug", 0444, debugfs_sched, NULL, &sched_debug_fops);
>   
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +	debugfs_create_file("ext", 0444, debugfs_sched, NULL, &sched_ext_fops);
> +#endif
>   	return 0;
>   }
>   late_initcall(sched_init_debug);
> @@ -1085,6 +1088,9 @@ void proc_sched_show_task(struct task_struct *p, struct pid_namespace *ns,
>   		P(dl.runtime);
>   		P(dl.deadline);
>   	}
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +	__PS("ext.enabled", task_on_scx(p));
> +#endif
>   #undef PN_SCHEDSTAT
>   #undef P_SCHEDSTAT
>   
> diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
> new file mode 100644
> index 000000000000..7b78f77d2293
> --- /dev/null
> +++ b/kernel/sched/ext.c
> @@ -0,0 +1,3158 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
> + * Copyright (c) 2022 Tejun Heo <tj@...nel.org>
> + * Copyright (c) 2022 David Vernet <dvernet@...a.com>
> + */
> +#define SCX_OP_IDX(op)		(offsetof(struct sched_ext_ops, op) / sizeof(void (*)(void)))
> +
> +enum scx_internal_consts {
> +	SCX_NR_ONLINE_OPS	= SCX_OP_IDX(init),
> +	SCX_DSP_DFL_MAX_BATCH	= 32,
> +};
> +
> +enum scx_ops_enable_state {
> +	SCX_OPS_PREPPING,
> +	SCX_OPS_ENABLING,
> +	SCX_OPS_ENABLED,
> +	SCX_OPS_DISABLING,
> +	SCX_OPS_DISABLED,
> +};
> +
> +/*
> + * sched_ext_entity->ops_state
> + *
> + * Used to track the task ownership between the SCX core and the BPF scheduler.
> + * State transitions look as follows:
> + *
> + * NONE -> QUEUEING -> QUEUED -> DISPATCHING
> + *   ^              |                 |
> + *   |              v                 v
> + *   \-------------------------------/
> + *
> + * QUEUEING and DISPATCHING states can be waited upon. See wait_ops_state() call
> + * sites for explanations on the conditions being waited upon and why they are
> + * safe. Transitions out of them into NONE or QUEUED must store_release and the
> + * waiters should load_acquire.
> + *
> + * Tracking scx_ops_state enables sched_ext core to reliably determine whether
> + * any given task can be dispatched by the BPF scheduler at all times and thus
> + * relaxes the requirements on the BPF scheduler. This allows the BPF scheduler
> + * to try to dispatch any task anytime regardless of its state as the SCX core
> + * can safely reject invalid dispatches.
> + */
> +enum scx_ops_state {
> +	SCX_OPSS_NONE,		/* owned by the SCX core */
> +	SCX_OPSS_QUEUEING,	/* in transit to the BPF scheduler */
> +	SCX_OPSS_QUEUED,	/* owned by the BPF scheduler */
> +	SCX_OPSS_DISPATCHING,	/* in transit back to the SCX core */
> +
> +	/*
> +	 * QSEQ brands each QUEUED instance so that, when dispatch races
> +	 * dequeue/requeue, the dispatcher can tell whether it still has a claim
> +	 * on the task being dispatched.
> +	 *
> +	 * As some 32bit archs can't do 64bit store_release/load_acquire,
> +	 * p->scx.ops_state is atomic_long_t which leaves 30 bits for QSEQ on
> +	 * 32bit machines. The dispatch race window QSEQ protects is very narrow
> +	 * and runs with IRQ disabled. 30 bits should be sufficient.
> +	 */
> +	SCX_OPSS_QSEQ_SHIFT	= 2,
> +};
> +
> +/* Use macros to ensure that the type is unsigned long for the masks */
> +#define SCX_OPSS_STATE_MASK	((1LU << SCX_OPSS_QSEQ_SHIFT) - 1)
> +#define SCX_OPSS_QSEQ_MASK	(~SCX_OPSS_STATE_MASK)
> +
> +/*
> + * During exit, a task may schedule after losing its PIDs. When disabling the
> + * BPF scheduler, we need to be able to iterate tasks in every state to
> + * guarantee system safety. Maintain a dedicated task list which contains every
> + * task between its fork and eventual free.
> + */
> +static DEFINE_SPINLOCK(scx_tasks_lock);
> +static LIST_HEAD(scx_tasks);
> +
> +/* ops enable/disable */
> +static struct kthread_worker *scx_ops_helper;
> +static DEFINE_MUTEX(scx_ops_enable_mutex);
> +DEFINE_STATIC_KEY_FALSE(__scx_ops_enabled);
> +DEFINE_STATIC_PERCPU_RWSEM(scx_fork_rwsem);
> +static atomic_t scx_ops_enable_state_var = ATOMIC_INIT(SCX_OPS_DISABLED);
> +static struct sched_ext_ops scx_ops;
> +static bool scx_warned_zero_slice;
> +
> +static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_last);
> +static DEFINE_STATIC_KEY_FALSE(scx_ops_enq_exiting);
> +static DEFINE_STATIC_KEY_FALSE(scx_builtin_idle_enabled);
> +
> +struct static_key_false scx_has_op[SCX_NR_ONLINE_OPS] =
> +	{ [0 ... SCX_NR_ONLINE_OPS-1] = STATIC_KEY_FALSE_INIT };
> +
> +static atomic_t scx_exit_kind = ATOMIC_INIT(SCX_EXIT_DONE);
> +static struct scx_exit_info scx_exit_info;
> +
> +/* idle tracking */
> +#ifdef CONFIG_SMP
> +#ifdef CONFIG_CPUMASK_OFFSTACK
> +#define CL_ALIGNED_IF_ONSTACK
> +#else
> +#define CL_ALIGNED_IF_ONSTACK __cacheline_aligned_in_smp
> +#endif
> +
> +static struct {
> +	cpumask_var_t cpu;
> +	cpumask_var_t smt;
> +} idle_masks CL_ALIGNED_IF_ONSTACK;
> +
> +#endif	/* CONFIG_SMP */
> +
> +/*
> + * Direct dispatch marker.
> + *
> + * Non-NULL values are used for direct dispatch from enqueue path. A valid
> + * pointer points to the task currently being enqueued. An ERR_PTR value is used
> + * to indicate that direct dispatch has already happened.
> + */
> +static DEFINE_PER_CPU(struct task_struct *, direct_dispatch_task);
> +
> +/* dispatch queues */
> +static struct scx_dispatch_q __cacheline_aligned_in_smp scx_dsq_global;
> +
> +static const struct rhashtable_params dsq_hash_params = {
> +	.key_len		= 8,
> +	.key_offset		= offsetof(struct scx_dispatch_q, id),
> +	.head_offset		= offsetof(struct scx_dispatch_q, hash_node),
> +};
> +
> +static struct rhashtable dsq_hash;
> +static LLIST_HEAD(dsqs_to_free);
> +
> +/* dispatch buf */
> +struct scx_dsp_buf_ent {
> +	struct task_struct	*task;
> +	unsigned long		qseq;
> +	u64			dsq_id;
> +	u64			enq_flags;
> +};
> +
> +static u32 scx_dsp_max_batch;
> +static struct scx_dsp_buf_ent __percpu *scx_dsp_buf;
> +
> +struct scx_dsp_ctx {
> +	struct rq		*rq;
> +	struct rq_flags		*rf;
> +	u32			buf_cursor;
> +	u32			nr_tasks;
> +};
> +
> +static DEFINE_PER_CPU(struct scx_dsp_ctx, scx_dsp_ctx);
> +
> +void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
> +		      u64 enq_flags);
> +__printf(2, 3) static void scx_ops_error_kind(enum scx_exit_kind kind,
> +					      const char *fmt, ...);
> +#define scx_ops_error(fmt, args...)						\
> +	scx_ops_error_kind(SCX_EXIT_ERROR, fmt, ##args)
> +
> +struct scx_task_iter {
> +	struct sched_ext_entity		cursor;
> +	struct task_struct		*locked;
> +	struct rq			*rq;
> +	struct rq_flags			rf;
> +};
> +
> +#define SCX_HAS_OP(op)	static_branch_likely(&scx_has_op[SCX_OP_IDX(op)])
> +
> +/* if the highest set bit is N, return a mask with bits [N+1, 31] set */
> +static u32 higher_bits(u32 flags)
> +{
> +	return ~((1 << fls(flags)) - 1);
> +}
> +
> +/* return the mask with only the highest bit set */
> +static u32 highest_bit(u32 flags)
> +{
> +	int bit = fls(flags);
> +	return bit ? 1 << (bit - 1) : 0;
> +}
> +
> +/*
> + * scx_kf_mask enforcement. Some kfuncs can only be called from specific SCX
> + * ops. When invoking SCX ops, SCX_CALL_OP[_RET]() should be used to indicate
> + * the allowed kfuncs and those kfuncs should use scx_kf_allowed() to check
> + * whether it's running from an allowed context.
> + *
> + * @mask is constant, always inline to cull the mask calculations.
> + */
> +static __always_inline void scx_kf_allow(u32 mask)
> +{
> +	/* nesting is allowed only in increasing scx_kf_mask order */
> +	WARN_ONCE((mask | higher_bits(mask)) & current->scx.kf_mask,
> +		  "invalid nesting current->scx.kf_mask=0x%x mask=0x%x\n",
> +		  current->scx.kf_mask, mask);
> +	current->scx.kf_mask |= mask;
> +}
> +
> +static void scx_kf_disallow(u32 mask)
> +{
> +	current->scx.kf_mask &= ~mask;
> +}
> +
> +#define SCX_CALL_OP(mask, op, args...)						\
> +do {										\
> +	if (mask) {								\
> +		scx_kf_allow(mask);						\
> +		scx_ops.op(args);						\
> +		scx_kf_disallow(mask);						\
> +	} else {								\
> +		scx_ops.op(args);						\
> +	}									\
> +} while (0)
> +
> +#define SCX_CALL_OP_RET(mask, op, args...)					\
> +({										\
> +	__typeof__(scx_ops.op(args)) __ret;					\
> +	if (mask) {								\
> +		scx_kf_allow(mask);						\
> +		__ret = scx_ops.op(args);					\
> +		scx_kf_disallow(mask);						\
> +	} else {								\
> +		__ret = scx_ops.op(args);					\
> +	}									\
> +	__ret;									\
> +})
> +
> +/* @mask is constant, always inline to cull unnecessary branches */
> +static __always_inline bool scx_kf_allowed(u32 mask)
> +{
> +	if (unlikely(!(current->scx.kf_mask & mask))) {
> +		scx_ops_error("kfunc with mask 0x%x called from an operation only allowing 0x%x",
> +			      mask, current->scx.kf_mask);
> +		return false;
> +	}
> +
> +	if (unlikely((mask & (SCX_KF_INIT | SCX_KF_SLEEPABLE)) &&
> +		     in_interrupt())) {
> +		scx_ops_error("sleepable kfunc called from non-sleepable context");
> +		return false;
> +	}
> +
> +	/*
> +	 * Enforce nesting boundaries. e.g. A kfunc which can be called from
> +	 * DISPATCH must not be called if we're running DEQUEUE which is nested
> +	 * inside ops.dispatch(). We don't need to check the SCX_KF_SLEEPABLE
> +	 * boundary thanks to the above in_interrupt() check.
> +	 */
> +	if (unlikely(highest_bit(mask) == SCX_KF_DISPATCH &&
> +		     (current->scx.kf_mask & higher_bits(SCX_KF_DISPATCH)))) {
> +		scx_ops_error("dispatch kfunc called from a nested operation");
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +/**
> + * scx_task_iter_init - Initialize a task iterator
> + * @iter: iterator to init
> + *
> + * Initialize @iter. Must be called with scx_tasks_lock held. Once initialized,
> + * @iter must eventually be exited with scx_task_iter_exit().
> + *
> + * scx_tasks_lock may be released between this and the first next() call or
> + * between any two next() calls. If scx_tasks_lock is released between two
> + * next() calls, the caller is responsible for ensuring that the task being
> + * iterated remains accessible either through RCU read lock or obtaining a
> + * reference count.
> + *
> + * All tasks which existed when the iteration started are guaranteed to be
> + * visited as long as they still exist.
> + */
> +static void scx_task_iter_init(struct scx_task_iter *iter)
> +{
> +	lockdep_assert_held(&scx_tasks_lock);
> +
> +	iter->cursor = (struct sched_ext_entity){ .flags = SCX_TASK_CURSOR };
> +	list_add(&iter->cursor.tasks_node, &scx_tasks);
> +	iter->locked = NULL;
> +}
> +
> +/**
> + * scx_task_iter_exit - Exit a task iterator
> + * @iter: iterator to exit
> + *
> + * Exit a previously initialized @iter. Must be called with scx_tasks_lock held.
> + * If the iterator holds a task's rq lock, that rq lock is released. See
> + * scx_task_iter_init() for details.
> + */
> +static void scx_task_iter_exit(struct scx_task_iter *iter)
> +{
> +	struct list_head *cursor = &iter->cursor.tasks_node;
> +
> +	lockdep_assert_held(&scx_tasks_lock);
> +
> +	if (iter->locked) {
> +		task_rq_unlock(iter->rq, iter->locked, &iter->rf);
> +		iter->locked = NULL;
> +	}
> +
> +	if (list_empty(cursor))
> +		return;
> +
> +	list_del_init(cursor);
> +}
> +
> +/**
> + * scx_task_iter_next - Next task
> + * @iter: iterator to walk
> + *
> + * Visit the next task. See scx_task_iter_init() for details.
> + */
> +static struct task_struct *scx_task_iter_next(struct scx_task_iter *iter)
> +{
> +	struct list_head *cursor = &iter->cursor.tasks_node;
> +	struct sched_ext_entity *pos;
> +
> +	lockdep_assert_held(&scx_tasks_lock);
> +
> +	list_for_each_entry(pos, cursor, tasks_node) {
> +		if (&pos->tasks_node == &scx_tasks)
> +			return NULL;
> +		if (!(pos->flags & SCX_TASK_CURSOR)) {
> +			list_move(cursor, &pos->tasks_node);
> +			return container_of(pos, struct task_struct, scx);
> +		}
> +	}
> +
> +	/* can't happen, should always terminate at scx_tasks above */
> +	BUG();
> +}
> +
> +/**
> + * scx_task_iter_next_filtered - Next non-idle task
> + * @iter: iterator to walk
> + *
> + * Visit the next non-idle task. See scx_task_iter_init() for details.
> + */
> +static struct task_struct *
> +scx_task_iter_next_filtered(struct scx_task_iter *iter)
> +{
> +	struct task_struct *p;
> +
> +	while ((p = scx_task_iter_next(iter))) {
> +		/*
> +		 * is_idle_task() tests %PF_IDLE which may not be set for CPUs
> +		 * which haven't yet been onlined. Test sched_class directly.
> +		 */
> +		if (p->sched_class != &idle_sched_class)
> +			return p;
> +	}
> +	return NULL;
> +}
> +
> +/**
> + * scx_task_iter_next_filtered_locked - Next non-idle task with its rq locked
> + * @iter: iterator to walk
> + *
> + * Visit the next non-idle task with its rq lock held. See scx_task_iter_init()
> + * for details.
> + */
> +static struct task_struct *
> +scx_task_iter_next_filtered_locked(struct scx_task_iter *iter)
> +{
> +	struct task_struct *p;
> +
> +	if (iter->locked) {
> +		task_rq_unlock(iter->rq, iter->locked, &iter->rf);
> +		iter->locked = NULL;
> +	}
> +
> +	p = scx_task_iter_next_filtered(iter);
> +	if (!p)
> +		return NULL;
> +
> +	iter->rq = task_rq_lock(p, &iter->rf);
> +	iter->locked = p;
> +	return p;
> +}
> +
> +static enum scx_ops_enable_state scx_ops_enable_state(void)
> +{
> +	return atomic_read(&scx_ops_enable_state_var);
> +}
> +
> +static enum scx_ops_enable_state
> +scx_ops_set_enable_state(enum scx_ops_enable_state to)
> +{
> +	return atomic_xchg(&scx_ops_enable_state_var, to);
> +}
> +
> +static bool scx_ops_tryset_enable_state(enum scx_ops_enable_state to,
> +					enum scx_ops_enable_state from)
> +{
> +	int from_v = from;
> +
> +	return atomic_try_cmpxchg(&scx_ops_enable_state_var, &from_v, to);
> +}
> +
> +static bool scx_ops_disabling(void)
> +{
> +	return unlikely(scx_ops_enable_state() == SCX_OPS_DISABLING);
> +}
> +
> +/**
> + * wait_ops_state - Busy-wait the specified ops state to end
> + * @p: target task
> + * @opss: state to wait the end of
> + *
> + * Busy-wait for @p to transition out of @opss. This can only be used when the
> + * state part of @opss is %SCX_QUEUEING or %SCX_DISPATCHING. This function also
> + * has load_acquire semantics to ensure that the caller can see the updates made
> + * in the enqueueing and dispatching paths.
> + */
> +static void wait_ops_state(struct task_struct *p, unsigned long opss)
> +{
> +	do {
> +		cpu_relax();
> +	} while (atomic_long_read_acquire(&p->scx.ops_state) == opss);
> +}
> +
> +/**
> + * ops_cpu_valid - Verify a cpu number
> + * @cpu: cpu number which came from a BPF ops
> + *
> + * @cpu is a cpu number which came from the BPF scheduler and can be any value.
> + * Verify that it is in range and one of the possible cpus.
> + */
> +static bool ops_cpu_valid(s32 cpu)
> +{
> +	return likely(cpu >= 0 && cpu < nr_cpu_ids && cpu_possible(cpu));
> +}
> +
> +/**
> + * ops_sanitize_err - Sanitize a -errno value
> + * @ops_name: operation to blame on failure
> + * @err: -errno value to sanitize
> + *
> + * Verify @err is a valid -errno. If not, trigger scx_ops_error() and return
> + * -%EPROTO. This is necessary because returning a rogue -errno up the chain can
> + * cause misbehaviors. For an example, a large negative return from
> + * ops.prep_enable() triggers an oops when passed up the call chain because the
> + * value fails IS_ERR() test after being encoded with ERR_PTR() and then is
> + * handled as a pointer.
> + */
> +static int ops_sanitize_err(const char *ops_name, s32 err)
> +{
> +	if (err < 0 && err >= -MAX_ERRNO)
> +		return err;
> +
> +	scx_ops_error("ops.%s() returned an invalid errno %d", ops_name, err);
> +	return -EPROTO;
> +}
> +
> +static void update_curr_scx(struct rq *rq)
> +{
> +	struct task_struct *curr = rq->curr;
> +	u64 now = rq_clock_task(rq);
> +	u64 delta_exec;
> +
> +	if (time_before_eq64(now, curr->se.exec_start))
> +		return;
> +
> +	delta_exec = now - curr->se.exec_start;
> +	curr->se.exec_start = now;
> +	curr->se.sum_exec_runtime += delta_exec;
> +	account_group_exec_runtime(curr, delta_exec);
> +	cgroup_account_cputime(curr, delta_exec);
> +
> +	curr->scx.slice -= min(curr->scx.slice, delta_exec);
> +}
> +
> +static void dispatch_enqueue(struct scx_dispatch_q *dsq, struct task_struct *p,
> +			     u64 enq_flags)
> +{
> +	bool is_local = dsq->id == SCX_DSQ_LOCAL;
> +
> +	WARN_ON_ONCE(p->scx.dsq || !list_empty(&p->scx.dsq_node));
> +
> +	if (!is_local) {
> +		raw_spin_lock(&dsq->lock);
> +		if (unlikely(dsq->id == SCX_DSQ_INVALID)) {
> +			scx_ops_error("attempting to dispatch to a destroyed dsq");
> +			/* fall back to the global dsq */
> +			raw_spin_unlock(&dsq->lock);
> +			dsq = &scx_dsq_global;
> +			raw_spin_lock(&dsq->lock);
> +		}
> +	}
> +
> +	if (enq_flags & SCX_ENQ_HEAD)
> +		list_add(&p->scx.dsq_node, &dsq->fifo);
> +	else
> +		list_add_tail(&p->scx.dsq_node, &dsq->fifo);
> +	dsq->nr++;
> +	p->scx.dsq = dsq;
> +
> +	/*
> +	 * We're transitioning out of QUEUEING or DISPATCHING. store_release to
> +	 * match waiters' load_acquire.
> +	 */
> +	if (enq_flags & SCX_ENQ_CLEAR_OPSS)
> +		atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
> +
> +	if (is_local) {
> +		struct rq *rq = container_of(dsq, struct rq, scx.local_dsq);
> +
> +		if (sched_class_above(&ext_sched_class, rq->curr->sched_class))
> +			resched_curr(rq);
> +	} else {
> +		raw_spin_unlock(&dsq->lock);
> +	}
> +}
> +
> +static void dispatch_dequeue(struct scx_rq *scx_rq, struct task_struct *p)
> +{
> +	struct scx_dispatch_q *dsq = p->scx.dsq;
> +	bool is_local = dsq == &scx_rq->local_dsq;
> +
> +	if (!dsq) {
> +		WARN_ON_ONCE(!list_empty(&p->scx.dsq_node));
> +		/*
> +		 * When dispatching directly from the BPF scheduler to a local
> +		 * DSQ, the task isn't associated with any DSQ but
> +		 * @p->scx.holding_cpu may be set under the protection of
> +		 * %SCX_OPSS_DISPATCHING.
> +		 */
> +		if (p->scx.holding_cpu >= 0)
> +			p->scx.holding_cpu = -1;
> +		return;
> +	}
> +
> +	if (!is_local)
> +		raw_spin_lock(&dsq->lock);
> +
> +	/*
> +	 * Now that we hold @dsq->lock, @p->holding_cpu and @p->scx.dsq_node
> +	 * can't change underneath us.
> +	*/
> +	if (p->scx.holding_cpu < 0) {
> +		/* @p must still be on @dsq, dequeue */
> +		WARN_ON_ONCE(list_empty(&p->scx.dsq_node));
> +		list_del_init(&p->scx.dsq_node);
> +		dsq->nr--;
> +	} else {
> +		/*
> +		 * We're racing against dispatch_to_local_dsq() which already
> +		 * removed @p from @dsq and set @p->scx.holding_cpu. Clear the
> +		 * holding_cpu which tells dispatch_to_local_dsq() that it lost
> +		 * the race.
> +		 */
> +		WARN_ON_ONCE(!list_empty(&p->scx.dsq_node));
> +		p->scx.holding_cpu = -1;
> +	}
> +	p->scx.dsq = NULL;
> +
> +	if (!is_local)
> +		raw_spin_unlock(&dsq->lock);
> +}
> +
> +static struct scx_dispatch_q *find_non_local_dsq(u64 dsq_id)
> +{
> +	lockdep_assert(rcu_read_lock_any_held());
> +
> +	if (dsq_id == SCX_DSQ_GLOBAL)
> +		return &scx_dsq_global;
> +	else
> +		return rhashtable_lookup_fast(&dsq_hash, &dsq_id,
> +					      dsq_hash_params);
> +}
> +
> +static struct scx_dispatch_q *find_dsq_for_dispatch(struct rq *rq, u64 dsq_id,
> +						    struct task_struct *p)
> +{
> +	struct scx_dispatch_q *dsq;
> +
> +	if (dsq_id == SCX_DSQ_LOCAL)
> +		return &rq->scx.local_dsq;
> +
> +	dsq = find_non_local_dsq(dsq_id);
> +	if (unlikely(!dsq)) {
> +		scx_ops_error("non-existent DSQ 0x%llx for %s[%d]",
> +			      dsq_id, p->comm, p->pid);
> +		return &scx_dsq_global;
> +	}
> +
> +	return dsq;
> +}
> +
> +static void direct_dispatch(struct task_struct *ddsp_task, struct task_struct *p,
> +			    u64 dsq_id, u64 enq_flags)
> +{
> +	struct scx_dispatch_q *dsq;
> +
> +	/* @p must match the task which is being enqueued */
> +	if (unlikely(p != ddsp_task)) {
> +		if (IS_ERR(ddsp_task))
> +			scx_ops_error("%s[%d] already direct-dispatched",
> +				      p->comm, p->pid);
> +		else
> +			scx_ops_error("enqueueing %s[%d] but trying to direct-dispatch %s[%d]",
> +				      ddsp_task->comm, ddsp_task->pid,
> +				      p->comm, p->pid);
> +		return;
> +	}
> +
> +	/*
> +	 * %SCX_DSQ_LOCAL_ON is not supported during direct dispatch because
> +	 * dispatching to the local DSQ of a different CPU requires unlocking
> +	 * the current rq which isn't allowed in the enqueue path. Use
> +	 * ops.select_cpu() to be on the target CPU and then %SCX_DSQ_LOCAL.
> +	 */
> +	if (unlikely((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON)) {
> +		scx_ops_error("SCX_DSQ_LOCAL_ON can't be used for direct-dispatch");
> +		return;
> +	}
> +
> +	dsq = find_dsq_for_dispatch(task_rq(p), dsq_id, p);
> +	dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
> +
> +	/*
> +	 * Mark that dispatch already happened by spoiling direct_dispatch_task
> +	 * with a non-NULL value which can never match a valid task pointer.
> +	 */
> +	__this_cpu_write(direct_dispatch_task, ERR_PTR(-ESRCH));
> +}
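
The restriction makes sense. For reference, the pattern the comment suggests
(steer the task from ops.select_cpu() and then direct-dispatch with
SCX_DSQ_LOCAL from ops.enqueue()) would look roughly like this on the BPF
side. This is only a sketch: it assumes vmlinux.h plus the series' example
headers for the SCX_* constants and the scx_bpf_dispatch() kfunc declaration.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("struct_ops/select_cpu")
s32 BPF_PROG(sketch_select_cpu, struct task_struct *p, s32 prev_cpu,
	     u64 wake_flags)
{
	/* decide here which CPU @p should run on ... */
	return prev_cpu;
}

SEC("struct_ops/enqueue")
void BPF_PROG(sketch_enqueue, struct task_struct *p, u64 enq_flags)
{
	/*
	 * ... if I follow find_dsq_for_dispatch() above, SCX_DSQ_LOCAL
	 * resolves through task_rq(p), i.e. the CPU that select_cpu()
	 * picked, so SCX_DSQ_LOCAL_ON isn't needed here.
	 */
	scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
}

(hooked up through the usual SEC(".struct_ops") sched_ext_ops map, as in the
series' examples)
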
> +
> +static bool test_rq_online(struct rq *rq)
> +{
> +#ifdef CONFIG_SMP
> +	return rq->online;
> +#else
> +	return true;
> +#endif
> +}
> +
> +static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
> +			    int sticky_cpu)
> +{
> +	struct task_struct **ddsp_taskp;
> +	unsigned long qseq;
> +
> +	WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_QUEUED));
> +
> +	if (p->scx.flags & SCX_TASK_ENQ_LOCAL) {
> +		enq_flags |= SCX_ENQ_LOCAL;
> +		p->scx.flags &= ~SCX_TASK_ENQ_LOCAL;
> +	}
> +
> +	/* rq migration */
> +	if (sticky_cpu == cpu_of(rq))
> +		goto local_norefill;
> +
> +	/*
> +	 * If !rq->online, we already told the BPF scheduler that the CPU is
> +	 * offline. We're just trying to on/offline the CPU. Don't bother the
> +	 * BPF scheduler.
> +	 */
> +	if (unlikely(!test_rq_online(rq)))
> +		goto local;
> +
> +	/* see %SCX_OPS_ENQ_EXITING */
> +	if (!static_branch_unlikely(&scx_ops_enq_exiting) &&
> +	    unlikely(p->flags & PF_EXITING))
> +		goto local;
> +
> +	/* see %SCX_OPS_ENQ_LAST */
> +	if (!static_branch_unlikely(&scx_ops_enq_last) &&
> +	    (enq_flags & SCX_ENQ_LAST))
> +		goto local;
> +
> +	if (!SCX_HAS_OP(enqueue)) {
> +		if (enq_flags & SCX_ENQ_LOCAL)
> +			goto local;
> +		else
> +			goto global;
> +	}
> +
> +	/* DSQ bypass didn't trigger, enqueue on the BPF scheduler */
> +	qseq = rq->scx.ops_qseq++ << SCX_OPSS_QSEQ_SHIFT;
> +
> +	WARN_ON_ONCE(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
> +	atomic_long_set(&p->scx.ops_state, SCX_OPSS_QUEUEING | qseq);
> +
> +	ddsp_taskp = this_cpu_ptr(&direct_dispatch_task);
> +	WARN_ON_ONCE(*ddsp_taskp);
> +	*ddsp_taskp = p;
> +
> +	SCX_CALL_OP(SCX_KF_ENQUEUE, enqueue, p, enq_flags);
> +
> +	/*
> +	 * If not directly dispatched, QUEUEING isn't clear yet and dispatch or
> +	 * dequeue may be waiting. The store_release matches their load_acquire.
> +	 */
> +	if (*ddsp_taskp == p)
> +		atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_QUEUED | qseq);
> +	*ddsp_taskp = NULL;
> +	return;
> +
> +local:
> +	p->scx.slice = SCX_SLICE_DFL;
> +local_norefill:
> +	dispatch_enqueue(&rq->scx.local_dsq, p, enq_flags);
> +	return;
> +
> +global:
> +	p->scx.slice = SCX_SLICE_DFL;
> +	dispatch_enqueue(&scx_dsq_global, p, enq_flags);
> +}
> +
> +static void enqueue_task_scx(struct rq *rq, struct task_struct *p, int enq_flags)
> +{
> +	int sticky_cpu = p->scx.sticky_cpu;
> +
> +	enq_flags |= rq->scx.extra_enq_flags;
> +
> +	if (sticky_cpu >= 0)
> +		p->scx.sticky_cpu = -1;
> +
> +	/*
> +	 * Restoring a running task will be immediately followed by
> +	 * set_next_task_scx() which expects the task to not be on the BPF
> +	 * scheduler as tasks can only start running through local DSQs. Force
> +	 * direct-dispatch into the local DSQ by setting the sticky_cpu.
> +	 */
> +	if (unlikely(enq_flags & ENQUEUE_RESTORE) && task_current(rq, p))
> +		sticky_cpu = cpu_of(rq);
> +
> +	if (p->scx.flags & SCX_TASK_QUEUED)
> +		return;
> +
> +	p->scx.flags |= SCX_TASK_QUEUED;
> +	rq->scx.nr_running++;
> +	add_nr_running(rq, 1);
> +
> +	do_enqueue_task(rq, p, enq_flags, sticky_cpu);
> +}
> +
> +static void ops_dequeue(struct task_struct *p, u64 deq_flags)
> +{
> +	unsigned long opss;
> +
> +	/* acquire ensures that we see the preceding updates on QUEUED */
> +	opss = atomic_long_read_acquire(&p->scx.ops_state);
> +
> +	switch (opss & SCX_OPSS_STATE_MASK) {
> +	case SCX_OPSS_NONE:
> +		break;
> +	case SCX_OPSS_QUEUEING:
> +		/*
> +		 * QUEUEING is started and finished while holding @p's rq lock.
> +		 * As we're holding the rq lock now, we shouldn't see QUEUEING.
> +		 */
> +		BUG();
> +	case SCX_OPSS_QUEUED:
> +		if (SCX_HAS_OP(dequeue))
> +			SCX_CALL_OP(SCX_KF_REST, dequeue, p, deq_flags);
> +
> +		if (atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
> +					    SCX_OPSS_NONE))
> +			break;
> +		fallthrough;
> +	case SCX_OPSS_DISPATCHING:
> +		/*
> +		 * If @p is being dispatched from the BPF scheduler to a DSQ,
> +		 * wait for the transfer to complete so that @p doesn't get
> +		 * added to its DSQ after dequeueing is complete.
> +		 *
> +		 * As we're waiting on DISPATCHING with the rq locked, the
> +		 * dispatching side shouldn't try to lock the rq while
> +		 * DISPATCHING is set. See dispatch_to_local_dsq().
> +		 *
> +		 * DISPATCHING shouldn't have qseq set and control can reach
> +		 * here with NONE @opss from the above QUEUED case block.
> +		 * Explicitly wait on %SCX_OPSS_DISPATCHING instead of @opss.
> +		 */
> +		wait_ops_state(p, SCX_OPSS_DISPATCHING);
> +		BUG_ON(atomic_long_read(&p->scx.ops_state) != SCX_OPSS_NONE);
> +		break;
> +	}
> +}
> +
> +static void dequeue_task_scx(struct rq *rq, struct task_struct *p, int deq_flags)
> +{
> +	struct scx_rq *scx_rq = &rq->scx;
> +
> +	if (!(p->scx.flags & SCX_TASK_QUEUED))
> +		return;
> +
> +	ops_dequeue(p, deq_flags);
> +
> +	if (deq_flags & SCX_DEQ_SLEEP)
> +		p->scx.flags |= SCX_TASK_DEQD_FOR_SLEEP;
> +	else
> +		p->scx.flags &= ~SCX_TASK_DEQD_FOR_SLEEP;
> +
> +	p->scx.flags &= ~SCX_TASK_QUEUED;
> +	scx_rq->nr_running--;
> +	sub_nr_running(rq, 1);
> +
> +	dispatch_dequeue(scx_rq, p);
> +}
> +
> +static void yield_task_scx(struct rq *rq)
> +{
> +	struct task_struct *p = rq->curr;
> +
> +	if (SCX_HAS_OP(yield))
> +		SCX_CALL_OP_RET(SCX_KF_REST, yield, p, NULL);
> +	else
> +		p->scx.slice = 0;
> +}
> +
> +static bool yield_to_task_scx(struct rq *rq, struct task_struct *to)
> +{
> +	struct task_struct *from = rq->curr;
> +
> +	if (SCX_HAS_OP(yield))
> +		return SCX_CALL_OP_RET(SCX_KF_REST, yield, from, to);
> +	else
> +		return false;
> +}
> +
> +#ifdef CONFIG_SMP
> +/**
> + * move_task_to_local_dsq - Move a task from a different rq to a local DSQ
> + * @rq: rq to move the task into, currently locked
> + * @p: task to move
> + * @enq_flags: %SCX_ENQ_*
> + *
> + * Move @p which is currently on a different rq to @rq's local DSQ. The caller
> + * must:
> + *
> + * 1. Start with exclusive access to @p either through its DSQ lock or
> + *    %SCX_OPSS_DISPATCHING flag.
> + *
> + * 2. Set @p->scx.holding_cpu to raw_smp_processor_id().
> + *
> + * 3. Remember task_rq(@p). Release the exclusive access so that we don't
> + *    deadlock with dequeue.
> + *
> + * 4. Lock @rq and the task_rq from #3.
> + *
> + * 5. Call this function.
> + *
> + * Returns %true if @p was successfully moved. %false after racing dequeue and
> + * losing.
> + */
> +static bool move_task_to_local_dsq(struct rq *rq, struct task_struct *p,
> +				   u64 enq_flags)
> +{
> +	struct rq *task_rq;
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	/*
> +	 * If dequeue got to @p while we were trying to lock both rq's, it'd
> +	 * have cleared @p->scx.holding_cpu to -1. While other cpus may have
> +	 * updated it to different values afterwards, as this operation can't be
> +	 * preempted or recurse, @p->scx.holding_cpu can never become
> +	 * raw_smp_processor_id() again before we're done. Thus, we can tell
> +	 * whether we lost to dequeue by testing whether @p->scx.holding_cpu is
> +	 * still raw_smp_processor_id().
> +	 *
> +	 * See dispatch_dequeue() for the counterpart.
> +	 */
> +	if (unlikely(p->scx.holding_cpu != raw_smp_processor_id()))
> +		return false;
> +
> +	/* @p->rq couldn't have changed if we're still the holding cpu */
> +	task_rq = task_rq(p);
> +	lockdep_assert_rq_held(task_rq);
> +
> +	WARN_ON_ONCE(!cpumask_test_cpu(cpu_of(rq), p->cpus_ptr));
> +	deactivate_task(task_rq, p, 0);
> +	set_task_cpu(p, cpu_of(rq));
> +	p->scx.sticky_cpu = cpu_of(rq);
> +
> +	/*
> +	 * We want to pass scx-specific enq_flags but activate_task() will
> +	 * truncate the upper 32 bit. As we own @rq, we can pass them through
> +	 * @rq->scx.extra_enq_flags instead.
> +	 */
> +	WARN_ON_ONCE(rq->scx.extra_enq_flags);
> +	rq->scx.extra_enq_flags = enq_flags;
> +	activate_task(rq, p, 0);
> +	rq->scx.extra_enq_flags = 0;
> +
> +	return true;
> +}
> +
> +/**
> + * dispatch_to_local_dsq_lock - Ensure source and destination rq's are locked
> + * @rq: current rq which is locked
> + * @rf: rq_flags to use when unlocking @rq
> + * @src_rq: rq to move task from
> + * @dst_rq: rq to move task to
> + *
> + * We're holding @rq lock and trying to dispatch a task from @src_rq to
> + * @dst_rq's local DSQ and thus need to lock both @src_rq and @dst_rq. Whether
> + * @rq stays locked isn't important as long as the state is restored after
> + * dispatch_to_local_dsq_unlock().
> + */
> +static void dispatch_to_local_dsq_lock(struct rq *rq, struct rq_flags *rf,
> +				       struct rq *src_rq, struct rq *dst_rq)
> +{
> +	rq_unpin_lock(rq, rf);
> +
> +	if (src_rq == dst_rq) {
> +		raw_spin_rq_unlock(rq);
> +		raw_spin_rq_lock(dst_rq);
> +	} else if (rq == src_rq) {
> +		double_lock_balance(rq, dst_rq);
> +		rq_repin_lock(rq, rf);
> +	} else if (rq == dst_rq) {
> +		double_lock_balance(rq, src_rq);
> +		rq_repin_lock(rq, rf);
> +	} else {
> +		raw_spin_rq_unlock(rq);
> +		double_rq_lock(src_rq, dst_rq);
> +	}
> +}
> +
> +/**
> + * dispatch_to_local_dsq_unlock - Undo dispatch_to_local_dsq_lock()
> + * @rq: current rq which is locked
> + * @rf: rq_flags to use when unlocking @rq
> + * @src_rq: rq to move task from
> + * @dst_rq: rq to move task to
> + *
> + * Unlock @src_rq and @dst_rq and ensure that @rq is locked on return.
> + */
> +static void dispatch_to_local_dsq_unlock(struct rq *rq, struct rq_flags *rf,
> +					 struct rq *src_rq, struct rq *dst_rq)
> +{
> +	if (src_rq == dst_rq) {
> +		raw_spin_rq_unlock(dst_rq);
> +		raw_spin_rq_lock(rq);
> +		rq_repin_lock(rq, rf);
> +	} else if (rq == src_rq) {
> +		double_unlock_balance(rq, dst_rq);
> +	} else if (rq == dst_rq) {
> +		double_unlock_balance(rq, src_rq);
> +	} else {
> +		double_rq_unlock(src_rq, dst_rq);
> +		raw_spin_rq_lock(rq);
> +		rq_repin_lock(rq, rf);
> +	}
> +}
> +#endif	/* CONFIG_SMP */
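
The four cases above all boil down to "end up holding both rq locks without
deadlocking against another CPU taking the same pair in the opposite order."
As a generic illustration of that constraint only (not of the kernel's
double_lock_balance()/double_rq_lock() internals), the usual userspace
pattern orders the acquisitions by a stable key:

/* generic ABBA-avoidance sketch, plain pthreads */
#include <pthread.h>
#include <stdint.h>

static void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	if (a == b) {
		pthread_mutex_lock(a);
		return;
	}
	if ((uintptr_t)a > (uintptr_t)b) {	/* fixed order: lower address first */
		pthread_mutex_t *tmp = a;
		a = b;
		b = tmp;
	}
	pthread_mutex_lock(a);
	pthread_mutex_lock(b);
}

static void unlock_pair(pthread_mutex_t *a, pthread_mutex_t *b)
{
	pthread_mutex_unlock(b);
	if (a != b)
		pthread_mutex_unlock(a);
}
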
> +
> +static bool consume_dispatch_q(struct rq *rq, struct rq_flags *rf,
> +			       struct scx_dispatch_q *dsq)
> +{
> +	struct scx_rq *scx_rq = &rq->scx;
> +	struct task_struct *p;
> +	struct rq *task_rq;
> +	bool moved = false;
> +retry:
> +	if (list_empty(&dsq->fifo))
> +		return false;
> +
> +	raw_spin_lock(&dsq->lock);
> +	list_for_each_entry(p, &dsq->fifo, scx.dsq_node) {
> +		task_rq = task_rq(p);
> +		if (rq == task_rq)
> +			goto this_rq;
> +		if (likely(test_rq_online(rq)) && !is_migration_disabled(p) &&
> +		    cpumask_test_cpu(cpu_of(rq), p->cpus_ptr))
> +			goto remote_rq;
> +	}
> +	raw_spin_unlock(&dsq->lock);
> +	return false;
> +
> +this_rq:
> +	/* @dsq is locked and @p is on this rq */
> +	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> +	list_move_tail(&p->scx.dsq_node, &scx_rq->local_dsq.fifo);
> +	dsq->nr--;
> +	scx_rq->local_dsq.nr++;
> +	p->scx.dsq = &scx_rq->local_dsq;
> +	raw_spin_unlock(&dsq->lock);
> +	return true;
> +
> +remote_rq:
> +#ifdef CONFIG_SMP
> +	/*
> +	 * @dsq is locked and @p is on a remote rq. @p is currently protected by
> +	 * @dsq->lock. We want to pull @p to @rq but may deadlock if we grab
> +	 * @task_rq while holding @dsq and @rq locks. As dequeue can't drop the
> +	 * rq lock or fail, do a little dancing from our side. See
> +	 * move_task_to_local_dsq().
> +	 */
> +	WARN_ON_ONCE(p->scx.holding_cpu >= 0);
> +	list_del_init(&p->scx.dsq_node);
> +	dsq->nr--;
> +	p->scx.holding_cpu = raw_smp_processor_id();
> +	raw_spin_unlock(&dsq->lock);
> +
> +	rq_unpin_lock(rq, rf);
> +	double_lock_balance(rq, task_rq);
> +	rq_repin_lock(rq, rf);
> +
> +	moved = move_task_to_local_dsq(rq, p, 0);
> +
> +	double_unlock_balance(rq, task_rq);
> +#endif /* CONFIG_SMP */
> +	if (likely(moved))
> +		return true;
> +	goto retry;
> +}
> +
> +enum dispatch_to_local_dsq_ret {
> +	DTL_DISPATCHED,		/* successfully dispatched */
> +	DTL_LOST,		/* lost race to dequeue */
> +	DTL_NOT_LOCAL,		/* destination is not a local DSQ */
> +	DTL_INVALID,		/* invalid local dsq_id */
> +};
> +
> +/**
> + * dispatch_to_local_dsq - Dispatch a task to a local dsq
> + * @rq: current rq which is locked
> + * @rf: rq_flags to use when unlocking @rq
> + * @dsq_id: destination dsq ID
> + * @p: task to dispatch
> + * @enq_flags: %SCX_ENQ_*
> + *
> + * We're holding @rq lock and want to dispatch @p to the local DSQ identified by
> + * @dsq_id. This function performs all the synchronization dancing needed
> + * because local DSQs are protected with rq locks.
> + *
> + * The caller must have exclusive ownership of @p (e.g. through
> + * %SCX_OPSS_DISPATCHING).
> + */
> +static enum dispatch_to_local_dsq_ret
> +dispatch_to_local_dsq(struct rq *rq, struct rq_flags *rf, u64 dsq_id,
> +		      struct task_struct *p, u64 enq_flags)
> +{
> +	struct rq *src_rq = task_rq(p);
> +	struct rq *dst_rq;
> +
> +	/*
> +	 * We're synchronized against dequeue through DISPATCHING. As @p can't
> +	 * be dequeued, its task_rq and cpus_allowed are stable too.
> +	 */
> +	if (dsq_id == SCX_DSQ_LOCAL) {
> +		dst_rq = rq;
> +	} else if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
> +		s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
> +
> +		if (!ops_cpu_valid(cpu)) {
> +			scx_ops_error("invalid cpu %d in SCX_DSQ_LOCAL_ON verdict for %s[%d]",
> +				      cpu, p->comm, p->pid);
> +			return DTL_INVALID;
> +		}
> +		dst_rq = cpu_rq(cpu);
> +	} else {
> +		return DTL_NOT_LOCAL;
> +	}
> +
> +	/* if dispatching to @rq that @p is already on, no lock dancing needed */
> +	if (rq == src_rq && rq == dst_rq) {
> +		dispatch_enqueue(&dst_rq->scx.local_dsq, p,
> +				 enq_flags | SCX_ENQ_CLEAR_OPSS);
> +		return DTL_DISPATCHED;
> +	}
> +
> +#ifdef CONFIG_SMP
> +	if (cpumask_test_cpu(cpu_of(dst_rq), p->cpus_ptr)) {
> +		struct rq *locked_dst_rq = dst_rq;
> +		bool dsp;
> +
> +		/*
> +		 * @p is on a possibly remote @src_rq which we need to lock to
> +		 * move the task. If dequeue is in progress, it'd be locking
> +		 * @src_rq and waiting on DISPATCHING, so we can't grab @src_rq
> +		 * lock while holding DISPATCHING.
> +		 *
> +		 * As DISPATCHING guarantees that @p is wholly ours, we can
> +		 * pretend that we're moving from a DSQ and use the same
> +		 * mechanism - mark the task under transfer with holding_cpu,
> +		 * release DISPATCHING and then follow the same protocol.
> +		 */
> +		p->scx.holding_cpu = raw_smp_processor_id();
> +
> +		/* store_release ensures that dequeue sees the above */
> +		atomic_long_set_release(&p->scx.ops_state, SCX_OPSS_NONE);
> +
> +		dispatch_to_local_dsq_lock(rq, rf, src_rq, locked_dst_rq);
> +
> +		/*
> +		 * We don't require the BPF scheduler to avoid dispatching to
> +		 * offline CPUs mostly for convenience but also because CPUs can
> +		 * go offline between scx_bpf_dispatch() calls and here. If @p
> +		 * is destined to an offline CPU, queue it on its current CPU
> +		 * instead, which should always be safe. As this is an allowed
> +		 * behavior, don't trigger an ops error.
> +		 */
> +		if (unlikely(!test_rq_online(dst_rq)))
> +			dst_rq = src_rq;
> +
> +		if (src_rq == dst_rq) {
> +			/*
> +			 * As @p is staying on the same rq, there's no need to
> +			 * go through the full deactivate/activate cycle.
> +			 * Optimize by abbreviating the operations in
> +			 * move_task_to_local_dsq().
> +			 */
> +			dsp = p->scx.holding_cpu == raw_smp_processor_id();
> +			if (likely(dsp)) {
> +				p->scx.holding_cpu = -1;
> +				dispatch_enqueue(&dst_rq->scx.local_dsq, p,
> +						 enq_flags);
> +			}
> +		} else {
> +			dsp = move_task_to_local_dsq(dst_rq, p, enq_flags);
> +		}
> +
> +		/* if the destination CPU is idle, wake it up */
> +		if (dsp && p->sched_class > dst_rq->curr->sched_class)
> +			resched_curr(dst_rq);
> +
> +		dispatch_to_local_dsq_unlock(rq, rf, src_rq, locked_dst_rq);
> +
> +		return dsp ? DTL_DISPATCHED : DTL_LOST;
> +	}
> +#endif /* CONFIG_SMP */
> +
> +	scx_ops_error("SCX_DSQ_LOCAL[_ON] verdict target cpu %d not allowed for %s[%d]",
> +		      cpu_of(dst_rq), p->comm, p->pid);
> +	return DTL_INVALID;
> +}
> +
> +/**
> + * finish_dispatch - Asynchronously finish dispatching a task
> + * @rq: current rq which is locked
> + * @rf: rq_flags to use when unlocking @rq
> + * @p: task to finish dispatching
> + * @qseq_at_dispatch: qseq when @p started getting dispatched
> + * @dsq_id: destination DSQ ID
> + * @enq_flags: %SCX_ENQ_*
> + *
> + * Dispatching to local DSQs may need to wait for queueing to complete or
> + * require rq lock dancing. As we don't wanna do either while inside
> + * ops.dispatch() to avoid locking order inversion, we split dispatching into
> + * two parts. scx_bpf_dispatch() which is called by ops.dispatch() records the
> + * task and its qseq. Once ops.dispatch() returns, this function is called to
> + * finish up.
> + *
> + * There is no guarantee that @p is still valid for dispatching or even that it
> + * was valid in the first place. Make sure that the task is still owned by the
> + * BPF scheduler and claim the ownership before dispatching.
> + */
> +static void finish_dispatch(struct rq *rq, struct rq_flags *rf,
> +			    struct task_struct *p,
> +			    unsigned long qseq_at_dispatch,
> +			    u64 dsq_id, u64 enq_flags)
> +{
> +	struct scx_dispatch_q *dsq;
> +	unsigned long opss;
> +
> +retry:
> +	/*
> +	 * No need for _acquire here. @p is accessed only after a successful
> +	 * try_cmpxchg to DISPATCHING.
> +	 */
> +	opss = atomic_long_read(&p->scx.ops_state);
> +
> +	switch (opss & SCX_OPSS_STATE_MASK) {
> +	case SCX_OPSS_DISPATCHING:
> +	case SCX_OPSS_NONE:
> +		/* someone else already got to it */
> +		return;
> +	case SCX_OPSS_QUEUED:
> +		/*
> +		 * If qseq doesn't match, @p has gone through at least one
> +		 * dispatch/dequeue and re-enqueue cycle between
> +		 * scx_bpf_dispatch() and here and we have no claim on it.
> +		 */
> +		if ((opss & SCX_OPSS_QSEQ_MASK) != qseq_at_dispatch)
> +			return;
> +
> +		/*
> +		 * While we know @p is accessible, we don't yet have a claim on
> +		 * it - the BPF scheduler is allowed to dispatch tasks
> +		 * spuriously and there can be a racing dequeue attempt. Let's
> +		 * claim @p by atomically transitioning it from QUEUED to
> +		 * DISPATCHING.
> +		 */
> +		if (likely(atomic_long_try_cmpxchg(&p->scx.ops_state, &opss,
> +						   SCX_OPSS_DISPATCHING)))
> +			break;
> +		goto retry;
> +	case SCX_OPSS_QUEUEING:
> +		/*
> +		 * do_enqueue_task() is in the process of transferring the task
> +		 * to the BPF scheduler while holding @p's rq lock. As we aren't
> +		 * holding any kernel or BPF resource that the enqueue path may
> +		 * depend upon, it's safe to wait.
> +		 */
> +		wait_ops_state(p, opss);
> +		goto retry;
> +	}
> +
> +	BUG_ON(!(p->scx.flags & SCX_TASK_QUEUED));
> +
> +	switch (dispatch_to_local_dsq(rq, rf, dsq_id, p, enq_flags)) {
> +	case DTL_DISPATCHED:
> +		break;
> +	case DTL_LOST:
> +		break;
> +	case DTL_INVALID:
> +		dsq_id = SCX_DSQ_GLOBAL;
> +		fallthrough;
> +	case DTL_NOT_LOCAL:
> +		dsq = find_dsq_for_dispatch(cpu_rq(raw_smp_processor_id()),
> +					    dsq_id, p);
> +		dispatch_enqueue(dsq, p, enq_flags | SCX_ENQ_CLEAR_OPSS);
> +		break;
> +	}
> +}
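
To check my own reading of the qseq handshake: ops_state packs an SCX_OPSS_*
state into the low bits and the qseq above it, and the claim only succeeds
when both the QUEUED state and the qseq recorded at scx_bpf_dispatch() time
are still intact. A compilable userspace model of just that claim (the
constants are made up; only the layout and the cmpxchg-based transition
mirror the code above):

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

enum { OPSS_NONE, OPSS_QUEUEING, OPSS_QUEUED, OPSS_DISPATCHING };

#define OPSS_STATE_MASK		0x3UL	/* assumed width, not the patch's */
#define OPSS_QSEQ_SHIFT		2

static bool claim_for_dispatch(atomic_ulong *ops_state,
			       unsigned long qseq_at_dispatch)
{
	unsigned long opss = atomic_load(ops_state);

	/* no claim if already claimed/finished or if the task was re-enqueued */
	if ((opss & OPSS_STATE_MASK) != OPSS_QUEUED ||
	    (opss & ~OPSS_STATE_MASK) != qseq_at_dispatch)
		return false;

	/* QUEUED|qseq -> DISPATCHING takes exclusive ownership of the task */
	return atomic_compare_exchange_strong(ops_state, &opss,
					      (unsigned long)OPSS_DISPATCHING);
}

int main(void)
{
	atomic_ulong st = (2UL << OPSS_QSEQ_SHIFT) | OPSS_QUEUED;

	/* stale qseq: the task went through another enqueue cycle, no claim */
	printf("%d\n", claim_for_dispatch(&st, 1UL << OPSS_QSEQ_SHIFT));
	/* matching qseq: claim succeeds, state becomes DISPATCHING */
	printf("%d\n", claim_for_dispatch(&st, 2UL << OPSS_QSEQ_SHIFT));
	return 0;
}
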
> +
> +static void flush_dispatch_buf(struct rq *rq, struct rq_flags *rf)
> +{
> +	struct scx_dsp_ctx *dspc = this_cpu_ptr(&scx_dsp_ctx);
> +	u32 u;
> +
> +	for (u = 0; u < dspc->buf_cursor; u++) {
> +		struct scx_dsp_buf_ent *ent = &this_cpu_ptr(scx_dsp_buf)[u];
> +
> +		finish_dispatch(rq, rf, ent->task, ent->qseq, ent->dsq_id,
> +				ent->enq_flags);
> +	}
> +
> +	dspc->nr_tasks += dspc->buf_cursor;
> +	dspc->buf_cursor = 0;
> +}
> +
> +static int balance_scx(struct rq *rq, struct task_struct *prev,
> +		       struct rq_flags *rf)
> +{
> +	struct scx_rq *scx_rq = &rq->scx;
> +	struct scx_dsp_ctx *dspc = this_cpu_ptr(&scx_dsp_ctx);
> +	bool prev_on_scx = prev->sched_class == &ext_sched_class;
> +
> +	lockdep_assert_rq_held(rq);
> +
> +	if (prev_on_scx) {
> +		WARN_ON_ONCE(prev->scx.flags & SCX_TASK_BAL_KEEP);
> +		update_curr_scx(rq);
> +
> +		/*
> +		 * If @prev is runnable & has slice left, it has priority and
> +		 * fetching more just increases latency for the fetched tasks.
> +		 * Tell put_prev_task_scx() to put @prev on local_dsq.
> +		 *
> +		 * See scx_ops_disable_workfn() for the explanation on the
> +		 * disabling() test.
> +		 */
> +		if ((prev->scx.flags & SCX_TASK_QUEUED) &&
> +		    prev->scx.slice && !scx_ops_disabling()) {
> +			prev->scx.flags |= SCX_TASK_BAL_KEEP;
> +			return 1;
> +		}
> +	}
> +
> +	/* if there already are tasks to run, nothing to do */
> +	if (scx_rq->local_dsq.nr)
> +		return 1;
> +
> +	if (consume_dispatch_q(rq, rf, &scx_dsq_global))
> +		return 1;
> +
> +	if (!SCX_HAS_OP(dispatch))
> +		return 0;
> +
> +	dspc->rq = rq;
> +	dspc->rf = rf;
> +
> +	/*
> +	 * The dispatch loop. Because flush_dispatch_buf() may drop the rq lock,
> +	 * the local DSQ might still end up empty after a successful
> +	 * ops.dispatch(). If the local DSQ is empty even after ops.dispatch()
> +	 * produced some tasks, retry. The BPF scheduler may depend on this
> +	 * looping behavior to simplify its implementation.
> +	 */
> +	do {
> +		dspc->nr_tasks = 0;
> +
> +		SCX_CALL_OP(SCX_KF_DISPATCH, dispatch, cpu_of(rq),
> +			    prev_on_scx ? prev : NULL);
> +
> +		flush_dispatch_buf(rq, rf);
> +
> +		if (scx_rq->local_dsq.nr)
> +			return 1;
> +		if (consume_dispatch_q(rq, rf, &scx_dsq_global))
> +			return 1;
> +	} while (dspc->nr_tasks);
> +
> +	return 0;
> +}
> +
> +static void set_next_task_scx(struct rq *rq, struct task_struct *p, bool first)
> +{
> +	if (p->scx.flags & SCX_TASK_QUEUED) {
> +		WARN_ON_ONCE(atomic64_read(&p->scx.ops_state) != SCX_OPSS_NONE);
> +		dispatch_dequeue(&rq->scx, p);
> +	}
> +
> +	p->se.exec_start = rq_clock_task(rq);
> +}
> +
> +static void put_prev_task_scx(struct rq *rq, struct task_struct *p)
> +{
> +#ifndef CONFIG_SMP
> +	/*
> +	 * UP workaround.
> +	 *
> +	 * Because SCX may transfer tasks across CPUs during dispatch, dispatch
> +	 * is performed from its balance operation which isn't called in UP.
> +	 * Let's work around by calling it from the operations which come right
> +	 * after.
> +	 *
> +	 * 1. If the prev task is on SCX, pick_next_task() calls
> +	 *    .put_prev_task() right after. As .put_prev_task() is also called
> +	 *    from other places, we need to distinguish the calls which can be
> +	 *    done by looking at the previous task's state - if still queued or
> +	 *    dequeued with %SCX_DEQ_SLEEP, the caller must be pick_next_task().
> +	 *    This case is handled here.
> +	 *
> +	 * 2. If the prev task is not on SCX, the first following call into SCX
> +	 *    will be .pick_next_task(), which is covered by calling
> +	 *    balance_scx() from pick_next_task_scx().
> +	 *
> +	 * Note that we can't merge the first case into the second as
> +	 * balance_scx() must be called before the previous SCX task goes
> +	 * through put_prev_task_scx().
> +	 *
> +	 * As UP doesn't transfer tasks around, balance_scx() doesn't need @rf.
> +	 * Pass in %NULL.
> +	 */
> +	if (p->scx.flags & (SCX_TASK_QUEUED | SCX_TASK_DEQD_FOR_SLEEP))
> +		balance_scx(rq, p, NULL);
> +#endif
> +
> +	update_curr_scx(rq);
> +
> +	/*
> +	 * If we're being called from put_prev_task_balance(), balance_scx() may
> +	 * have decided that @p should keep running.
> +	 */
> +	if (p->scx.flags & SCX_TASK_BAL_KEEP) {
> +		p->scx.flags &= ~SCX_TASK_BAL_KEEP;
> +		dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
> +		return;
> +	}
> +
> +	if (p->scx.flags & SCX_TASK_QUEUED) {
> +		/*
> +		 * If @p has slice left and balance_scx() didn't tag it for
> +		 * keeping, @p is getting preempted by a higher priority
> +		 * scheduler class. Leave it at the head of the local DSQ.
> +		 */
> +		if (p->scx.slice && !scx_ops_disabling()) {
> +			dispatch_enqueue(&rq->scx.local_dsq, p, SCX_ENQ_HEAD);
> +			return;
> +		}
> +
> +		/*
> +		 * If we're in the pick_next_task path, balance_scx() should
> +		 * have already populated the local DSQ if there are any other
> +		 * available tasks. If empty, tell ops.enqueue() that @p is the
> +		 * only one available for this cpu. ops.enqueue() should put it
> +		 * on the local DSQ so that the subsequent pick_next_task_scx()
> +		 * can find the task unless it wants to trigger a separate
> +		 * follow-up scheduling event.
> +		 */
> +		if (list_empty(&rq->scx.local_dsq.fifo))
> +			do_enqueue_task(rq, p, SCX_ENQ_LAST | SCX_ENQ_LOCAL, -1);
> +		else
> +			do_enqueue_task(rq, p, 0, -1);
> +	}
> +}
> +
> +static struct task_struct *first_local_task(struct rq *rq)
> +{
> +	return list_first_entry_or_null(&rq->scx.local_dsq.fifo,
> +					struct task_struct, scx.dsq_node);
> +}
> +
> +static struct task_struct *pick_next_task_scx(struct rq *rq)
> +{
> +	struct task_struct *p;
> +
> +#ifndef CONFIG_SMP
> +	/* UP workaround - see the comment at the head of put_prev_task_scx() */
> +	if (unlikely(rq->curr->sched_class != &ext_sched_class))
> +		balance_scx(rq, rq->curr, NULL);
> +#endif
> +
> +	p = first_local_task(rq);
> +	if (!p)
> +		return NULL;
> +
> +	if (unlikely(!p->scx.slice)) {
> +		if (!scx_ops_disabling() && !scx_warned_zero_slice) {
> +			printk_deferred(KERN_WARNING "sched_ext: %s[%d] has zero slice in pick_next_task_scx()\n",
> +					p->comm, p->pid);
> +			scx_warned_zero_slice = true;
> +		}
> +		p->scx.slice = SCX_SLICE_DFL;
> +	}
> +
> +	set_next_task_scx(rq, p, true);
> +
> +	return p;
> +}
> +
> +#ifdef CONFIG_SMP
> +
> +static bool test_and_clear_cpu_idle(int cpu)
> +{
> +#ifdef CONFIG_SCHED_SMT
> +	/*
> +	 * SMT mask should be cleared whether we can claim @cpu or not. The SMT
> +	 * cluster is not wholly idle either way. This also prevents
> +	 * scx_pick_idle_cpu() from getting caught in an infinite loop.
> +	 */
> +	if (sched_smt_active()) {
> +		const struct cpumask *smt = cpu_smt_mask(cpu);
> +
> +		/*
> +		 * If offline, @cpu is not its own sibling and
> +		 * scx_pick_idle_cpu() can get caught in an infinite loop as
> +		 * @cpu is never cleared from idle_masks.smt. Ensure that @cpu
> +		 * is eventually cleared.
> +		 */
> +		if (cpumask_intersects(smt, idle_masks.smt))
> +			cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> +		else if (cpumask_test_cpu(cpu, idle_masks.smt))
> +			__cpumask_clear_cpu(cpu, idle_masks.smt);
> +	}
> +#endif
> +	return cpumask_test_and_clear_cpu(cpu, idle_masks.cpu);
> +}
> +
> +static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> +{
> +	int cpu;
> +
> +retry:
> +	if (sched_smt_active()) {
> +		cpu = cpumask_any_and_distribute(idle_masks.smt, cpus_allowed);
> +		if (cpu < nr_cpu_ids)
> +			goto found;
> +
> +		if (flags & SCX_PICK_IDLE_CORE)
> +			return -EBUSY;
> +	}
> +
> +	cpu = cpumask_any_and_distribute(idle_masks.cpu, cpus_allowed);
> +	if (cpu >= nr_cpu_ids)
> +		return -EBUSY;
> +
> +found:
> +	if (test_and_clear_cpu_idle(cpu))
> +		return cpu;
> +	else
> +		goto retry;
> +}
> +
> +static s32 scx_select_cpu_dfl(struct task_struct *p, s32 prev_cpu, u64 wake_flags)
> +{
> +	s32 cpu;
> +
> +	if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> +		scx_ops_error("built-in idle tracking is disabled");
> +		return prev_cpu;
> +	}
> +
> +	/*
> +	 * If WAKE_SYNC and the machine isn't fully saturated, wake up @p to the
> +	 * local DSQ of the waker.
> +	 */
> +	if ((wake_flags & SCX_WAKE_SYNC) && p->nr_cpus_allowed > 1 &&
> +	    !cpumask_empty(idle_masks.cpu) && !(current->flags & PF_EXITING)) {
> +		cpu = smp_processor_id();
> +		if (cpumask_test_cpu(cpu, p->cpus_ptr)) {
> +			p->scx.flags |= SCX_TASK_ENQ_LOCAL;
> +			return cpu;
> +		}
> +	}
> +
> +	if (p->nr_cpus_allowed == 1)
> +		return prev_cpu;
> +
> +	/*
> +	 * If CPU has SMT, any wholly idle CPU is likely a better pick than
> +	 * partially idle @prev_cpu.
> +	 */
> +	if (sched_smt_active()) {
> +		if (cpumask_test_cpu(prev_cpu, idle_masks.smt) &&
> +		    test_and_clear_cpu_idle(prev_cpu)) {
> +			p->scx.flags |= SCX_TASK_ENQ_LOCAL;
> +			return prev_cpu;
> +		}
> +
> +		cpu = scx_pick_idle_cpu(p->cpus_ptr, SCX_PICK_IDLE_CORE);
> +		if (cpu >= 0) {
> +			p->scx.flags |= SCX_TASK_ENQ_LOCAL;
> +			return cpu;
> +		}
> +	}
> +
> +	if (test_and_clear_cpu_idle(prev_cpu)) {
> +		p->scx.flags |= SCX_TASK_ENQ_LOCAL;
> +		return prev_cpu;
> +	}
> +
> +	cpu = scx_pick_idle_cpu(p->cpus_ptr, 0);
> +	if (cpu >= 0) {
> +		p->scx.flags |= SCX_TASK_ENQ_LOCAL;
> +		return cpu;
> +	}
> +
> +	return prev_cpu;
> +}
> +
> +static int select_task_rq_scx(struct task_struct *p, int prev_cpu, int wake_flags)
> +{
> +	if (SCX_HAS_OP(select_cpu)) {
> +		s32 cpu;
> +
> +		cpu = SCX_CALL_OP_RET(SCX_KF_REST, select_cpu, p, prev_cpu,
> +				      wake_flags);
> +		if (ops_cpu_valid(cpu)) {
> +			return cpu;
> +		} else {
> +			scx_ops_error("select_cpu returned invalid cpu %d", cpu);
> +			return prev_cpu;
> +		}
> +	} else {
> +		return scx_select_cpu_dfl(p, prev_cpu, wake_flags);
> +	}
> +}
> +
> +static void set_cpus_allowed_scx(struct task_struct *p,
> +				 struct affinity_context *ac)
> +{
> +	set_cpus_allowed_common(p, ac);
> +
> +	/*
> +	 * The effective cpumask is stored in @p->cpus_ptr which may temporarily
> +	 * differ from the configured one in @p->cpus_mask. Always tell the bpf
> +	 * scheduler the effective one.
> +	 *
> +	 * Fine-grained memory write control is enforced by BPF making the const
> +	 * designation pointless. Cast it away when calling the operation.
> +	 */
> +	if (SCX_HAS_OP(set_cpumask))
> +		SCX_CALL_OP(SCX_KF_REST, set_cpumask, p,
> +			    (struct cpumask *)p->cpus_ptr);
> +}
> +
> +static void reset_idle_masks(void)
> +{
> +	/* consider all cpus idle, should converge to the actual state quickly */
> +	cpumask_setall(idle_masks.cpu);
> +	cpumask_setall(idle_masks.smt);
> +}
> +
> +void __scx_update_idle(struct rq *rq, bool idle)
> +{
> +	int cpu = cpu_of(rq);
> +
> +	if (SCX_HAS_OP(update_idle)) {
> +		SCX_CALL_OP(SCX_KF_REST, update_idle, cpu_of(rq), idle);
> +		if (!static_branch_unlikely(&scx_builtin_idle_enabled))
> +			return;
> +	}
> +
> +	if (idle)
> +		cpumask_set_cpu(cpu, idle_masks.cpu);
> +	else
> +		cpumask_clear_cpu(cpu, idle_masks.cpu);
> +
> +#ifdef CONFIG_SCHED_SMT
> +	if (sched_smt_active()) {
> +		const struct cpumask *smt = cpu_smt_mask(cpu);
> +
> +		if (idle) {
> +			/*
> +			 * idle_masks.smt handling is racy but that's fine as
> +			 * it's only for optimization and self-correcting.
> +			 */
> +			for_each_cpu(cpu, smt) {
> +				if (!cpumask_test_cpu(cpu, idle_masks.cpu))
> +					return;
> +			}
> +			cpumask_or(idle_masks.smt, idle_masks.smt, smt);
> +		} else {
> +			cpumask_andnot(idle_masks.smt, idle_masks.smt, smt);
> +		}
> +	}
> +#endif
> +}
> +
> +#else /* !CONFIG_SMP */
> +
> +static bool test_and_clear_cpu_idle(int cpu) { return false; }
> +static s32 scx_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) { return -EBUSY; }
> +static void reset_idle_masks(void) {}
> +
> +#endif /* CONFIG_SMP */
> +
> +static void task_tick_scx(struct rq *rq, struct task_struct *curr, int queued)
> +{
> +	update_curr_scx(rq);
> +
> +	/*
> +	 * While disabling, always resched as we can't trust the slice
> +	 * management.
> +	 */
> +	if (scx_ops_disabling())
> +		curr->scx.slice = 0;
> +
> +	if (!curr->scx.slice)
> +		resched_curr(rq);
> +}
> +
> +static int scx_ops_prepare_task(struct task_struct *p, struct task_group *tg)
> +{
> +	int ret;
> +
> +	WARN_ON_ONCE(p->scx.flags & SCX_TASK_OPS_PREPPED);
> +
> +	if (SCX_HAS_OP(prep_enable)) {
> +		struct scx_enable_args args = { };
> +
> +		ret = SCX_CALL_OP_RET(SCX_KF_SLEEPABLE, prep_enable, p, &args);
> +		if (unlikely(ret)) {
> +			ret = ops_sanitize_err("prep_enable", ret);
> +			return ret;
> +		}
> +	}
> +
> +	p->scx.flags |= SCX_TASK_OPS_PREPPED;
> +	return 0;
> +}
> +
> +static void scx_ops_enable_task(struct task_struct *p)
> +{
> +	lockdep_assert_rq_held(task_rq(p));
> +	WARN_ON_ONCE(!(p->scx.flags & SCX_TASK_OPS_PREPPED));
> +
> +	if (SCX_HAS_OP(enable)) {
> +		struct scx_enable_args args = { };
> +		SCX_CALL_OP(SCX_KF_REST, enable, p, &args);
> +	}
> +	p->scx.flags &= ~SCX_TASK_OPS_PREPPED;
> +	p->scx.flags |= SCX_TASK_OPS_ENABLED;
> +}
> +
> +static void scx_ops_disable_task(struct task_struct *p)
> +{
> +	lockdep_assert_rq_held(task_rq(p));
> +
> +	if (p->scx.flags & SCX_TASK_OPS_PREPPED) {
> +		if (SCX_HAS_OP(cancel_enable)) {
> +			struct scx_enable_args args = { };
> +			SCX_CALL_OP(SCX_KF_REST, cancel_enable, p, &args);
> +		}
> +		p->scx.flags &= ~SCX_TASK_OPS_PREPPED;
> +	} else if (p->scx.flags & SCX_TASK_OPS_ENABLED) {
> +		if (SCX_HAS_OP(disable))
> +			SCX_CALL_OP(SCX_KF_REST, disable, p);
> +		p->scx.flags &= ~SCX_TASK_OPS_ENABLED;
> +	}
> +}
> +
> +static void set_task_scx_weight(struct task_struct *p)
> +{
> +	u32 weight = sched_prio_to_weight[p->static_prio - MAX_RT_PRIO];
> +
> +	p->scx.weight = sched_weight_to_cgroup(weight);
> +}
> +
> +/**
> + * refresh_scx_weight - Refresh a task's ext weight
> + * @p: task to refresh ext weight for
> + *
> + * @p->scx.weight carries the task's static priority in cgroup weight scale to
> + * enable easy access from the BPF scheduler. To keep it synchronized with the
> + * current task priority, this function should be called when a new task is
> + * created, priority is changed for a task on sched_ext, and a task is switched
> + * to sched_ext from other classes.
> + */
> +static void refresh_scx_weight(struct task_struct *p)
> +{
> +	lockdep_assert_rq_held(task_rq(p));
> +	set_task_scx_weight(p);
> +	if (SCX_HAS_OP(set_weight))
> +		SCX_CALL_OP(SCX_KF_REST, set_weight, p, p->scx.weight);
> +}
> +
> +void scx_pre_fork(struct task_struct *p)
> +{
> +	/*
> +	 * BPF scheduler enable/disable paths want to be able to iterate and
> +	 * update all tasks which can become complex when racing forks. As
> +	 * enable/disable are very cold paths, let's use a percpu_rwsem to
> +	 * exclude forks.
> +	 */
> +	percpu_down_read(&scx_fork_rwsem);
> +}
> +
> +int scx_fork(struct task_struct *p)
> +{
> +	percpu_rwsem_assert_held(&scx_fork_rwsem);
> +
> +	if (scx_enabled())
> +		return scx_ops_prepare_task(p, task_group(p));
> +	else
> +		return 0;
> +}
> +
> +void scx_post_fork(struct task_struct *p)
> +{
> +	if (scx_enabled()) {
> +		struct rq_flags rf;
> +		struct rq *rq;
> +
> +		rq = task_rq_lock(p, &rf);
> +		/*
> +		 * Set the weight manually before calling ops.enable() so that
> +		 * the scheduler doesn't see a stale value if they inspect the
> +		 * task struct. We'll invoke ops.set_weight() afterwards, as it
> +		 * would be odd to receive a callback on the task before we
> +		 * tell the scheduler that it's been fully enabled.
> +		 */
> +		set_task_scx_weight(p);
> +		scx_ops_enable_task(p);
> +		refresh_scx_weight(p);
> +		task_rq_unlock(rq, p, &rf);
> +	}
> +
> +	spin_lock_irq(&scx_tasks_lock);
> +	list_add_tail(&p->scx.tasks_node, &scx_tasks);
> +	spin_unlock_irq(&scx_tasks_lock);
> +
> +	percpu_up_read(&scx_fork_rwsem);
> +}
> +
> +void scx_cancel_fork(struct task_struct *p)
> +{
> +	if (scx_enabled())
> +		scx_ops_disable_task(p);
> +	percpu_up_read(&scx_fork_rwsem);
> +}
> +
> +void sched_ext_free(struct task_struct *p)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&scx_tasks_lock, flags);
> +	list_del_init(&p->scx.tasks_node);
> +	spin_unlock_irqrestore(&scx_tasks_lock, flags);
> +
> +	/*
> +	 * @p is off scx_tasks and wholly ours. scx_ops_enable()'s PREPPED ->
> +	 * ENABLED transitions can't race us. Disable ops for @p.
> +	 */
> +	if (p->scx.flags & (SCX_TASK_OPS_PREPPED | SCX_TASK_OPS_ENABLED)) {
> +		struct rq_flags rf;
> +		struct rq *rq;
> +
> +		rq = task_rq_lock(p, &rf);
> +		scx_ops_disable_task(p);
> +		task_rq_unlock(rq, p, &rf);
> +	}
> +}
> +
> +static void reweight_task_scx(struct rq *rq, struct task_struct *p, int newprio)
> +{
> +	refresh_scx_weight(p);
> +}
> +
> +static void prio_changed_scx(struct rq *rq, struct task_struct *p, int oldprio)
> +{
> +}
> +
> +static void switching_to_scx(struct rq *rq, struct task_struct *p)
> +{
> +	refresh_scx_weight(p);
> +
> +	/*
> +	 * set_cpus_allowed_scx() is not called while @p is associated with a
> +	 * different scheduler class. Keep the BPF scheduler up-to-date.
> +	 */
> +	if (SCX_HAS_OP(set_cpumask))
> +		SCX_CALL_OP(SCX_KF_REST, set_cpumask, p,
> +			    (struct cpumask *)p->cpus_ptr);
> +}
> +
> +static void wakeup_preempt_scx(struct rq *rq, struct task_struct *p, int wake_flags) {}
> +static void switched_to_scx(struct rq *rq, struct task_struct *p) {}
> +
> +/*
> + * Omitted operations:
> + *
> + * - wakeup_preempt: NOOP as it isn't useful in the wakeup path because the task
> + *   isn't tied to the CPU at that point.
> + *
> + * - migrate_task_rq: Unnecessary as task to cpu mapping is transient.
> + *
> + * - task_fork/dead: We need fork/dead notifications for all tasks regardless of
> + *   their current sched_class. Call them directly from sched core instead.
> + *
> + * - task_woken, switched_from: Unnecessary.
> + */
> +DEFINE_SCHED_CLASS(ext) = {
> +	.enqueue_task		= enqueue_task_scx,
> +	.dequeue_task		= dequeue_task_scx,
> +	.yield_task		= yield_task_scx,
> +	.yield_to_task		= yield_to_task_scx,
> +
> +	.wakeup_preempt		= wakeup_preempt_scx,
> +
> +	.pick_next_task		= pick_next_task_scx,
> +
> +	.put_prev_task		= put_prev_task_scx,
> +	.set_next_task          = set_next_task_scx,
> +
> +#ifdef CONFIG_SMP
> +	.balance		= balance_scx,
> +	.select_task_rq		= select_task_rq_scx,
> +	.set_cpus_allowed	= set_cpus_allowed_scx,
> +#endif
> +
> +	.task_tick		= task_tick_scx,
> +
> +	.switching_to		= switching_to_scx,
> +	.switched_to		= switched_to_scx,
> +	.reweight_task		= reweight_task_scx,
> +	.prio_changed		= prio_changed_scx,
> +
> +	.update_curr		= update_curr_scx,
> +
> +#ifdef CONFIG_UCLAMP_TASK
> +	.uclamp_enabled		= 0,
> +#endif
> +};
> +
> +static void init_dsq(struct scx_dispatch_q *dsq, u64 dsq_id)
> +{
> +	memset(dsq, 0, sizeof(*dsq));
> +
> +	raw_spin_lock_init(&dsq->lock);
> +	INIT_LIST_HEAD(&dsq->fifo);
> +	dsq->id = dsq_id;
> +}
> +
> +static struct scx_dispatch_q *create_dsq(u64 dsq_id, int node)
> +{
> +	struct scx_dispatch_q *dsq;
> +	int ret;
> +
> +	if (dsq_id & SCX_DSQ_FLAG_BUILTIN)
> +		return ERR_PTR(-EINVAL);
> +
> +	dsq = kmalloc_node(sizeof(*dsq), GFP_KERNEL, node);
> +	if (!dsq)
> +		return ERR_PTR(-ENOMEM);
> +
> +	init_dsq(dsq, dsq_id);
> +
> +	ret = rhashtable_insert_fast(&dsq_hash, &dsq->hash_node,
> +				     dsq_hash_params);
> +	if (ret) {
> +		kfree(dsq);
> +		return ERR_PTR(ret);
> +	}
> +	return dsq;
> +}
> +
> +static void free_dsq_irq_workfn(struct irq_work *irq_work)
> +{
> +	struct llist_node *to_free = llist_del_all(&dsqs_to_free);
> +	struct scx_dispatch_q *dsq, *tmp_dsq;
> +
> +	llist_for_each_entry_safe(dsq, tmp_dsq, to_free, free_node)
> +		kfree_rcu(dsq, rcu);
> +}
> +
> +static DEFINE_IRQ_WORK(free_dsq_irq_work, free_dsq_irq_workfn);
> +
> +static void destroy_dsq(u64 dsq_id)
> +{
> +	struct scx_dispatch_q *dsq;
> +	unsigned long flags;
> +
> +	rcu_read_lock();
> +
> +	dsq = rhashtable_lookup_fast(&dsq_hash, &dsq_id, dsq_hash_params);
> +	if (!dsq)
> +		goto out_unlock_rcu;
> +
> +	raw_spin_lock_irqsave(&dsq->lock, flags);
> +
> +	if (dsq->nr) {
> +		scx_ops_error("attempting to destroy in-use dsq 0x%016llx (nr=%u)",
> +			      dsq->id, dsq->nr);
> +		goto out_unlock_dsq;
> +	}
> +
> +	if (rhashtable_remove_fast(&dsq_hash, &dsq->hash_node, dsq_hash_params))
> +		goto out_unlock_dsq;
> +
> +	/*
> +	 * Mark dead by invalidating ->id to prevent dispatch_enqueue() from
> +	 * queueing more tasks. As this function can be called from anywhere,
> +	 * freeing is bounced through an irq work to avoid nesting RCU
> +	 * operations inside scheduler locks.
> +	 */
> +	dsq->id = SCX_DSQ_INVALID;
> +	llist_add(&dsq->free_node, &dsqs_to_free);
> +	irq_work_queue(&free_dsq_irq_work);
> +
> +out_unlock_dsq:
> +	raw_spin_unlock_irqrestore(&dsq->lock, flags);
> +out_unlock_rcu:
> +	rcu_read_unlock();
> +}
> +
> +/*
> + * Used by sched_fork() and __setscheduler_prio() to pick the matching
> + * sched_class. dl/rt are already handled.
> + */
> +bool task_should_scx(struct task_struct *p)
> +{
> +	if (!scx_enabled() || scx_ops_disabling())
> +		return false;
> +	return p->policy == SCHED_EXT;
> +}
> +
> +static void scx_ops_fallback_enqueue(struct task_struct *p, u64 enq_flags)
> +{
> +	if (enq_flags & SCX_ENQ_LAST)
> +		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
> +	else
> +		scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
> +}
> +
> +static void scx_ops_fallback_dispatch(s32 cpu, struct task_struct *prev) {}
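
Side note: these two fallbacks are essentially what the simplest possible
BPF scheduler would do anyway - everything into the global FIFO, letting
balance_scx() consume it. Mirrored on the BPF side it would be roughly the
following (same header assumptions as the earlier sketch; SCX_ENQ_LAST is
only delivered to ops.enqueue() when the scheduler sets SCX_OPS_ENQ_LAST):

SEC("struct_ops/enqueue")
void BPF_PROG(fifo_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* keep the CPU busy if @p is the last runnable task on it */
	if (enq_flags & SCX_ENQ_LAST)
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, enq_flags);
	else
		scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, SCX_SLICE_DFL, enq_flags);
}
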
> +
> +static void scx_ops_disable_workfn(struct kthread_work *work)
> +{
> +	struct scx_exit_info *ei = &scx_exit_info;
> +	struct scx_task_iter sti;
> +	struct task_struct *p;
> +	struct rhashtable_iter rht_iter;
> +	struct scx_dispatch_q *dsq;
> +	const char *reason;
> +	int i, kind;
> +
> +	kind = atomic_read(&scx_exit_kind);
> +	while (true) {
> +		/*
> +		 * NONE indicates that a new scx_ops has been registered since
> +		 * disable was scheduled - don't kill the new ops. DONE
> +		 * indicates that the ops has already been disabled.
> +		 */
> +		if (kind == SCX_EXIT_NONE || kind == SCX_EXIT_DONE)
> +			return;
> +		if (atomic_try_cmpxchg(&scx_exit_kind, &kind, SCX_EXIT_DONE))
> +			break;
> +	}
> +
> +	switch (kind) {
> +	case SCX_EXIT_UNREG:
> +		reason = "BPF scheduler unregistered";
> +		break;
> +	case SCX_EXIT_ERROR:
> +		reason = "runtime error";
> +		break;
> +	case SCX_EXIT_ERROR_BPF:
> +		reason = "scx_bpf_error";
> +		break;
> +	default:
> +		reason = "<UNKNOWN>";
> +	}
> +
> +	ei->kind = kind;
> +	strlcpy(ei->reason, reason, sizeof(ei->reason));
> +
> +	switch (scx_ops_set_enable_state(SCX_OPS_DISABLING)) {
> +	case SCX_OPS_DISABLED:
> +		pr_warn("sched_ext: ops error detected without ops (%s)\n",
> +			scx_exit_info.msg);
> +		WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_DISABLED) !=
> +			     SCX_OPS_DISABLING);
> +		return;
> +	case SCX_OPS_PREPPING:
> +		goto forward_progress_guaranteed;
> +	case SCX_OPS_DISABLING:
> +		/* shouldn't happen but handle it like ENABLING if it does */
> +		WARN_ONCE(true, "sched_ext: duplicate disabling instance?");
> +		fallthrough;
> +	case SCX_OPS_ENABLING:
> +	case SCX_OPS_ENABLED:
> +		break;
> +	}
> +
> +	/*
> +	 * DISABLING is set and ops was either ENABLING or ENABLED indicating
> +	 * that the ops and static branches are set.
> +	 *
> +	 * We must guarantee that all runnable tasks make forward progress
> +	 * without trusting the BPF scheduler. We can't grab any mutexes or
> +	 * rwsems as they might be held by tasks that the BPF scheduler is
> +	 * forgetting to run, which unfortunately also excludes toggling the
> +	 * static branches.
> +	 *
> +	 * Let's work around by overriding a couple ops and modifying behaviors
> +	 * based on the DISABLING state and then cycling the tasks through
> +	 * dequeue/enqueue to force global FIFO scheduling.
> +	 *
> +	 * a. ops.enqueue() and .dispatch() are overridden for simple global
> +	 *    FIFO scheduling.
> +	 *
> +	 * b. balance_scx() never sets %SCX_TASK_BAL_KEEP as the slice value
> +	 *    can't be trusted. Whenever a tick triggers, the running task is
> +	 *    rotated to the tail of the queue.
> +	 *
> +	 * c. pick_next_task() suppresses zero slice warning.
> +	 */
> +	scx_ops.enqueue = scx_ops_fallback_enqueue;
> +	scx_ops.dispatch = scx_ops_fallback_dispatch;
> +
> +	spin_lock_irq(&scx_tasks_lock);
> +	scx_task_iter_init(&sti);
> +	while ((p = scx_task_iter_next_filtered_locked(&sti))) {
> +		if (READ_ONCE(p->__state) != TASK_DEAD) {
> +			struct sched_enq_and_set_ctx ctx;
> +
> +			/* cycling deq/enq is enough, see above */
> +			sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
> +			sched_enq_and_set_task(&ctx);
> +		}
> +	}
> +	scx_task_iter_exit(&sti);
> +	spin_unlock_irq(&scx_tasks_lock);
> +
> +forward_progress_guaranteed:
> +	/*
> +	 * Here, every runnable task is guaranteed to make forward progress and
> +	 * we can safely use blocking synchronization constructs. Actually
> +	 * disable ops.
> +	 */
> +	mutex_lock(&scx_ops_enable_mutex);
> +
> +	/* avoid racing against fork */
> +	cpus_read_lock();
> +	percpu_down_write(&scx_fork_rwsem);
> +
> +	spin_lock_irq(&scx_tasks_lock);
> +	scx_task_iter_init(&sti);
> +	while ((p = scx_task_iter_next_filtered_locked(&sti))) {
> +		const struct sched_class *old_class = p->sched_class;
> +		struct sched_enq_and_set_ctx ctx;
> +		bool alive = READ_ONCE(p->__state) != TASK_DEAD;
> +
> +		sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE, &ctx);
> +
> +		p->scx.slice = min_t(u64, p->scx.slice, SCX_SLICE_DFL);
> +
> +		__setscheduler_prio(p, p->prio);
> +		if (alive)
> +			check_class_changing(task_rq(p), p, old_class);
> +
> +		sched_enq_and_set_task(&ctx);
> +
> +		if (alive)
> +			check_class_changed(task_rq(p), p, old_class, p->prio);
> +
> +		scx_ops_disable_task(p);
> +	}
> +	scx_task_iter_exit(&sti);
> +	spin_unlock_irq(&scx_tasks_lock);
> +
> +	/* no task is on scx, turn off all the switches and flush in-progress calls */
> +	static_branch_disable_cpuslocked(&__scx_ops_enabled);
> +	for (i = 0; i < SCX_NR_ONLINE_OPS; i++)
> +		static_branch_disable_cpuslocked(&scx_has_op[i]);
> +	static_branch_disable_cpuslocked(&scx_ops_enq_last);
> +	static_branch_disable_cpuslocked(&scx_ops_enq_exiting);
> +	static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
> +	synchronize_rcu();
> +
> +	percpu_up_write(&scx_fork_rwsem);
> +	cpus_read_unlock();
> +
> +	if (ei->kind >= SCX_EXIT_ERROR) {
> +		printk(KERN_ERR "sched_ext: BPF scheduler \"%s\" errored, disabling\n", scx_ops.name);
> +
> +		if (ei->msg[0] == '\0')
> +			printk(KERN_ERR "sched_ext: %s\n", ei->reason);
> +		else
> +			printk(KERN_ERR "sched_ext: %s (%s)\n", ei->reason, ei->msg);
> +
> +		stack_trace_print(ei->bt, ei->bt_len, 2);
> +	}
> +
> +	if (scx_ops.exit)
> +		SCX_CALL_OP(SCX_KF_UNLOCKED, exit, ei);
> +
> +	memset(&scx_ops, 0, sizeof(scx_ops));
> +
> +	rhashtable_walk_enter(&dsq_hash, &rht_iter);
> +	do {
> +		rhashtable_walk_start(&rht_iter);
> +
> +		while ((dsq = rhashtable_walk_next(&rht_iter)) && !IS_ERR(dsq))
> +			destroy_dsq(dsq->id);
> +
> +		rhashtable_walk_stop(&rht_iter);
> +	} while (dsq == ERR_PTR(-EAGAIN));
> +	rhashtable_walk_exit(&rht_iter);
> +
> +	free_percpu(scx_dsp_buf);
> +	scx_dsp_buf = NULL;
> +	scx_dsp_max_batch = 0;
> +
> +	mutex_unlock(&scx_ops_enable_mutex);
> +
> +	WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_DISABLED) !=
> +		     SCX_OPS_DISABLING);
> +}
> +
> +static DEFINE_KTHREAD_WORK(scx_ops_disable_work, scx_ops_disable_workfn);
> +
> +static void schedule_scx_ops_disable_work(void)
> +{
> +	struct kthread_worker *helper = READ_ONCE(scx_ops_helper);
> +
> +	/*
> +	 * We may be called spuriously before the first bpf_sched_ext_reg(). If
> +	 * scx_ops_helper isn't set up yet, there's nothing to do.
> +	 */
> +	if (helper)
> +		kthread_queue_work(helper, &scx_ops_disable_work);
> +}
> +
> +static void scx_ops_disable(enum scx_exit_kind kind)
> +{
> +	int none = SCX_EXIT_NONE;
> +
> +	if (WARN_ON_ONCE(kind == SCX_EXIT_NONE || kind == SCX_EXIT_DONE))
> +		kind = SCX_EXIT_ERROR;
> +
> +	atomic_try_cmpxchg(&scx_exit_kind, &none, kind);
> +
> +	schedule_scx_ops_disable_work();
> +}
> +
> +static void scx_ops_error_irq_workfn(struct irq_work *irq_work)
> +{
> +	schedule_scx_ops_disable_work();
> +}
> +
> +static DEFINE_IRQ_WORK(scx_ops_error_irq_work, scx_ops_error_irq_workfn);
> +
> +__printf(2, 3) static void scx_ops_error_kind(enum scx_exit_kind kind,
> +					      const char *fmt, ...)
> +{
> +	struct scx_exit_info *ei = &scx_exit_info;
> +	int none = SCX_EXIT_NONE;
> +	va_list args;
> +
> +	if (!atomic_try_cmpxchg(&scx_exit_kind, &none, kind))
> +		return;
> +
> +	ei->bt_len = stack_trace_save(ei->bt, ARRAY_SIZE(ei->bt), 1);
> +
> +	va_start(args, fmt);
> +	vscnprintf(ei->msg, ARRAY_SIZE(ei->msg), fmt, args);
> +	va_end(args);
> +
> +	irq_work_queue(&scx_ops_error_irq_work);
> +}
> +
> +static struct kthread_worker *scx_create_rt_helper(const char *name)
> +{
> +	struct kthread_worker *helper;
> +
> +	helper = kthread_create_worker(0, name);
> +	if (helper)
> +		sched_set_fifo(helper->task);
> +	return helper;
> +}
> +
> +static int scx_ops_enable(struct sched_ext_ops *ops)
> +{
> +	struct scx_task_iter sti;
> +	struct task_struct *p;
> +	int i, ret;
> +
> +	mutex_lock(&scx_ops_enable_mutex);
> +
> +	if (!scx_ops_helper) {
> +		WRITE_ONCE(scx_ops_helper,
> +			   scx_create_rt_helper("sched_ext_ops_helper"));
> +		if (!scx_ops_helper) {
> +			ret = -ENOMEM;
> +			goto err_unlock;
> +		}
> +	}
> +
> +	if (scx_ops_enable_state() != SCX_OPS_DISABLED) {
> +		ret = -EBUSY;
> +		goto err_unlock;
> +	}
> +
> +	/*
> +	 * Set scx_ops, transition to PREPPING and clear exit info to arm the
> +	 * disable path. Failure triggers full disabling from here on.
> +	 */
> +	scx_ops = *ops;
> +
> +	WARN_ON_ONCE(scx_ops_set_enable_state(SCX_OPS_PREPPING) !=
> +		     SCX_OPS_DISABLED);
> +
> +	memset(&scx_exit_info, 0, sizeof(scx_exit_info));
> +	atomic_set(&scx_exit_kind, SCX_EXIT_NONE);
> +	scx_warned_zero_slice = false;
> +
> +	/*
> +	 * Keep CPUs stable during enable so that the BPF scheduler can track
> +	 * online CPUs by watching ->on/offline_cpu() after ->init().
> +	 */
> +	cpus_read_lock();
> +
> +	if (scx_ops.init) {
> +		ret = SCX_CALL_OP_RET(SCX_KF_INIT, init);
> +		if (ret) {
> +			ret = ops_sanitize_err("init", ret);
> +			goto err_disable;
> +		}
> +
> +		/*
> +		 * Exit early if ops.init() triggered scx_bpf_error(). Not
> +		 * strictly necessary as we'll fail transitioning into ENABLING
> +		 * later but that'd be after calling ops.prep_enable() on all
> +		 * tasks and with -EBUSY which isn't very intuitive. Let's exit
> +		 * early with success so that the condition is notified through
> +		 * ops.exit() like other scx_bpf_error() invocations.
> +		 */
> +		if (atomic_read(&scx_exit_kind) != SCX_EXIT_NONE)
> +			goto err_disable;
> +	}
> +
> +	WARN_ON_ONCE(scx_dsp_buf);
> +	scx_dsp_max_batch = ops->dispatch_max_batch ?: SCX_DSP_DFL_MAX_BATCH;
> +	scx_dsp_buf = __alloc_percpu(sizeof(scx_dsp_buf[0]) * scx_dsp_max_batch,
> +				     __alignof__(scx_dsp_buf[0]));
> +	if (!scx_dsp_buf) {
> +		ret = -ENOMEM;
> +		goto err_disable;
> +	}
> +
> +	/*
> +	 * Lock out forks before opening the floodgate so that they don't wander
> +	 * into the operations prematurely.
> +	 */
> +	percpu_down_write(&scx_fork_rwsem);
> +
> +	for (i = 0; i < SCX_NR_ONLINE_OPS; i++)
> +		if (((void (**)(void))ops)[i])
> +			static_branch_enable_cpuslocked(&scx_has_op[i]);
> +
> +	if (ops->flags & SCX_OPS_ENQ_LAST)
> +		static_branch_enable_cpuslocked(&scx_ops_enq_last);
> +
> +	if (ops->flags & SCX_OPS_ENQ_EXITING)
> +		static_branch_enable_cpuslocked(&scx_ops_enq_exiting);
> +
> +	if (!ops->update_idle || (ops->flags & SCX_OPS_KEEP_BUILTIN_IDLE)) {
> +		reset_idle_masks();
> +		static_branch_enable_cpuslocked(&scx_builtin_idle_enabled);
> +	} else {
> +		static_branch_disable_cpuslocked(&scx_builtin_idle_enabled);
> +	}
> +
> +	static_branch_enable_cpuslocked(&__scx_ops_enabled);
> +
> +	/*
> +	 * Enable ops for every task. Fork is excluded by scx_fork_rwsem
> +	 * preventing new tasks from being added. No need to exclude tasks
> +	 * leaving as sched_ext_free() can handle both prepped and enabled
> +	 * tasks. Prep all tasks first and then enable them with preemption
> +	 * disabled.
> +	 */
> +	spin_lock_irq(&scx_tasks_lock);
> +
> +	scx_task_iter_init(&sti);
> +	while ((p = scx_task_iter_next_filtered(&sti))) {
> +		get_task_struct(p);
> +		spin_unlock_irq(&scx_tasks_lock);
> +
> +		ret = scx_ops_prepare_task(p, task_group(p));
> +		if (ret) {
> +			put_task_struct(p);
> +			spin_lock_irq(&scx_tasks_lock);
> +			scx_task_iter_exit(&sti);
> +			spin_unlock_irq(&scx_tasks_lock);
> +			pr_err("sched_ext: ops.prep_enable() failed (%d) for %s[%d] while loading\n",
> +			       ret, p->comm, p->pid);
> +			goto err_disable_unlock;
> +		}
> +
> +		put_task_struct(p);
> +		spin_lock_irq(&scx_tasks_lock);
> +	}
> +	scx_task_iter_exit(&sti);
> +
> +	/*
> +	 * All tasks are prepped but are still ops-disabled. Ensure that
> +	 * %current can't be scheduled out and switch everyone.
> +	 * preempt_disable() is necessary because we can't guarantee that
> +	 * %current won't be starved if scheduled out while switching.
> +	 */
> +	preempt_disable();
> +
> +	/*
> +	 * From here on, the disable path must assume that tasks have ops
> +	 * enabled and need to be recovered.
> +	 */
> +	if (!scx_ops_tryset_enable_state(SCX_OPS_ENABLING, SCX_OPS_PREPPING)) {
> +		preempt_enable();
> +		spin_unlock_irq(&scx_tasks_lock);
> +		ret = -EBUSY;
> +		goto err_disable_unlock;
> +	}
> +
> +	/*
> +	 * We're fully committed and can't fail. The PREPPED -> ENABLED
> +	 * transitions here are synchronized against sched_ext_free() through
> +	 * scx_tasks_lock.
> +	 */
> +	scx_task_iter_init(&sti);
> +	while ((p = scx_task_iter_next_filtered_locked(&sti))) {
> +		if (READ_ONCE(p->__state) != TASK_DEAD) {
> +			const struct sched_class *old_class = p->sched_class;
> +			struct sched_enq_and_set_ctx ctx;
> +
> +			sched_deq_and_put_task(p, DEQUEUE_SAVE | DEQUEUE_MOVE,
> +					       &ctx);
> +
> +			scx_ops_enable_task(p);
> +			__setscheduler_prio(p, p->prio);
> +			check_class_changing(task_rq(p), p, old_class);
> +
> +			sched_enq_and_set_task(&ctx);
> +
> +			check_class_changed(task_rq(p), p, old_class, p->prio);
> +		} else {
> +			scx_ops_disable_task(p);
> +		}
> +	}
> +	scx_task_iter_exit(&sti);
> +
> +	spin_unlock_irq(&scx_tasks_lock);
> +	preempt_enable();
> +	percpu_up_write(&scx_fork_rwsem);
> +
> +	if (!scx_ops_tryset_enable_state(SCX_OPS_ENABLED, SCX_OPS_ENABLING)) {
> +		ret = -EBUSY;
> +		goto err_disable;
> +	}
> +
> +	cpus_read_unlock();
> +	mutex_unlock(&scx_ops_enable_mutex);
> +
> +	return 0;
> +
> +err_unlock:
> +	mutex_unlock(&scx_ops_enable_mutex);
> +	return ret;
> +
> +err_disable_unlock:
> +	percpu_up_write(&scx_fork_rwsem);
> +err_disable:
> +	cpus_read_unlock();
> +	mutex_unlock(&scx_ops_enable_mutex);
> +	/* must be fully disabled before returning */
> +	scx_ops_disable(SCX_EXIT_ERROR);
> +	kthread_flush_work(&scx_ops_disable_work);
> +	return ret;
> +}
> +
> +#ifdef CONFIG_SCHED_DEBUG
> +static const char *scx_ops_enable_state_str[] = {
> +	[SCX_OPS_PREPPING]	= "prepping",
> +	[SCX_OPS_ENABLING]	= "enabling",
> +	[SCX_OPS_ENABLED]	= "enabled",
> +	[SCX_OPS_DISABLING]	= "disabling",
> +	[SCX_OPS_DISABLED]	= "disabled",
> +};
> +
> +static int scx_debug_show(struct seq_file *m, void *v)
> +{
> +	mutex_lock(&scx_ops_enable_mutex);
> +	seq_printf(m, "%-30s: %s\n", "ops", scx_ops.name);
> +	seq_printf(m, "%-30s: %ld\n", "enabled", scx_enabled());
> +	seq_printf(m, "%-30s: %s\n", "enable_state",
> +		   scx_ops_enable_state_str[scx_ops_enable_state()]);
> +	mutex_unlock(&scx_ops_enable_mutex);
> +	return 0;
> +}
> +
> +static int scx_debug_open(struct inode *inode, struct file *file)
> +{
> +	return single_open(file, scx_debug_show, NULL);
> +}
> +
> +const struct file_operations sched_ext_fops = {
> +	.open		= scx_debug_open,
> +	.read		= seq_read,
> +	.llseek		= seq_lseek,
> +	.release	= single_release,
> +};
> +#endif
> +
> +/********************************************************************************
> + * bpf_struct_ops plumbing.
> + */
> +#include <linux/bpf_verifier.h>
> +#include <linux/bpf.h>
> +#include <linux/btf.h>
> +
> +extern struct btf *btf_vmlinux;
> +static const struct btf_type *task_struct_type;
> +
> +static bool bpf_scx_is_valid_access(int off, int size,
> +				    enum bpf_access_type type,
> +				    const struct bpf_prog *prog,
> +				    struct bpf_insn_access_aux *info)
> +{
> +	if (off < 0 || off >= sizeof(__u64) * MAX_BPF_FUNC_ARGS)
> +		return false;
> +	if (type != BPF_READ)
> +		return false;
> +	if (off % size != 0)
> +		return false;
> +
> +	return btf_ctx_access(off, size, type, prog, info);
> +}
> +
> +static int bpf_scx_btf_struct_access(struct bpf_verifier_log *log,
> +				     const struct bpf_reg_state *reg, int off,
> +				     int size)
> +{
> +	const struct btf_type *t;
> +
> +	t = btf_type_by_id(reg->btf, reg->btf_id);
> +	if (t == task_struct_type) {
> +		if (off >= offsetof(struct task_struct, scx.slice) &&
> +		    off + size <= offsetofend(struct task_struct, scx.slice))
> +			return SCALAR_VALUE;
> +	}
> +
> +	return -EACCES;
> +}
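
For readers new to the BPF side: this is what lets a BPF scheduler adjust a task's
slice in place. A minimal fragment, assuming @p is a trusted task pointer coming
straight from an ops callback argument:

	/* e.g. somewhere inside an ops callback */
	p->scx.slice = 5 * 1000 * 1000;		/* give @p a fresh 5ms slice */
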
> +
> +static const struct bpf_func_proto *
> +bpf_scx_get_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
> +{
> +	switch (func_id) {
> +	case BPF_FUNC_task_storage_get:
> +		return &bpf_task_storage_get_proto;
> +	case BPF_FUNC_task_storage_delete:
> +		return &bpf_task_storage_delete_proto;
> +	default:
> +		return bpf_base_func_proto(func_id);
> +	}
> +}
> +
> +const struct bpf_verifier_ops bpf_scx_verifier_ops = {
> +	.get_func_proto = bpf_scx_get_func_proto,
> +	.is_valid_access = bpf_scx_is_valid_access,
> +	.btf_struct_access = bpf_scx_btf_struct_access,
> +};
> +
> +static int bpf_scx_init_member(const struct btf_type *t,
> +			       const struct btf_member *member,
> +			       void *kdata, const void *udata)
> +{
> +	const struct sched_ext_ops *uops = udata;
> +	struct sched_ext_ops *ops = kdata;
> +	u32 moff = __btf_member_bit_offset(t, member) / 8;
> +	int ret;
> +
> +	switch (moff) {
> +	case offsetof(struct sched_ext_ops, dispatch_max_batch):
> +		if (*(u32 *)(udata + moff) > INT_MAX)
> +			return -E2BIG;
> +		ops->dispatch_max_batch = *(u32 *)(udata + moff);
> +		return 1;
> +	case offsetof(struct sched_ext_ops, flags):
> +		if (*(u64 *)(udata + moff) & ~SCX_OPS_ALL_FLAGS)
> +			return -EINVAL;
> +		ops->flags = *(u64 *)(udata + moff);
> +		return 1;
> +	case offsetof(struct sched_ext_ops, name):
> +		ret = bpf_obj_name_cpy(ops->name, uops->name,
> +				       sizeof(ops->name));
> +		if (ret < 0)
> +			return ret;
> +		if (ret == 0)
> +			return -EINVAL;
> +		return 1;
> +	}
> +
> +	return 0;
> +}
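
For context, these members are filled in from the BPF-side struct_ops map. A rough
sketch of what a scheduler would declare, using the standard libbpf
SEC(".struct_ops") convention (the callback names are placeholders):

SEC(".struct_ops")
struct sched_ext_ops example_ops = {
	.enqueue		= (void *)example_enqueue,	/* placeholder callbacks */
	.dispatch		= (void *)example_dispatch,
	.init			= (void *)example_init,
	.dispatch_max_batch	= 32,		/* checked against INT_MAX above */
	.flags			= 0,		/* must stay within SCX_OPS_ALL_FLAGS */
	.name			= "example",	/* copied via bpf_obj_name_cpy(), must be non-empty */
};
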
> +
> +static int bpf_scx_check_member(const struct btf_type *t,
> +				const struct btf_member *member,
> +				const struct bpf_prog *prog)
> +{
> +	u32 moff = __btf_member_bit_offset(t, member) / 8;
> +
> +	switch (moff) {
> +	case offsetof(struct sched_ext_ops, prep_enable):
> +	case offsetof(struct sched_ext_ops, init):
> +	case offsetof(struct sched_ext_ops, exit):
> +		break;
> +	default:
> +		if (prog->aux->sleepable)
> +			return -EINVAL;
> +	}
> +
> +	return 0;
> +}
> +
> +static int bpf_scx_reg(void *kdata)
> +{
> +	return scx_ops_enable(kdata);
> +}
> +
> +static void bpf_scx_unreg(void *kdata)
> +{
> +	scx_ops_disable(SCX_EXIT_UNREG);
> +	kthread_flush_work(&scx_ops_disable_work);
> +}
> +
> +static int bpf_scx_init(struct btf *btf)
> +{
> +	s32 type_id;
> +
> +	type_id = btf_find_by_name_kind(btf, "task_struct", BTF_KIND_STRUCT);
> +	if (type_id < 0)
> +		return -EINVAL;
> +	task_struct_type = btf_type_by_id(btf, type_id);
> +
> +	return 0;
> +}
> +
> +static int bpf_scx_update(void *kdata, void *old_kdata)
> +{
> +	/*
> +	 * sched_ext does not support updating the actively-loaded BPF
> +	 * scheduler, as registering a BPF scheduler can always fail if the
> +	 * scheduler returns an error code for e.g. ops.init(),
> +	 * ops.prep_enable(), etc. Similarly, we can always race with
> +	 * unregistration happening elsewhere, such as with sysrq.
> +	 */
> +	return -EOPNOTSUPP;
> +}
> +
> +static int bpf_scx_validate(void *kdata)
> +{
> +	return 0;
> +}
> +
> +/* "extern" to avoid sparse warning, only used in this file */
> +extern struct bpf_struct_ops bpf_sched_ext_ops;
> +
> +struct bpf_struct_ops bpf_sched_ext_ops = {
> +	.verifier_ops = &bpf_scx_verifier_ops,
> +	.reg = bpf_scx_reg,
> +	.unreg = bpf_scx_unreg,
> +	.check_member = bpf_scx_check_member,
> +	.init_member = bpf_scx_init_member,
> +	.init = bpf_scx_init,
> +	.update = bpf_scx_update,
> +	.validate = bpf_scx_validate,
> +	.name = "sched_ext_ops",
> +};
> +
> +void __init init_sched_ext_class(void)
> +{
> +	int cpu;
> +	u32 v;
> +
> +	/*
> +	 * The following is to prevent the compiler from optimizing out the enum
> +	 * definitions so that BPF scheduler implementations can use them
> +	 * through the generated vmlinux.h.
> +	 */
> +	WRITE_ONCE(v, SCX_WAKE_EXEC | SCX_ENQ_WAKEUP | SCX_DEQ_SLEEP);
> +
> +	BUG_ON(rhashtable_init(&dsq_hash, &dsq_hash_params));
> +	init_dsq(&scx_dsq_global, SCX_DSQ_GLOBAL);
> +#ifdef CONFIG_SMP
> +	BUG_ON(!alloc_cpumask_var(&idle_masks.cpu, GFP_KERNEL));
> +	BUG_ON(!alloc_cpumask_var(&idle_masks.smt, GFP_KERNEL));
> +#endif
> +	for_each_possible_cpu(cpu) {
> +		struct rq *rq = cpu_rq(cpu);
> +
> +		init_dsq(&rq->scx.local_dsq, SCX_DSQ_LOCAL);
> +	}
> +}
> +
> +/********************************************************************************
> + * Helpers that can be called from the BPF scheduler.
> + */
> +#include <linux/btf_ids.h>
> +
> +/* Disables missing prototype warnings for kfuncs */
> +__diag_push();
> +__diag_ignore_all("-Wmissing-prototypes",
> +		  "Global functions as their definitions will be in vmlinux BTF");
> +
> +/**
> + * scx_bpf_create_dsq - Create a custom DSQ
> + * @dsq_id: DSQ to create
> + * @node: NUMA node to allocate from
> + *
> + * Create a custom DSQ identified by @dsq_id. Can be called from ops.init() and
> + * ops.prep_enable().
> + */
> +s32 scx_bpf_create_dsq(u64 dsq_id, s32 node)
> +{
> +	if (!scx_kf_allowed(SCX_KF_INIT | SCX_KF_SLEEPABLE))
> +		return -EINVAL;
> +
> +	if (unlikely(node >= (int)nr_node_ids ||
> +		     (node < 0 && node != NUMA_NO_NODE)))
> +		return -EINVAL;
> +	return PTR_ERR_OR_ZERO(create_dsq(dsq_id, node));
> +}
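
As a usage sketch, a BPF scheduler would typically create its DSQs from ops.init(),
roughly as below (assumes the usual vmlinux.h + libbpf setup; MY_DSQ_ID is a
made-up constant):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

s32 scx_bpf_create_dsq(u64 dsq_id, s32 node) __ksym;	/* resolved against kernel BTF */

#define MY_DSQ_ID	0

SEC("struct_ops.s/init")		/* sleepable, matching bpf_scx_check_member() */
s32 BPF_PROG(example_init)
{
	return scx_bpf_create_dsq(MY_DSQ_ID, -1 /* NUMA_NO_NODE */);
}

char _license[] SEC("license") = "GPL";
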
> +
> +BTF_SET8_START(scx_kfunc_ids_sleepable)
> +BTF_ID_FLAGS(func, scx_bpf_create_dsq, KF_SLEEPABLE)
> +BTF_SET8_END(scx_kfunc_ids_sleepable)
> +
> +static const struct btf_kfunc_id_set scx_kfunc_set_sleepable = {
> +	.owner			= THIS_MODULE,
> +	.set			= &scx_kfunc_ids_sleepable,
> +};
> +
> +static bool scx_dispatch_preamble(struct task_struct *p, u64 enq_flags)
> +{
> +	if (!scx_kf_allowed(SCX_KF_ENQUEUE | SCX_KF_DISPATCH))
> +		return false;
> +
> +	lockdep_assert_irqs_disabled();
> +
> +	if (unlikely(!p)) {
> +		scx_ops_error("called with NULL task");
> +		return false;
> +	}
> +
> +	if (unlikely(enq_flags & __SCX_ENQ_INTERNAL_MASK)) {
> +		scx_ops_error("invalid enq_flags 0x%llx", enq_flags);
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +static void scx_dispatch_commit(struct task_struct *p, u64 dsq_id, u64 enq_flags)
> +{
> +	struct task_struct *ddsp_task;
> +	int idx;
> +
> +	ddsp_task = __this_cpu_read(direct_dispatch_task);
> +	if (ddsp_task) {
> +		direct_dispatch(ddsp_task, p, dsq_id, enq_flags);
> +		return;
> +	}
> +
> +	idx = __this_cpu_read(scx_dsp_ctx.buf_cursor);
> +	if (unlikely(idx >= scx_dsp_max_batch)) {
> +		scx_ops_error("dispatch buffer overflow");
> +		return;
> +	}
> +
> +	this_cpu_ptr(scx_dsp_buf)[idx] = (struct scx_dsp_buf_ent){
> +		.task = p,
> +		.qseq = atomic_long_read(&p->scx.ops_state) & SCX_OPSS_QSEQ_MASK,
> +		.dsq_id = dsq_id,
> +		.enq_flags = enq_flags,
> +	};
> +	__this_cpu_inc(scx_dsp_ctx.buf_cursor);
> +}
> +
> +/**
> + * scx_bpf_dispatch - Dispatch a task into the FIFO queue of a DSQ
> + * @p: task_struct to dispatch
> + * @dsq_id: DSQ to dispatch to
> + * @slice: duration @p can run for in nsecs
> + * @enq_flags: SCX_ENQ_*
> + *
> + * Dispatch @p into the FIFO queue of the DSQ identified by @dsq_id. It is safe
> + * to call this function spuriously. Can be called from ops.enqueue() and
> + * ops.dispatch().
> + *
> + * When called from ops.enqueue(), it's for direct dispatch and @p must match
> + * the task being enqueued. Also, %SCX_DSQ_LOCAL_ON can't be used to target the
> + * local DSQ of a CPU other than the enqueueing one. Use ops.select_cpu() to be
> + * on the target CPU in the first place.
> + *
> + * When called from ops.dispatch(), there are no restrictions on @p or @dsq_id
> + * and this function can be called up to ops.dispatch_max_batch times to dispatch
> + * multiple tasks. scx_bpf_dispatch_nr_slots() returns the number of remaining
> + * slots. scx_bpf_consume() flushes the batch and resets the counter.
> + *
> + * This function doesn't have any locking restrictions and may be called under
> + * BPF locks (in the future when BPF introduces more flexible locking).
> + *
> + * @p is allowed to run for @slice. The scheduling path is triggered on slice
> + * exhaustion. If zero, the current residual slice is maintained. If
> + * %SCX_SLICE_INF, @p never expires and the BPF scheduler must kick the CPU with
> + * scx_bpf_kick_cpu() to trigger scheduling.
> + */
> +void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
> +		      u64 enq_flags)
> +{
> +	if (!scx_dispatch_preamble(p, enq_flags))
> +		return;
> +
> +	if (slice)
> +		p->scx.slice = slice;
> +	else
> +		p->scx.slice = p->scx.slice ?: 1;
> +
> +	scx_dispatch_commit(p, dsq_id, enq_flags);
> +}
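
Continuing the earlier sketch, a minimal ops.enqueue() built on top of this could
look roughly like the following (assumes SCX_DSQ_GLOBAL is visible through
vmlinux.h and the kfunc is declared with __ksym):

void scx_bpf_dispatch(struct task_struct *p, u64 dsq_id, u64 slice,
		      u64 enq_flags) __ksym;

SEC("struct_ops/enqueue")
void BPF_PROG(example_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* FIFO everything through the global DSQ with a 20ms slice */
	scx_bpf_dispatch(p, SCX_DSQ_GLOBAL, 20 * 1000 * 1000ULL, enq_flags);
}
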
> +
> +BTF_SET8_START(scx_kfunc_ids_enqueue_dispatch)
> +BTF_ID_FLAGS(func, scx_bpf_dispatch, KF_RCU)
> +BTF_SET8_END(scx_kfunc_ids_enqueue_dispatch)
> +
> +static const struct btf_kfunc_id_set scx_kfunc_set_enqueue_dispatch = {
> +	.owner			= THIS_MODULE,
> +	.set			= &scx_kfunc_ids_enqueue_dispatch,
> +};
> +
> +/**
> + * scx_bpf_dispatch_nr_slots - Return the number of remaining dispatch slots
> + *
> + * Can only be called from ops.dispatch().
> + */
> +u32 scx_bpf_dispatch_nr_slots(void)
> +{
> +	if (!scx_kf_allowed(SCX_KF_DISPATCH))
> +		return 0;
> +
> +	return scx_dsp_max_batch - __this_cpu_read(scx_dsp_ctx.buf_cursor);
> +}
> +
> +/**
> + * scx_bpf_consume - Transfer a task from a DSQ to the current CPU's local DSQ
> + * @dsq_id: DSQ to consume
> + *
> + * Consume a task from the non-local DSQ identified by @dsq_id and transfer it
> + * to the current CPU's local DSQ for execution. Can only be called from
> + * ops.dispatch().
> + *
> + * This function flushes the in-flight dispatches from scx_bpf_dispatch() before
> + * trying to consume the specified DSQ. It may also grab rq locks and thus can't
> + * be called under any BPF locks.
> + *
> + * Returns %true if a task has been consumed, %false if there isn't any task to
> + * consume.
> + */
> +bool scx_bpf_consume(u64 dsq_id)
> +{
> +	struct scx_dsp_ctx *dspc = this_cpu_ptr(&scx_dsp_ctx);
> +	struct scx_dispatch_q *dsq;
> +
> +	if (!scx_kf_allowed(SCX_KF_DISPATCH))
> +		return false;
> +
> +	flush_dispatch_buf(dspc->rq, dspc->rf);
> +
> +	dsq = find_non_local_dsq(dsq_id);
> +	if (unlikely(!dsq)) {
> +		scx_ops_error("invalid DSQ ID 0x%016llx", dsq_id);
> +		return false;
> +	}
> +
> +	if (consume_dispatch_q(dspc->rq, dspc->rf, dsq)) {
> +		/*
> +		 * A successfully consumed task can be dequeued before it starts
> +		 * running while the CPU is trying to migrate other dispatched
> +		 * tasks. Bump nr_tasks to tell balance_scx() to retry on empty
> +		 * local DSQ.
> +		 */
> +		dspc->nr_tasks++;
> +		return true;
> +	} else {
> +		return false;
> +	}
> +}
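
The matching ops.dispatch() side can then be as simple as the sketch below
(MY_DSQ_ID being the hypothetical custom DSQ created earlier):

bool scx_bpf_consume(u64 dsq_id) __ksym;

SEC("struct_ops/dispatch")
void BPF_PROG(example_dispatch, s32 cpu, struct task_struct *prev)
{
	/* move one queued task from the custom DSQ onto this CPU's local DSQ */
	scx_bpf_consume(MY_DSQ_ID);
}
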
> +
> +BTF_SET8_START(scx_kfunc_ids_dispatch)
> +BTF_ID_FLAGS(func, scx_bpf_dispatch_nr_slots)
> +BTF_ID_FLAGS(func, scx_bpf_consume)
> +BTF_SET8_END(scx_kfunc_ids_dispatch)
> +
> +static const struct btf_kfunc_id_set scx_kfunc_set_dispatch = {
> +	.owner			= THIS_MODULE,
> +	.set			= &scx_kfunc_ids_dispatch,
> +};
> +
> +/**
> + * scx_bpf_dsq_nr_queued - Return the number of queued tasks
> + * @dsq_id: id of the DSQ
> + *
> + * Return the number of tasks in the DSQ matching @dsq_id. If not found,
> + * -%ENOENT is returned. Can be called from any non-sleepable online scx_ops
> + * operation.
> + */
> +s32 scx_bpf_dsq_nr_queued(u64 dsq_id)
> +{
> +	struct scx_dispatch_q *dsq;
> +
> +	lockdep_assert(rcu_read_lock_any_held());
> +
> +	if (dsq_id == SCX_DSQ_LOCAL) {
> +		return this_rq()->scx.local_dsq.nr;
> +	} else if ((dsq_id & SCX_DSQ_LOCAL_ON) == SCX_DSQ_LOCAL_ON) {
> +		s32 cpu = dsq_id & SCX_DSQ_LOCAL_CPU_MASK;
> +
> +		if (ops_cpu_valid(cpu))
> +			return cpu_rq(cpu)->scx.local_dsq.nr;
> +	} else {
> +		dsq = find_non_local_dsq(dsq_id);
> +		if (dsq)
> +			return dsq->nr;
> +	}
> +	return -ENOENT;
> +}
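
e.g. a dispatch path might use it to decide whether consuming is worthwhile at all
(fragment, same hypothetical MY_DSQ_ID as above):

	if (scx_bpf_dsq_nr_queued(MY_DSQ_ID) > 0)
		scx_bpf_consume(MY_DSQ_ID);
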
> +
> +/**
> + * scx_bpf_test_and_clear_cpu_idle - Test and clear @cpu's idle state
> + * @cpu: cpu to test and clear idle for
> + *
> + * Returns %true if @cpu was idle and its idle state was successfully cleared.
> + * %false otherwise.
> + *
> + * Unavailable if ops.update_idle() is implemented and
> + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
> + */
> +bool scx_bpf_test_and_clear_cpu_idle(s32 cpu)
> +{
> +	if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> +		scx_ops_error("built-in idle tracking is disabled");
> +		return false;
> +	}
> +
> +	if (ops_cpu_valid(cpu))
> +		return test_and_clear_cpu_idle(cpu);
> +	else
> +		return false;
> +}
> +
> +/**
> + * scx_bpf_pick_idle_cpu - Pick and claim an idle cpu
> + * @cpus_allowed: Allowed cpumask
> + * @flags: %SCX_PICK_IDLE_CPU_* flags
> + *
> + * Pick and claim an idle cpu in @cpus_allowed. Returns the picked idle cpu
> + * number on success. -%EBUSY if no matching cpu was found.
> + *
> + * Idle CPU tracking may race against CPU scheduling state transitions. For
> + * example, this function may return -%EBUSY as CPUs are transitioning into the
> + * idle state. If the caller then assumes that there will be dispatch events on
> + * the CPUs as they were all busy, the scheduler may end up stalling with CPUs
> + * idling while there are pending tasks. Use scx_bpf_pick_any_cpu() and
> + * scx_bpf_kick_cpu() to guarantee that there will be at least one dispatch
> + * event in the near future.
> + *
> + * Unavailable if ops.update_idle() is implemented and
> + * %SCX_OPS_KEEP_BUILTIN_IDLE is not set.
> + */
> +s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags)
> +{
> +	if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> +		scx_ops_error("built-in idle tracking is disabled");
> +		return -EBUSY;
> +	}
> +
> +	return scx_pick_idle_cpu(cpus_allowed, flags);
> +}
> +
> +/**
> + * scx_bpf_pick_any_cpu - Pick and claim an idle cpu if available or pick any CPU
> + * @cpus_allowed: Allowed cpumask
> + * @flags: %SCX_PICK_IDLE_CPU_* flags
> + *
> + * Pick and claim an idle cpu in @cpus_allowed. If none is available, pick any
> + * CPU in @cpus_allowed. Guaranteed to succeed and returns the picked CPU
> + * number if @cpus_allowed is not empty. -%EBUSY is returned if @cpus_allowed is
> + * empty.
> + *
> + * If ops.update_idle() is implemented and %SCX_OPS_KEEP_BUILTIN_IDLE is not
> + * set, this function can't tell which CPUs are idle and will always pick any
> + * CPU.
> + */
> +s32 scx_bpf_pick_any_cpu(const struct cpumask *cpus_allowed, u64 flags)
> +{
> +	s32 cpu;
> +
> +	if (static_branch_likely(&scx_builtin_idle_enabled)) {
> +		cpu = scx_pick_idle_cpu(cpus_allowed, flags);
> +		if (cpu >= 0)
> +			return cpu;
> +	}
> +
> +	cpu = cpumask_any_distribute(cpus_allowed);
> +	if (cpu < nr_cpu_ids)
> +		return cpu;
> +	else
> +		return -EBUSY;
> +}
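
A typical ops.select_cpu() would combine these along the following lines (a sketch;
assumes p->cpus_ptr can be passed as the allowed mask, the kfuncs being KF_RCU):

s32 scx_bpf_pick_idle_cpu(const struct cpumask *cpus_allowed, u64 flags) __ksym;

SEC("struct_ops/select_cpu")
s32 BPF_PROG(example_select_cpu, struct task_struct *p, s32 prev_cpu, u64 wake_flags)
{
	s32 cpu = scx_bpf_pick_idle_cpu(p->cpus_ptr, 0);

	/* no idle CPU claimed, fall back to the previous CPU */
	return cpu >= 0 ? cpu : prev_cpu;
}
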
> +
> +/**
> + * scx_bpf_get_idle_cpumask - Get a referenced kptr to the idle-tracking
> + * per-CPU cpumask.
> + *
> + * Returns an empty cpumask if idle tracking is not enabled or on a UP kernel.
> + */
> +const struct cpumask *scx_bpf_get_idle_cpumask(void)
> +{
> +	if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> +		scx_ops_error("built-in idle tracking is disabled");
> +		return cpu_none_mask;
> +	}
> +
> +#ifdef CONFIG_SMP
> +	return idle_masks.cpu;
> +#else
> +	return cpu_none_mask;
> +#endif
> +}
> +
> +/**
> + * scx_bpf_get_idle_smtmask - Get a referenced kptr to the idle-tracking,
> + * per-physical-core cpumask. Can be used to determine if an entire physical
> + * core is free.
> + *
> + * Returns an empty cpumask if idle tracking is not enabled or on a UP kernel.
> + */
> +const struct cpumask *scx_bpf_get_idle_smtmask(void)
> +{
> +	if (!static_branch_likely(&scx_builtin_idle_enabled)) {
> +		scx_ops_error("built-in idle tracking is disabled");
> +		return cpu_none_mask;
> +	}
> +
> +#ifdef CONFIG_SMP
> +	if (sched_smt_active())
> +		return idle_masks.smt;
> +	else
> +		return idle_masks.cpu;
> +#else
> +	return cpu_none_mask;
> +#endif
> +}
> +
> +/**
> + * scx_bpf_put_idle_cpumask - Release a previously acquired referenced kptr to
> + * either the per-CPU or SMT idle-tracking cpumask.
> + */
> +void scx_bpf_put_idle_cpumask(const struct cpumask *idle_mask)
> +{
> +	/*
> +	 * Empty function body because we aren't actually acquiring or
> +	 * releasing a reference to a global idle cpumask, which is read-only
> +	 * in the caller and is never released. The acquire / release semantics
> +	 * here are just used to ensure the cpumask is a trusted pointer in the
> +	 * caller.
> +	 */
> +}
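
Usage-wise the pair behaves like any other acquire/release kptr (fragment):

	const struct cpumask *idle = scx_bpf_get_idle_cpumask();

	/* ... inspect @idle, e.g. via the bpf_cpumask_* kfuncs ... */

	scx_bpf_put_idle_cpumask(idle);		/* the KF_ACQUIRE reference must be released */
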
> +
> +struct scx_bpf_error_bstr_bufs {
> +	u64			data[MAX_BPRINTF_VARARGS];
> +	char			msg[SCX_EXIT_MSG_LEN];
> +};
> +
> +static DEFINE_PER_CPU(struct scx_bpf_error_bstr_bufs, scx_bpf_error_bstr_bufs);
> +
> +/**
> + * scx_bpf_error_bstr - Indicate fatal error
> + * @fmt: error message format string
> + * @data: format string parameters packaged using ___bpf_fill() macro
> + * @data__sz: @data len, must end in '__sz' for the verifier
> + *
> + * Indicate that the BPF scheduler encountered a fatal error and initiate ops
> + * disabling.
> + */
> +void scx_bpf_error_bstr(char *fmt, unsigned long long *data, u32 data__sz)
> +{
> +	struct bpf_bprintf_data bprintf_data = { .get_bin_args = true };
> +	struct scx_bpf_error_bstr_bufs *bufs;
> +	unsigned long flags;
> +	int ret;
> +
> +	local_irq_save(flags);
> +	bufs = this_cpu_ptr(&scx_bpf_error_bstr_bufs);
> +
> +	if (data__sz % 8 || data__sz > MAX_BPRINTF_VARARGS * 8 ||
> +	    (data__sz && !data)) {
> +		scx_ops_error("invalid data=%p and data__sz=%u",
> +			      (void *)data, data__sz);
> +		goto out_restore;
> +	}
> +
> +	ret = copy_from_kernel_nofault(bufs->data, data, data__sz);
> +	if (ret) {
> +		scx_ops_error("failed to read data fields (%d)", ret);
> +		goto out_restore;
> +	}
> +
> +	ret = bpf_bprintf_prepare(fmt, UINT_MAX, bufs->data, data__sz / 8,
> +				  &bprintf_data);
> +	if (ret < 0) {
> +		scx_ops_error("format string preparation failed (%d)", ret);
> +		goto out_restore;
> +	}
> +
> +	ret = bstr_printf(bufs->msg, sizeof(bufs->msg), fmt,
> +			  bprintf_data.bin_args);
> +	bpf_bprintf_cleanup(&bprintf_data);
> +	if (ret < 0) {
> +		scx_ops_error("scx_ops_error(\"%s\", %p, %u) failed to format",
> +			      fmt, data, data__sz);
> +		goto out_restore;
> +	}
> +
> +	scx_ops_error_kind(SCX_EXIT_ERROR_BPF, "%s", bufs->msg);
> +out_restore:
> +	local_irq_restore(flags);
> +}
> +
> +/**
> + * scx_bpf_destroy_dsq - Destroy a custom DSQ
> + * @dsq_id: DSQ to destroy
> + *
> + * Destroy the custom DSQ identified by @dsq_id. Only DSQs created with
> + * scx_bpf_create_dsq() can be destroyed. The caller must ensure that the DSQ is
> + * empty and no further tasks are dispatched to it. Ignored if called on a DSQ
> + * which doesn't exist. Can be called from any online scx_ops operations.
> + */
> +void scx_bpf_destroy_dsq(u64 dsq_id)
> +{
> +	destroy_dsq(dsq_id);
> +}
> +
> +/**
> + * scx_bpf_task_running - Is task currently running?
> + * @p: task of interest
> + */
> +bool scx_bpf_task_running(const struct task_struct *p)
> +{
> +	return task_rq(p)->curr == p;
> +}
> +
> +/**
> + * scx_bpf_task_cpu - CPU a task is currently associated with
> + * @p: task of interest
> + */
> +s32 scx_bpf_task_cpu(const struct task_struct *p)
> +{
> +	return task_cpu(p);
> +}
> +
> +BTF_SET8_START(scx_kfunc_ids_ops_only)
> +BTF_ID_FLAGS(func, scx_bpf_dsq_nr_queued)
> +BTF_ID_FLAGS(func, scx_bpf_test_and_clear_cpu_idle)
> +BTF_ID_FLAGS(func, scx_bpf_pick_idle_cpu, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_pick_any_cpu, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_destroy_dsq)
> +BTF_SET8_END(scx_kfunc_ids_ops_only)
> +
> +static const struct btf_kfunc_id_set scx_kfunc_set_ops_only = {
> +	.owner			= THIS_MODULE,
> +	.set			= &scx_kfunc_ids_ops_only,
> +};
> +
> +BTF_SET8_START(scx_kfunc_ids_any)
> +BTF_ID_FLAGS(func, scx_bpf_get_idle_cpumask, KF_ACQUIRE)
> +BTF_ID_FLAGS(func, scx_bpf_get_idle_smtmask, KF_ACQUIRE)
> +BTF_ID_FLAGS(func, scx_bpf_put_idle_cpumask, KF_RELEASE)
> +BTF_ID_FLAGS(func, scx_bpf_error_bstr, KF_TRUSTED_ARGS)
> +BTF_ID_FLAGS(func, scx_bpf_task_running, KF_RCU)
> +BTF_ID_FLAGS(func, scx_bpf_task_cpu, KF_RCU)
> +BTF_SET8_END(scx_kfunc_ids_any)
> +
> +static const struct btf_kfunc_id_set scx_kfunc_set_any = {
> +	.owner			= THIS_MODULE,
> +	.set			= &scx_kfunc_ids_any,
> +};
> +
> +__diag_pop();
> +
> +/*
> + * This can't be done from init_sched_ext_class() as register_btf_kfunc_id_set()
> + * needs most of the system to be up.
> + */
> +static int __init register_ext_kfuncs(void)
> +{
> +	int ret;
> +
> +	/*
> +	 * Some kfuncs are context-sensitive and can only be called from
> +	 * specific SCX ops. They are grouped into BTF sets accordingly.
> +	 * Unfortunately, BPF currently doesn't have a way of enforcing such
> +	 * restrictions. Eventually, the verifier should be able to enforce
> +	 * them. For now, register them all the same way and have each kfunc
> +	 * explicitly check its calling context using scx_kf_allowed().
> +	 */
> +	if ((ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
> +					     &scx_kfunc_set_sleepable)) ||
> +	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
> +					     &scx_kfunc_set_enqueue_dispatch)) ||
> +	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
> +					     &scx_kfunc_set_dispatch)) ||
> +	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
> +					     &scx_kfunc_set_ops_only)) ||
> +	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_STRUCT_OPS,
> +					     &scx_kfunc_set_any)) ||
> +	    (ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING,
> +					     &scx_kfunc_set_any))) {
> +		pr_err("sched_ext: failed to register kfunc sets (%d)\n", ret);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +__initcall(register_ext_kfuncs);
> diff --git a/kernel/sched/ext.h b/kernel/sched/ext.h
> index 6a93c4825339..753860e985ae 100644
> --- a/kernel/sched/ext.h
> +++ b/kernel/sched/ext.h
> @@ -1,11 +1,119 @@
>   /* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (c) 2022 Meta Platforms, Inc. and affiliates.
> + * Copyright (c) 2022 Tejun Heo <tj@...nel.org>
> + * Copyright (c) 2022 David Vernet <dvernet@...a.com>
> + */
> +enum scx_wake_flags {
> +	/* expose select WF_* flags as enums */
> +	SCX_WAKE_EXEC		= WF_EXEC,
> +	SCX_WAKE_FORK		= WF_FORK,
> +	SCX_WAKE_TTWU		= WF_TTWU,
> +	SCX_WAKE_SYNC		= WF_SYNC,
> +};
> +
> +enum scx_enq_flags {
> +	/* expose select ENQUEUE_* flags as enums */
> +	SCX_ENQ_WAKEUP		= ENQUEUE_WAKEUP,
> +	SCX_ENQ_HEAD		= ENQUEUE_HEAD,
> +
> +	/* high 32bits are SCX specific */
> +
> +	/*
> +	 * The task being enqueued is the only task available for the cpu. By
> +	 * default, ext core keeps executing such tasks but when
> +	 * %SCX_OPS_ENQ_LAST is specified, they're ops.enqueue()'d with
> +	 * %SCX_ENQ_LAST and %SCX_ENQ_LOCAL flags set.
> +	 *
> +	 * If the BPF scheduler wants to continue executing the task,
> +	 * ops.enqueue() should dispatch the task to %SCX_DSQ_LOCAL immediately.
> +	 * If the task gets queued on a different dsq or the BPF side, the BPF
> +	 * scheduler is responsible for triggering a follow-up scheduling event.
> +	 * Otherwise, execution may stall.
> +	 */
> +	SCX_ENQ_LAST		= 1LLU << 41,
> +
> +	/*
> +	 * A hint indicating that it's advisable to enqueue the task on the
> +	 * local dsq of the currently selected CPU. Currently used by
> +	 * select_cpu_dfl() and in conjunction with %SCX_ENQ_LAST.
> +	 */
> +	SCX_ENQ_LOCAL		= 1LLU << 42,
> +
> +	/* high 8 bits are internal */
> +	__SCX_ENQ_INTERNAL_MASK	= 0xffLLU << 56,
> +
> +	SCX_ENQ_CLEAR_OPSS	= 1LLU << 56,
> +};
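
In other words, with %SCX_OPS_ENQ_LAST in effect, ops.enqueue() is expected to
contain something along these lines to keep the CPU from stalling (fragment; the
20ms slice is just an example value):

	if (enq_flags & SCX_ENQ_LAST) {
		/* last runnable task on this CPU, keep it running locally */
		scx_bpf_dispatch(p, SCX_DSQ_LOCAL, 20 * 1000 * 1000ULL, enq_flags);
		return;
	}
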
> +
> +enum scx_deq_flags {
> +	/* expose select DEQUEUE_* flags as enums */
> +	SCX_DEQ_SLEEP		= DEQUEUE_SLEEP,
> +};
> +
> +enum scx_pick_idle_cpu_flags {
> +	SCX_PICK_IDLE_CORE	= 1LLU << 0,	/* pick a CPU whose SMT siblings are also idle */
> +};
>   
>   #ifdef CONFIG_SCHED_CLASS_EXT
> -#error "NOT IMPLEMENTED YET"
> +
> +struct sched_enq_and_set_ctx {
> +	struct task_struct	*p;
> +	int			queue_flags;
> +	bool			queued;
> +	bool			running;
> +};
> +
> +void sched_deq_and_put_task(struct task_struct *p, int queue_flags,
> +			    struct sched_enq_and_set_ctx *ctx);
> +void sched_enq_and_set_task(struct sched_enq_and_set_ctx *ctx);
> +
> +extern const struct sched_class ext_sched_class;
> +extern const struct bpf_verifier_ops bpf_sched_ext_verifier_ops;
> +extern const struct file_operations sched_ext_fops;
> +
> +DECLARE_STATIC_KEY_FALSE(__scx_ops_enabled);
> +#define scx_enabled()		static_branch_unlikely(&__scx_ops_enabled)
> +
> +static inline bool task_on_scx(const struct task_struct *p)
> +{
> +	return scx_enabled() && p->sched_class == &ext_sched_class;
> +}
> +
> +bool task_should_scx(struct task_struct *p);
> +void scx_pre_fork(struct task_struct *p);
> +int scx_fork(struct task_struct *p);
> +void scx_post_fork(struct task_struct *p);
> +void scx_cancel_fork(struct task_struct *p);
> +void init_sched_ext_class(void);
> +
> +static inline const struct sched_class *next_active_class(const struct sched_class *class)
> +{
> +	class++;
> +	if (!scx_enabled() && class == &ext_sched_class)
> +		class++;
> +	return class;
> +}
> +
> +#define for_active_class_range(class, _from, _to)				\
> +	for (class = (_from); class != (_to); class = next_active_class(class))
> +
> +#define for_each_active_class(class)						\
> +	for_active_class_range(class, __sched_class_highest, __sched_class_lowest)
> +
> +/*
> + * SCX requires a balance() call before every pick_next_task() call including
> + * when waking up from idle.
> + */
> +#define for_balance_class_range(class, prev_class, end_class)			\
> +	for_active_class_range(class, (prev_class) > &ext_sched_class ?		\
> +			       &ext_sched_class : (prev_class), (end_class))
> +
>   #else	/* CONFIG_SCHED_CLASS_EXT */
>   
>   #define scx_enabled()		false
>   
> +static inline bool task_on_scx(const struct task_struct *p) { return false; }
>   static inline void scx_pre_fork(struct task_struct *p) {}
>   static inline int scx_fork(struct task_struct *p) { return 0; }
>   static inline void scx_post_fork(struct task_struct *p) {}
> @@ -18,7 +126,13 @@ static inline void init_sched_ext_class(void) {}
>   #endif	/* CONFIG_SCHED_CLASS_EXT */
>   
>   #if defined(CONFIG_SCHED_CLASS_EXT) && defined(CONFIG_SMP)
> -#error "NOT IMPLEMENTED YET"
> +void __scx_update_idle(struct rq *rq, bool idle);
> +
> +static inline void scx_update_idle(struct rq *rq, bool idle)
> +{
> +	if (scx_enabled())
> +		__scx_update_idle(rq, idle);
> +}
>   #else
>   static inline void scx_update_idle(struct rq *rq, bool idle) {}
>   #endif
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 5215e3bd234a..e27545e5df0b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -174,6 +174,10 @@ static inline int idle_policy(int policy)
>   
>   static inline int normal_policy(int policy)
>   {
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +	if (policy == SCHED_EXT)
> +		return true;
> +#endif
>   	return policy == SCHED_NORMAL;
>   }
>   
> @@ -668,6 +672,15 @@ struct cfs_rq {
>   #endif /* CONFIG_FAIR_GROUP_SCHED */
>   };
>   
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +struct scx_rq {
> +	struct scx_dispatch_q	local_dsq;
> +	unsigned long		ops_qseq;
> +	u64			extra_enq_flags;	/* see move_task_to_local_dsq() */
> +	u32			nr_running;
> +};
> +#endif /* CONFIG_SCHED_CLASS_EXT */
> +
>   static inline int rt_bandwidth_enabled(void)
>   {
>   	return sysctl_sched_rt_runtime >= 0;
> @@ -1008,6 +1021,9 @@ struct rq {
>   	struct cfs_rq		cfs;
>   	struct rt_rq		rt;
>   	struct dl_rq		dl;
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +	struct scx_rq		scx;
> +#endif
>   
>   #ifdef CONFIG_FAIR_GROUP_SCHED
>   	/* list of leaf cfs_rq on this CPU: */
