Message-ID: <YW2Rjbr1FhyRZ1eB@geo.homenetwork>
Date: Mon, 18 Oct 2021 23:23:57 +0800
From: Tao Zhou <tao.zhou@...ux.dev>
To: Peter Oskolkov <posk@...k.io>
Cc: Peter Zijlstra <peterz@...radead.org>,
Ingo Molnar <mingo@...hat.com>,
Thomas Gleixner <tglx@...utronix.de>,
Andrew Morton <akpm@...ux-foundation.org>,
Dave Hansen <dave.hansen@...ux.intel.com>,
Andy Lutomirski <luto@...nel.org>, linux-mm@...ck.org,
linux-kernel@...r.kernel.org, linux-api@...r.kernel.org,
Paul Turner <pjt@...gle.com>, Ben Segall <bsegall@...gle.com>,
Peter Oskolkov <posk@...gle.com>,
Andrei Vagin <avagin@...gle.com>, Jann Horn <jannh@...gle.com>,
Thierry Delisle <tdelisle@...terloo.ca>,
Tao Zhou <tao.zhou@...ux.dev>
Subject: Re: [PATCH v0.7 3/5] sched/umcg: implement UMCG syscalls
On Tue, Oct 12, 2021 at 04:25:20PM -0700, Peter Oskolkov wrote:
> Define struct umcg_task and two syscalls: sys_umcg_ctl sys_umcg_wait.
>
> User Managed Concurrency Groups is an M:N threading toolkit that allows
> constructing user space schedulers designed to efficiently manage
> heterogeneous in-process workloads while maintaining high CPU
> utilization (95%+).
>
> In addition, M:N threading and cooperative user space scheduling
> enables synchronous coding style and better cache locality when
> compared to asynchronous callback/continuation style of programming.
>
> UMCG kernel API is built around the following ideas:
>
> * UMCG server: a task/thread representing "kernel threads", or (v)CPUs;
> * UMCG worker: a task/thread representing "application threads", to be
> scheduled over servers;
> * UMCG task state: (NONE), RUNNING, BLOCKED, IDLE: states a UMCG task (a
> server or a worker) can be in;
> * UMCG task state flag: LOCKED, PREEMPTED: additional state flags that
> can be ORed with the task state to communicate additional information to
> the kernel;
> * struct umcg_task: a per-task userspace set of data fields, usually
> residing in the TLS, that fully reflects the current task's UMCG state
> and controls the way the kernel manages the task;
> * sys_umcg_ctl(): a syscall used to register the current task/thread as a
> server or a worker, or to unregister a UMCG task;
> * sys_umcg_wait(): a syscall used to put the current task to sleep and/or
> wake another task, potentially context-switching between the two tasks
> on-CPU synchronously.
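
(An aside for readers new to the series: based on the uapi header and the
x86_64 syscall numbers added later in this patch, registering a task from
userspace boils down to roughly the sketch below. The raw __NR_* defines,
the syscall(2) wrapper and the TLS placement are my own shorthand, not part
of the patch, and error handling is omitted; a worker would additionally
need idle_workers_ptr/idle_server_tid_ptr pointing at valid locations
before registering.)

/* Sketch only: assumes the <linux/umcg.h> uapi header from this patch. */
#define _GNU_SOURCE
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/umcg.h>

#define __NR_umcg_ctl	449	/* x86_64 numbers from this patch */
#define __NR_umcg_wait	450

/* Usually placed in TLS; the struct is 64-byte aligned by definition. */
static __thread struct umcg_task umcg_self;

static long umcg_register(int worker)
{
	memset(&umcg_self, 0, sizeof(umcg_self));
	/* Registration requires state == RUNNING and next_tid == 0. */
	umcg_self.state_ts = UMCG_TASK_RUNNING;

	/* Workers must also set idle_workers_ptr/idle_server_tid_ptr here. */
	return syscall(__NR_umcg_ctl,
		       UMCG_CTL_REGISTER | (worker ? UMCG_CTL_WORKER : 0),
		       &umcg_self);
}
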
>
> In short, servers can be thought of as CPUs over which application
> threads (workers) are scheduled; at any one time a worker is either:
> - RUNNING: has a server and is schedulable by the kernel;
> - BLOCKED: blocked in the kernel (e.g. on I/O, or a futex);
> - IDLE: is not blocked, but cannot be scheduled by the kernel to
> run because it has no server assigned to it (e.g. because all
> available servers are busy "running" other workers).
>
> Usually the number of servers in a process is equal to the number of
> CPUs available to the kernel if the process is supposed to consume
> the whole machine, or less than the number of CPUs available if the
> process is sharing the machine with other workloads. The number of
> workers in a process can grow very large: tens of thousands is normal;
> hundreds of thousands and more (millions) is something that would
> be desirable to achieve in the future, as lightweight userspace
> threads in Java and Go easily scale to millions, and UMCG workers
> are (intended to be) conceptually similar to those.
>
> Detailed use cases and API behavior are provided in
> Documentation/userspace-api/umcg.[txt|rst] (see sibling patches).
>
> Some high-level implementation notes:
>
> UMCG tasks (workers and servers) are "tagged" with struct umcg_task
> residing in userspace (usually in TLS) to facilitate kernel/userspace
> communication. This makes the kernel-side code much simpler (see e.g.
> the implementation of sys_umcg_wait), but also requires some careful
> uaccess handling and page pinning (see below).
>
> The main UMCG server/worker interaction looks like:
>
> a. worker W1 is RUNNING, with a server S attached to it sleeping
> in IDLE state;
> b. worker W1 blocks in the kernel, e.g. on I/O;
> c. the kernel marks W1 as BLOCKED, the attached server S
> as RUNNING, and wakes S (the "block detection" event);
> d. the server now picks another IDLE worker W2 to run: marks
> W2 as RUNNING, itself as IDLE, and calls sys_umcg_wait();
> e. when the blocking operation of W1 completes, the worker
> is marked by the kernel as IDLE and added to idle workers list
> (see struct umcg_task) for the userspace to pick up and
> later run (the "wake detection" event).
>
> While there are additional operations such as worker-to-worker
> context switch, preemption, workers "yielding", etc., the "workflow"
> above is the main worker/server interaction that drives the
> implementation.
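
(Another aside: if I read this right, the workflow above maps onto a fairly
small server loop in userspace. A rough sketch of step d only;
umcg_pick_idle_worker() is a hypothetical helper, and the LOCKED/PREEMPTED
flags, timestamps and the cmpxchg-based races the real protocol needs are
all ignored here.)

/* Hypothetical helper: returns the chosen idle worker's umcg_task and tid. */
struct umcg_task *umcg_pick_idle_worker(uint32_t *tid);

static void server_loop(struct umcg_task *server, uint32_t server_tid)
{
	for (;;) {
		uint32_t worker_tid;
		struct umcg_task *worker = umcg_pick_idle_worker(&worker_tid);

		/* Worker W2: IDLE -> RUNNING, pointing back at its server. */
		worker->next_tid = server_tid;
		__atomic_store_n(&worker->state_ts, UMCG_TASK_RUNNING,
				 __ATOMIC_RELEASE);

		/* Server: RUNNING -> IDLE, then context-switch into W2. */
		server->next_tid = worker_tid;
		__atomic_store_n(&server->state_ts, UMCG_TASK_IDLE,
				 __ATOMIC_RELEASE);
		syscall(__NR_umcg_wait, 0, 0);

		/* We get back here once a block/wake event (steps c/e above)
		 * marks the server RUNNING again. */
	}
}
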
>
> Specifically:
>
> - most operations are conceptually context switches:
> - scheduling a worker: a running server goes to sleep and "runs"
> a worker in its place;
> - block detection: worker is descheduled, and its server is woken;
> - wake detection: woken worker, running in the kernel, is descheduled,
> and if there is an idle server, it is woken to process the wake
> detection event;
> - to facilitate low scheduling latencies and cache locality, most
> server/worker interactions described above are performed synchronously
> "on CPU" via WF_CURRENT_CPU flag passed to ttwu; while at the moment
> the context switches are simulated by putting the switch-out task to
> sleep and waking the switch-into task on the same cpu, it is very much
> the long-term goal of this project to make the context switch much
> lighter, by tweaking runtime accounting and, maybe, even bypassing
> __schedule();
> - worker blocking is detected in a hook to sched_submit_work; as mentioned
> above, the server is to be woken on the same CPU, synchronously;
> this code may not pagefault, so to access worker's and server's
> userspace memory (struct umcg_task), memory pages containing the worker's
> and the server's structs umcg_task are pinned when the worker is
> exiting to the userspace, and unpinned when the worker is descheduled;
> - worker wakeup is detected in a hook to sched_update_worker, and processed
> in the exit to usermode loop (via TIF_NOTIFY_RESUME); workers CAN
> pagefault on the wakeup path;
> - worker preemption is implemented by the userspace tagging the worker
> with UMCG_TF_PREEMPTED state flag and sending a NOOP signal to it;
> on the exit to usermode the worker is intercepted and its server is woken
> (see Documentation/userspace-api/umcg.[txt|rst] for more details);
> - each state change is tagged with a unique timestamp (of MONOTONIC
> variety), so that
> - scheduling instrumentation is naturally available;
> - racing state changes are easily detected and ABA issues are
> avoided;
> see umcg_update_state() in umcg.c for implementation details, and
> Documentation/userspace-api/umcg.[txt|rst] for a higher-level
> description.
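
A small observation on the timestamps, based on the defines in umcg.h below:
the timestamp takes 46 bits of state_ts at a granularity of 2^4 = 16 ns, so
it wraps after 2^46 * 16 ns = 2^50 ns, i.e. about 1.13e6 seconds or roughly
13 days. If my reading is right, the +1 bump in umcg_update_state() only
guarantees uniqueness for back-to-back changes, so two changes ~13 days
apart could in principle carry the same timestamp.
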
>
> The previous version of the patchset can be found at
> https://lore.kernel.org/all/20210917180323.278250-1-posk@google.com/
> containing some additional context and links to earlier discussions.
>
> More details are available in Documentation/userspace-api/umcg.[txt|rst]
> in sibling patches, and in doc-comments in the code.
>
> Signed-off-by: Peter Oskolkov <posk@...gle.com>
> ---
> arch/x86/entry/syscalls/syscall_64.tbl | 2 +
> fs/exec.c | 1 +
> include/linux/sched.h | 71 ++
> include/linux/syscalls.h | 3 +
> include/uapi/asm-generic/unistd.h | 6 +-
> include/uapi/linux/umcg.h | 137 ++++
> init/Kconfig | 10 +
> kernel/entry/common.c | 4 +-
> kernel/exit.c | 5 +
> kernel/sched/Makefile | 1 +
> kernel/sched/core.c | 9 +-
> kernel/sched/umcg.c | 926 +++++++++++++++++++++++++
> kernel/sys_ni.c | 4 +
> 13 files changed, 1175 insertions(+), 4 deletions(-)
> create mode 100644 include/uapi/linux/umcg.h
> create mode 100644 kernel/sched/umcg.c
>
> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
> index 18b5500ea8bf..cb71f383060f 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -370,6 +370,8 @@
> 446 common landlock_restrict_self sys_landlock_restrict_self
> 447 common memfd_secret sys_memfd_secret
> 448 common process_mrelease sys_process_mrelease
> +449 common umcg_ctl sys_umcg_ctl
> +450 common umcg_wait sys_umcg_wait
>
> #
> # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/fs/exec.c b/fs/exec.c
> index a098c133d8d7..dfa24bb99a97 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1840,6 +1840,7 @@ static int bprm_execve(struct linux_binprm *bprm,
> current->fs->in_exec = 0;
> current->in_execve = 0;
> rseq_execve(current);
> + umcg_execve(current);
> acct_update_integrals(current);
> task_numa_free(current, false);
> return retval;
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 343603f77f8b..c7e812ceec3c 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -67,6 +67,7 @@ struct sighand_struct;
> struct signal_struct;
> struct task_delay_info;
> struct task_group;
> +struct umcg_task;
>
> /*
> * Task state bitmask. NOTE! These bits are also
> @@ -1296,6 +1297,12 @@ struct task_struct {
> unsigned long rseq_event_mask;
> #endif
>
> +#ifdef CONFIG_UMCG
> + struct umcg_task __user *umcg_task;
> + struct page *pinned_umcg_worker_page; /* self */
> + struct page *pinned_umcg_server_page;
> +#endif
> +
> struct tlbflush_unmap_batch tlb_ubc;
>
> union {
> @@ -1688,6 +1695,13 @@ extern struct pid *cad_pid;
> #define PF_KTHREAD 0x00200000 /* I am a kernel thread */
> #define PF_RANDOMIZE 0x00400000 /* Randomize virtual address space */
> #define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
> +
> +#ifdef CONFIG_UMCG
> +#define PF_UMCG_WORKER 0x01000000 /* UMCG worker */
> +#else
> +#define PF_UMCG_WORKER 0x00000000
> +#endif
> +
> #define PF_NO_SETAFFINITY 0x04000000 /* Userland is not allowed to meddle with cpus_mask */
> #define PF_MCE_EARLY 0x08000000 /* Early kill for mce process policy */
> #define PF_MEMALLOC_PIN 0x10000000 /* Allocation context constrained to zones which allow long term pinning. */
> @@ -2275,6 +2289,63 @@ static inline void rseq_execve(struct task_struct *t)
>
> #endif
>
> +#ifdef CONFIG_UMCG
> +
> +void umcg_handle_resuming_worker(void);
> +void umcg_handle_exiting_worker(void);
> +void umcg_clear_child(struct task_struct *tsk);
> +
> +/* Called by bprm_execve() in fs/exec.c. */
> +static inline void umcg_execve(struct task_struct *tsk)
> +{
> + if (tsk->umcg_task)
> + umcg_clear_child(tsk);
> +}
> +
> +/* Called by exit_to_user_mode_loop() in kernel/entry/common.c.*/
> +static inline void umcg_handle_notify_resume(void)
> +{
> + if (current->flags & PF_UMCG_WORKER)
> + umcg_handle_resuming_worker();
> +}
> +
> +/* Called by do_exit() in kernel/exit.c. */
> +static inline void umcg_handle_exit(void)
> +{
> + if (current->flags & PF_UMCG_WORKER)
> + umcg_handle_exiting_worker();
> +}
> +
> +/*
> + * umcg_wq_worker_[sleeping|running] are called in core.c by
> + * sched_submit_work() and sched_update_worker().
> + */
> +void umcg_wq_worker_sleeping(struct task_struct *tsk);
> +void umcg_wq_worker_running(struct task_struct *tsk);
> +
> +#else /* CONFIG_UMCG */
> +
> +static inline void umcg_clear_child(struct task_struct *tsk)
> +{
> +}
> +static inline void umcg_execve(struct task_struct *tsk)
> +{
> +}
> +static inline void umcg_handle_notify_resume(void)
> +{
> +}
> +static inline void umcg_handle_exit(void)
> +{
> +}
> +static inline void umcg_wq_worker_sleeping(struct task_struct *tsk)
> +{
> +}
> +static inline void umcg_wq_worker_running(struct task_struct *tsk)
> +{
> +}
> +
> +#endif
> +
> #ifdef CONFIG_DEBUG_RSEQ
>
> void rseq_syscall(struct pt_regs *regs);
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index 252243c7783d..97a05879da41 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -71,6 +71,7 @@ struct open_how;
> struct mount_attr;
> struct landlock_ruleset_attr;
> enum landlock_rule_type;
> +struct umcg_task;
>
> #include <linux/types.h>
> #include <linux/aio_abi.h>
> @@ -1052,6 +1053,8 @@ asmlinkage long sys_landlock_add_rule(int ruleset_fd, enum landlock_rule_type ru
> const void __user *rule_attr, __u32 flags);
> asmlinkage long sys_landlock_restrict_self(int ruleset_fd, __u32 flags);
> asmlinkage long sys_memfd_secret(unsigned int flags);
> +asmlinkage long sys_umcg_ctl(u32 flags, struct umcg_task __user *self);
> +asmlinkage long sys_umcg_wait(u32 flags, u64 abs_timeout);
>
> /*
> * Architecture-specific system calls
> diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
> index 1c5fb86d455a..3e3d50de5137 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -879,9 +879,13 @@ __SYSCALL(__NR_memfd_secret, sys_memfd_secret)
> #endif
> #define __NR_process_mrelease 448
> __SYSCALL(__NR_process_mrelease, sys_process_mrelease)
> +#define __NR_umcg_ctl 449
> +__SYSCALL(__NR_umcg_ctl, sys_umcg_ctl)
> +#define __NR_umcg_wait 450
> +__SYSCALL(__NR_umcg_wait, sys_umcg_wait)
>
> #undef __NR_syscalls
> -#define __NR_syscalls 449
> +#define __NR_syscalls 451
>
> /*
> * 32 bit systems traditionally used different
> diff --git a/include/uapi/linux/umcg.h b/include/uapi/linux/umcg.h
> new file mode 100644
> index 000000000000..ce4c7980b837
> --- /dev/null
> +++ b/include/uapi/linux/umcg.h
> @@ -0,0 +1,137 @@
> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_UMCG_H
> +#define _UAPI_LINUX_UMCG_H
> +
> +#include <linux/limits.h>
> +#include <linux/types.h>
> +
> +/*
> + * UMCG: User Managed Concurrency Groups.
> + *
> + * Syscalls (see kernel/sched/umcg.c):
> + * sys_umcg_ctl() - register/unregister UMCG tasks;
> + * sys_umcg_wait() - wait/wake/context-switch.
> + *
> + * struct umcg_task (below): controls the state of UMCG tasks.
> + *
> + * See Documentation/userspace-api/umcg.[txt|rst] for detals.
> + */
> +
> +/*
> + * UMCG task states, the first 6 bits of struct umcg_task.state_ts.
> + * The states represent the user space point of view.
> + */
> +#define UMCG_TASK_NONE 0ULL
> +#define UMCG_TASK_RUNNING 1ULL
> +#define UMCG_TASK_IDLE 2ULL
> +#define UMCG_TASK_BLOCKED 3ULL
> +
> +/* UMCG task state flags, bits 7-8 */
> +
> +/*
> + * UMCG_TF_LOCKED: locked by the userspace in preparation to calling umcg_wait.
> + */
> +#define UMCG_TF_LOCKED (1ULL << 6)
> +
> +/*
> + * UMCG_TF_PREEMPTED: the userspace indicates the worker should be preempted.
> + */
> +#define UMCG_TF_PREEMPTED (1ULL << 7)
> +
> +/* The first six bits: RUNNING, IDLE, or BLOCKED. */
> +#define UMCG_TASK_STATE_MASK 0x3fULL
> +
> +/* The full kernel state mask: the first 13 bits. */
> +#define UMCG_TASK_STATE_MASK_FULL 0x1fffULL
> +
> +/*
> + * The number of bits reserved for UMCG state timestamp in
> + * struct umcg_task.state_ts.
> + */
> +#define UMCG_STATE_TIMESTAMP_BITS 46
> +
> +/* The number of bits truncated from UMCG state timestamp. */
> +#define UMCG_STATE_TIMESTAMP_GRANULARITY 4
> +
> +/**
> + * struct umcg_task - controls the state of UMCG tasks.
> + *
> + * The struct is aligned at 64 bytes to ensure that it fits into
> + * a single cache line.
> + */
> +struct umcg_task {
> + /**
> + * @state_ts: the current state of the UMCG task described by
> + * this struct, with a unique timestamp indicating
> + * when the last state change happened.
> + *
> + * Readable/writable by both the kernel and the userspace.
> + *
> + * UMCG task state:
> + * bits 0 - 5: task state;
> + * bits 6 - 7: state flags;
> + * bits 8 - 12: reserved; must be zeroes;
> + * bits 13 - 17: for userspace use;
> + * bits 18 - 63: timestamp (see below).
> + *
> + * Timestamp: a 46-bit CLOCK_MONOTONIC timestamp, at 16ns resolution.
> + * See Documentation/userspace-api/umcg.[txt|rst] for detals.
^^^^^^^
details.
> + */
> + uint64_t state_ts; /* r/w */
> +
> + /**
> + * @next_tid: the TID of the UMCG task that should be context-switched
> + * into in sys_umcg_wait(). Can be zero.
> + *
> + * Running UMCG workers must have next_tid set to point to IDLE
> + * UMCG servers.
> + *
> + * Read-only for the kernel, read/write for the userspace.
> + */
> + uint32_t next_tid; /* r */
> +
> + uint32_t flags; /* Reserved; must be zero. */
> +
> + /**
> + * @idle_workers_ptr: a single-linked list of idle workers. Can be NULL.
> + *
> + * Readable/writable by both the kernel and the userspace: the
> + * kernel adds items to the list, the userspace removes them.
> + */
> + uint64_t idle_workers_ptr; /* r/w */
> +
> + /**
> + * @idle_server_tid_ptr: a pointer pointing to a single idle server.
> + * Readonly.
> + */
> + uint64_t idle_server_tid_ptr; /* r */
> +} __attribute__((packed, aligned(8 * sizeof(__u64))));
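
(As a reader aid: splitting a state_ts word per the layout documented above
is just a few masks and shifts; a sketch using the constants from this
header.)

static inline uint64_t umcg_state(uint64_t state_ts)
{
	return state_ts & UMCG_TASK_STATE_MASK;			/* bits 0-5 */
}

static inline uint64_t umcg_state_flags(uint64_t state_ts)
{
	return state_ts & (UMCG_TF_LOCKED | UMCG_TF_PREEMPTED);	/* bits 6-7 */
}

static inline uint64_t umcg_timestamp_ns(uint64_t state_ts)
{
	/* 46-bit timestamp in bits 18-63, in units of 16 ns (1 << 4). */
	return (state_ts >> (64 - UMCG_STATE_TIMESTAMP_BITS))
			<< UMCG_STATE_TIMESTAMP_GRANULARITY;
}
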
> +
> +/**
> + * enum umcg_ctl_flag - flags to pass to sys_umcg_ctl
> + * @UMCG_CTL_REGISTER: register the current task as a UMCG task
> + * @UMCG_CTL_UNREGISTER: unregister the current task as a UMCG task
> + * @UMCG_CTL_WORKER: register the current task as a UMCG worker
> + */
> +enum umcg_ctl_flag {
> + UMCG_CTL_REGISTER = 0x00001,
> + UMCG_CTL_UNREGISTER = 0x00002,
> + UMCG_CTL_WORKER = 0x10000,
> +};
> +
> +/**
> + * enum umcg_wait_flag - flags to pass to sys_umcg_wait
> + * @UMCG_WAIT_WAKE_ONLY: wake @self->next_tid, don't put @self to sleep;
> + * @UMCG_WAIT_WF_CURRENT_CPU: wake @self->next_tid on the current CPU
> + * (use WF_CURRENT_CPU); @UMCG_WAIT_WAKE_ONLY
> + * must be set.
> + */
> +enum umcg_wait_flag {
> + UMCG_WAIT_WAKE_ONLY = 1,
> + UMCG_WAIT_WF_CURRENT_CPU = 2,
> +};
> +
> +/* See Documentation/userspace-api/umcg.[txt|rst].*/
> +#define UMCG_IDLE_NODE_PENDING (1ULL)
> +
> +#endif /* _UAPI_LINUX_UMCG_H */
> diff --git a/init/Kconfig b/init/Kconfig
> index 11f8a845f259..b52a79cfb130 100644
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -1688,6 +1688,16 @@ config MEMBARRIER
>
> If unsure, say Y.
>
> +config UMCG
> + bool "Enable User Managed Concurrency Groups API"
> + depends on X86_64
> + default n
> + help
> + Enable User Managed Concurrency Groups API, which form the basis
> + for an in-process M:N userspace scheduling framework.
> + At the moment this is an experimental/RFC feature that is not
> + guaranteed to be backward-compatible.
> +
> config KALLSYMS
> bool "Load all symbols for debugging/ksymoops" if EXPERT
> default y
> diff --git a/kernel/entry/common.c b/kernel/entry/common.c
> index d5a61d565ad5..62453772a0c7 100644
> --- a/kernel/entry/common.c
> +++ b/kernel/entry/common.c
> @@ -171,8 +171,10 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> if (ti_work & (_TIF_SIGPENDING | _TIF_NOTIFY_SIGNAL))
> handle_signal_work(regs, ti_work);
>
> - if (ti_work & _TIF_NOTIFY_RESUME)
> + if (ti_work & _TIF_NOTIFY_RESUME) {
> + umcg_handle_notify_resume();
> tracehook_notify_resume(regs);
> + }
>
> /* Architecture specific TIF work */
> arch_exit_to_user_mode_work(regs, ti_work);
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 63851320ae73..c55f9df430c8 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -745,6 +745,10 @@ void __noreturn do_exit(long code)
> if (unlikely(!tsk->pid))
> panic("Attempted to kill the idle task!");
>
> + /* Turn off UMCG sched hooks. */
> + if (unlikely(tsk->flags & PF_UMCG_WORKER))
> + tsk->flags &= ~PF_UMCG_WORKER;
> +
> /*
> * If do_exit is called because this processes oopsed, it's possible
> * that get_fs() was left as KERNEL_DS, so reset it to USER_DS before
> @@ -781,6 +785,7 @@ void __noreturn do_exit(long code)
>
> io_uring_files_cancel();
> exit_signals(tsk); /* sets PF_EXITING */
> + umcg_handle_exit();
>
> /* sync mm's RSS info before statistics gathering */
> if (tsk->mm)
> diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
> index 978fcfca5871..e4e481eee1b7 100644
> --- a/kernel/sched/Makefile
> +++ b/kernel/sched/Makefile
> @@ -37,3 +37,4 @@ obj-$(CONFIG_MEMBARRIER) += membarrier.o
> obj-$(CONFIG_CPU_ISOLATION) += isolation.o
> obj-$(CONFIG_PSI) += psi.o
> obj-$(CONFIG_SCHED_CORE) += core_sched.o
> +obj-$(CONFIG_UMCG) += umcg.o
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d6da1efb5ce6..9ff63e32544a 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -4236,6 +4236,7 @@ static void __sched_fork(unsigned long clone_flags, struct task_struct *p)
> p->wake_entry.u_flags = CSD_TYPE_TTWU;
> p->migration_pending = NULL;
> #endif
> + umcg_clear_child(p);
> }
>
> DEFINE_STATIC_KEY_FALSE(sched_numa_balancing);
> @@ -6265,9 +6266,11 @@ static inline void sched_submit_work(struct task_struct *tsk)
> * If a worker goes to sleep, notify and ask workqueue whether it
> * wants to wake up a task to maintain concurrency.
> */
> - if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> + if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
> if (task_flags & PF_WQ_WORKER)
> wq_worker_sleeping(tsk);
> + else if (task_flags & PF_UMCG_WORKER)
> + umcg_wq_worker_sleeping(tsk);
> else
> io_wq_worker_sleeping(tsk);
> }
> @@ -6285,9 +6288,11 @@ static inline void sched_submit_work(struct task_struct *tsk)
>
> static void sched_update_worker(struct task_struct *tsk)
> {
> - if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) {
> + if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER | PF_UMCG_WORKER)) {
> if (tsk->flags & PF_WQ_WORKER)
> wq_worker_running(tsk);
> + else if (tsk->flags & PF_UMCG_WORKER)
> + umcg_wq_worker_running(tsk);
> else
> io_wq_worker_running(tsk);
> }
> diff --git a/kernel/sched/umcg.c b/kernel/sched/umcg.c
> new file mode 100644
> index 000000000000..bc4eeb3f5dd7
> --- /dev/null
> +++ b/kernel/sched/umcg.c
> @@ -0,0 +1,926 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +
> +/*
> + * User Managed Concurrency Groups (UMCG).
> + *
> + * See Documentation/userspace-api/umcg.[txt|rst] for detals.
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/types.h>
> +#include <linux/uaccess.h>
> +#include <linux/umcg.h>
> +
> +#include "sched.h"
> +
> +/**
> + * get_user_nofault - get user value without sleeping.
> + *
> + * get_user() might sleep and therefore cannot be used in preempt-disabled
> + * regions.
> + */
> +#define get_user_nofault(out, uaddr) \
> +({ \
> + int ret = -EFAULT; \
> + \
> + if (access_ok((uaddr), sizeof(*(uaddr)))) { \
> + pagefault_disable(); \
> + \
> + if (!__get_user((out), (uaddr))) \
> + ret = 0; \
> + \
> + pagefault_enable(); \
> + } \
> + ret; \
> +})
> +
> +/**
> + * umcg_pin_pages: pin pages containing struct umcg_task of this worker
> + * and its server.
> + *
> + * The pages are pinned when the worker exits to the userspace and unpinned
> + * when the worker is in sched_submit_work(), i.e. when the worker is
> + * about to be removed from its runqueue. Thus at most NR_CPUS UMCG pages
> + * are pinned at any one time across the whole system.
> + *
> + * The pinning is needed so that going-to-sleep workers can access
> + * their and their servers' userspace umcg_task structs without page faults,
> + * as the code path can be executed in the context of a pagefault, with
> + * mm lock held.
> + */
> +static int umcg_pin_pages(u32 server_tid)
> +{
> + struct umcg_task __user *worker_ut = current->umcg_task;
> + struct umcg_task __user *server_ut = NULL;
> + struct task_struct *tsk;
> +
> + rcu_read_lock();
> + tsk = find_task_by_vpid(server_tid);
> + /* Server/worker interaction is allowed only within the same mm. */
> + if (tsk && current->mm == tsk->mm)
> + server_ut = READ_ONCE(tsk->umcg_task);
> + rcu_read_unlock();
> +
> + if (!server_ut)
> + return -EINVAL;
> +
> + tsk = current;
> +
> + /* worker_ut is stable, don't need to repin */
> + if (!tsk->pinned_umcg_worker_page)
> + if (1 != pin_user_pages_fast((unsigned long)worker_ut, 1, 0,
> + &tsk->pinned_umcg_worker_page))
> + return -EFAULT;
> +
> + /* server_ut may change, need to repin */
> + if (tsk->pinned_umcg_server_page) {
> + unpin_user_page(tsk->pinned_umcg_server_page);
> + tsk->pinned_umcg_server_page = NULL;
> + }
> +
> + if (1 != pin_user_pages_fast((unsigned long)server_ut, 1, 0,
> + &tsk->pinned_umcg_server_page))
> + return -EFAULT;
> +
> + return 0;
> +}
> +
> +static void umcg_unpin_pages(void)
> +{
> + struct task_struct *tsk = current;
> +
> + if (tsk->pinned_umcg_worker_page)
> + unpin_user_page(tsk->pinned_umcg_worker_page);
> + if (tsk->pinned_umcg_server_page)
> + unpin_user_page(tsk->pinned_umcg_server_page);
> +
> + tsk->pinned_umcg_worker_page = NULL;
> + tsk->pinned_umcg_server_page = NULL;
> +}
> +
> +static void umcg_clear_task(struct task_struct *tsk)
> +{
> + /*
> + * This is either called for the current task, or for a newly forked
> + * task that is not yet running, so we don't need strict atomicity
> + * below.
> + */
> + if (tsk->umcg_task) {
> + WRITE_ONCE(tsk->umcg_task, NULL);
> +
> + /* These can be simple writes - see the commment above. */
> + tsk->pinned_umcg_worker_page = NULL;
> + tsk->pinned_umcg_server_page = NULL;
> + tsk->flags &= ~PF_UMCG_WORKER;
> + }
> +}
> +
> +/* Called for a forked or execve-ed child. */
> +void umcg_clear_child(struct task_struct *tsk)
> +{
> + umcg_clear_task(tsk);
> +}
> +
> +/* Called both by normally (unregister) and abnormally exiting workers. */
> +void umcg_handle_exiting_worker(void)
> +{
> + umcg_unpin_pages();
> + umcg_clear_task(current);
> +}
> +
> +/**
> + * umcg_update_state: atomically update umcg_task.state_ts, set new timestamp.
> + * @state_ts - points to the state_ts member of struct umcg_task to update;
> + * @expected - the expected value of state_ts, including the timestamp;
> + * @desired - the desired value of state_ts, state part only;
> + * @may_fault - whether to use normal or _nofault cmpxchg.
> + *
> + * The function is basically cmpxchg(state_ts, expected, desired), with extra
> + * code to set the timestamp in @desired.
> + */
> +static int umcg_update_state(u64 __user *state_ts, u64 *expected, u64 desired,
> + bool may_fault)
> +{
> + u64 curr_ts = (*expected) >> (64 - UMCG_STATE_TIMESTAMP_BITS);
> + u64 next_ts = ktime_get_ns() >> UMCG_STATE_TIMESTAMP_GRANULARITY;
> +
> + /* Cut higher order bits. */
> + next_ts &= ((1ULL << UMCG_STATE_TIMESTAMP_BITS) - 1);
> +
> + if (next_ts == curr_ts)
> + ++next_ts;
> +
> + /* Remove an old timestamp, if any. */
> + desired &= ((1ULL << (64 - UMCG_STATE_TIMESTAMP_BITS)) - 1);
> +
> + /* Set the new timestamp. */
> + desired |= (next_ts << (64 - UMCG_STATE_TIMESTAMP_BITS));
> +
> + if (may_fault)
> + return cmpxchg_user_64(state_ts, expected, desired);
> +
> + return cmpxchg_user_64_nofault(state_ts, expected, desired);
> +}
> +
> +/**
> + * sys_umcg_ctl: (un)register the current task as a UMCG task.
> + * @flags: ORed values from enum umcg_ctl_flag; see below;
> + * @self: a pointer to struct umcg_task that describes this
> + * task and governs the behavior of sys_umcg_wait if
> + * registering; must be NULL if unregistering.
> + *
> + * @flags & UMCG_CTL_REGISTER: register a UMCG task:
> + * UMCG workers:
> + * - @flags & UMCG_CTL_WORKER
> + * UMCG servers:
> + * - !(@flags & UMCG_CTL_WORKER)
> + *
> + * All tasks:
> + * - self->state must be UMCG_TASK_RUNNING
> + * - self->next_tid must be zero
> + *
> + * If the conditions above are met, sys_umcg_ctl() immediately returns
> + * if the registered task is a server; a worker will be added to
> + * idle_workers_ptr, and the worker put to sleep; an idle server
> + * from idle_server_tid_ptr will be woken, if present.
> + *
> + * @flags == UMCG_CTL_UNREGISTER: unregister a UMCG task. If the current task
> + * is a UMCG worker, the userspace is responsible for waking its
> + * server (before or after calling sys_umcg_ctl).
> + *
> + * Return:
> + * 0 - success
> + * -EFAULT - failed to read @self
> + * -EINVAL - some other error occurred
> + */
> +SYSCALL_DEFINE2(umcg_ctl, u32, flags, struct umcg_task __user *, self)
> +{
> + struct umcg_task ut;
> +
> + if (flags == UMCG_CTL_UNREGISTER) {
> + if (self || !current->umcg_task)
> + return -EINVAL;
> +
> + if (current->flags & PF_UMCG_WORKER)
> + umcg_handle_exiting_worker();
> + else
> + umcg_clear_task(current);
> +
> + return 0;
> + }
> +
> + /* Register the current task as a UMCG task. */
> + if (!(flags & UMCG_CTL_REGISTER))
> + return -EINVAL;
> +
> + flags &= ~UMCG_CTL_REGISTER;
> + if (flags && flags != UMCG_CTL_WORKER)
> + return -EINVAL;
> +
> + if (current->umcg_task || !self)
> + return -EINVAL;
> +
> + if (copy_from_user(&ut, self, sizeof(ut)))
> + return -EFAULT;
> +
> + if (ut.next_tid)
> + return -EINVAL;
> +
> + if ((ut.state_ts & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_RUNNING)
> + return -EINVAL;
> +
> + WRITE_ONCE(current->umcg_task, self);
> +
> + if (flags == UMCG_CTL_WORKER) {
> + current->flags |= PF_UMCG_WORKER;
> +
> + /* Trigger umcg_handle_resuming_worker() */
> + set_tsk_thread_flag(current, TIF_NOTIFY_RESUME);
> + }
> +
> + return 0;
> +}
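
(Continuing the userspace sketches from my earlier note: per the checks at
the top of this function, unregistering is just the UNREGISTER flag with a
NULL pointer, and a worker is expected to wake its own server before or
after the call.)

/* Sketch: unregister the current task; @self must be NULL for UNREGISTER. */
static long umcg_unregister(void)
{
	return syscall(__NR_umcg_ctl, UMCG_CTL_UNREGISTER, NULL);
}
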
> +
> +/**
> + * handle_timedout_worker - make sure the worker is added to idle_workers
> + * upon a "clean" timeout.
> + */
> +static int handle_timedout_worker(struct umcg_task __user *self)
> +{
> + u64 curr_state, next_state;
> + int ret;
> +
> + if (get_user(curr_state, &self->state_ts))
> + return -EFAULT;
> +
> + if ((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE) {
> + /* TODO: should we care here about TF_LOCKED or TF_PREEMPTED? */
> +
> + next_state = curr_state & ~UMCG_TASK_STATE_MASK;
> + next_state |= UMCG_TASK_BLOCKED;
> +
> + ret = umcg_update_state(&self->state_ts, &curr_state, next_state, true);
> + if (ret)
> + return ret;
> +
> + return -ETIMEDOUT;
> + }
> +
> + return 0; /* Not really timed out. */
> +}
> +
> +/**
> + * umcg_idle_loop - sleep until the current task becomes RUNNING or a timeout
> + * @abs_timeout - absolute timeout in nanoseconds; zero => no timeout
> + *
> + * The function marks the current task as INTERRUPTIBLE and calls
> + * freezable_schedule(). It returns when either the timeout expires or
The previous version used schedule() here, and there was a comment at this
spot; I don't know the freezable machinery well enough to judge the change.
> + * the UMCG state of the task becomes RUNNING.
> + *
> + * Note: because UMCG workers should not be running WITHOUT attached servers,
> + * and because servers should not be running WITH attached workers,
> + * the function returns only on fatal signal pending and ignores/flushes
> + * all other signals.
> + */
> +static int umcg_idle_loop(u64 abs_timeout)
> +{
> + int ret;
> + struct page *pinned_page = NULL;
> + struct hrtimer_sleeper timeout;
> + struct umcg_task __user *self = current->umcg_task;
> +
> + if (abs_timeout) {
> + hrtimer_init_sleeper_on_stack(&timeout, CLOCK_REALTIME,
> + HRTIMER_MODE_ABS);
> +
> + hrtimer_set_expires_range_ns(&timeout.timer, (s64)abs_timeout,
> + current->timer_slack_ns);
> + }
> +
> + while (true) {
> + u64 umcg_state;
> +
> + /*
> + * We need to read from userspace _after_ the task is marked
> + * TASK_INTERRUPTIBLE, to properly handle concurrent wakeups;
> + * but faulting is not allowed; so we try a fast no-fault read,
> + * and if it fails, pin the page temporarily.
> + */
> +retry_once:
> + set_current_state(TASK_INTERRUPTIBLE);
> +
> + /* Order set_current_state above with get_user_nofault below. */
> + smp_mb();
> + ret = -EFAULT;
> + if (get_user_nofault(umcg_state, &self->state_ts)) {
> + set_current_state(TASK_RUNNING);
> +
> + if (pinned_page)
> + goto out;
> + else if (1 != pin_user_pages_fast((unsigned long)self,
> + 1, 0, &pinned_page))
> + goto out;
> +
> + goto retry_once;
> + }
> +
> + if (pinned_page) {
> + unpin_user_page(pinned_page);
> + pinned_page = NULL;
> + }
> +
> + ret = 0;
> + if ((umcg_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
> + set_current_state(TASK_RUNNING);
> + goto out;
> + }
> +
> + if (abs_timeout)
> + hrtimer_sleeper_start_expires(&timeout, HRTIMER_MODE_ABS);
> +
> + if (!abs_timeout || timeout.task) {
> + /* Clear PF_UMCG_WORKER to elide workqueue handlers. */
> + const bool worker = current->flags & PF_UMCG_WORKER;
> +
> + if (worker)
> + current->flags &= ~PF_UMCG_WORKER;
> +
> + freezable_schedule();
> +
> + if (worker)
> + current->flags |= PF_UMCG_WORKER;
> + }
> + __set_current_state(TASK_RUNNING);
> +
> + /*
> + * Check for timeout before checking the state, as workers
> + * are not going to return from schedule() unless
^^^^^^^^^^
freezable_schedule()
You switched to freezable_schedule() in this version, so the comment should
say freezable_schedule(); the comment from the previous version was also
lost. Just a small note.
> + * they are RUNNING.
> + */
> + ret = -ETIMEDOUT;
> + if (abs_timeout && !timeout.task)
> + goto out;
> +
> + ret = -EFAULT;
> + if (get_user(umcg_state, &self->state_ts))
> + goto out;
> +
> + ret = 0;
> + if ((umcg_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING)
> + goto out;
> +
> + ret = -EINTR;
> + if (fatal_signal_pending(current))
> + goto out;
> +
> + if (signal_pending(current))
> + flush_signals(current);
> + }
> +
> +out:
> + if (pinned_page) {
> + unpin_user_page(pinned_page);
> + pinned_page = NULL;
> + }
> +
> + if (abs_timeout) {
> + hrtimer_cancel(&timeout.timer);
> + destroy_hrtimer_on_stack(&timeout.timer);
> + }
> +
> + /* Workers must go through workqueue handlers upon wakeup. */
> + if (current->flags & PF_UMCG_WORKER) {
> + if (ret == -ETIMEDOUT)
> + ret = handle_timedout_worker(self);
> +
> + set_tsk_need_resched(current);
> + }
> +
> + return ret;
> +}
> +
> +/**
> + * umcg_wakeup_allowed - check whether @current can wake @tsk.
> + *
> + * Currently a placeholder that allows wakeups within a single process
> + * only (same mm). In the future the requirement will be relaxed (securely).
> + */
> +static bool umcg_wakeup_allowed(struct task_struct *tsk)
> +{
> + WARN_ON_ONCE(!rcu_read_lock_held());
> +
> + if (tsk->mm && tsk->mm == current->mm && READ_ONCE(tsk->umcg_task))
> + return true;
> +
> + return false;
> +}
> +
> +/*
> + * Try to wake up. May be called with preempt_disable set. May be called
> + * cross-process.
> + *
> + * Note: umcg_ttwu succeeds even if ttwu fails: see wait/wake state
> + * ordering logic.
> + */
> +static int umcg_ttwu(u32 next_tid, int wake_flags)
> +{
> + struct task_struct *next;
> +
> + rcu_read_lock();
> + next = find_task_by_vpid(next_tid);
> + if (!next || !umcg_wakeup_allowed(next)) {
> + rcu_read_unlock();
> + return -ESRCH;
> + }
> +
> + /* The result of ttwu below is ignored. */
> + try_to_wake_up(next, TASK_NORMAL, wake_flags);
> + rcu_read_unlock();
> +
> + return 0;
> +}
> +
> +/*
> + * At the moment, umcg_do_context_switch simply wakes up @next with
> + * WF_CURRENT_CPU and puts the current task to sleep.
> + *
> + * In the future an optimization will be added to adjust runtime accounting
> + * so that from the kernel scheduling perspective the two tasks are
> + * essentially treated as one. In addition, the context switch may be performed
> + * right here on the fast path, instead of going through the wake/wait pair.
> + */
> +static int umcg_do_context_switch(u32 next_tid, u64 abs_timeout)
> +{
> + int ret;
> +
> + ret = umcg_ttwu(next_tid, WF_CURRENT_CPU);
> + if (ret)
> + return ret;
> +
> + return umcg_idle_loop(abs_timeout);
> +}
> +
> +/**
> + * sys_umcg_wait: put the current task to sleep and/or wake another task.
> + * @flags: zero or a value from enum umcg_wait_flag.
> + * @abs_timeout: when to wake the task, in nanoseconds; zero for no timeout.
> + *
> + * @self->state_ts must be UMCG_TASK_IDLE (where @self is current->umcg_task)
> + * if !(@flags & UMCG_WAIT_WAKE_ONLY).
> + *
> + * If @self->next_tid is not zero, it must point to an IDLE UMCG task.
> + * The userspace must have changed its state from IDLE to RUNNING
> + * before calling sys_umcg_wait() in the current task. This "next"
> + * task will be woken (context-switched-to on the fast path) when the
> + * current task is put to sleep.
> + *
> + * See Documentation/userspace-api/umcg.[txt|rst] for detals.
> + *
> + * Return:
> + * 0 - OK;
> + * -ETIMEDOUT - the timeout expired;
> + * -EFAULT - failed accessing struct umcg_task __user of the current
> + * task;
> + * -ESRCH - the task to wake not found or not a UMCG task;
> + * -EINVAL - another error happened (e.g. bad @flags, or the current
> + * task is not a UMCG task, etc.)
> + */
> +SYSCALL_DEFINE2(umcg_wait, u32, flags, u64, abs_timeout)
> +{
> + struct umcg_task __user *self = current->umcg_task;
> + u32 next_tid;
> +
> + if (!self)
> + return -EINVAL;
> +
> + if (get_user(next_tid, &self->next_tid))
> + return -EFAULT;
> +
> + if (flags & UMCG_WAIT_WAKE_ONLY) {
> + if (!next_tid || abs_timeout)
> + return -EINVAL;
> +
> + flags &= ~UMCG_WAIT_WAKE_ONLY;
> + if (flags & ~UMCG_WAIT_WF_CURRENT_CPU)
> + return -EINVAL;
> +
> + return umcg_ttwu(next_tid, flags & UMCG_WAIT_WF_CURRENT_CPU ?
> + WF_CURRENT_CPU : 0);
> + }
> +
> + /* Unlock the worker, if locked. */
> + if (current->flags & PF_UMCG_WORKER) {
> + u64 umcg_state;
> +
> + if (get_user(umcg_state, &self->state_ts))
> + return -EFAULT;
> +
> + if ((umcg_state & UMCG_TF_LOCKED) && umcg_update_state(
> + &self->state_ts, &umcg_state,
> + umcg_state & ~UMCG_TF_LOCKED, true))
> + return -EFAULT;
> + }
> +
> + if (next_tid)
> + return umcg_do_context_switch(next_tid, abs_timeout);
> +
> + return umcg_idle_loop(abs_timeout);
> +}
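
(One more userspace sketch, for the wake-only path handled above: set
next_tid, pass a zero timeout, and optionally ask for a current-CPU wakeup.
The target is expected to have been moved IDLE -> RUNNING by the userspace
scheduler beforehand.)

/* Sketch: wake self->next_tid without putting the caller to sleep. */
static long umcg_wake(struct umcg_task *self, uint32_t target_tid)
{
	self->next_tid = target_tid;

	/* abs_timeout must be zero on the wake-only path. */
	return syscall(__NR_umcg_wait,
		       UMCG_WAIT_WAKE_ONLY | UMCG_WAIT_WF_CURRENT_CPU, 0);
}
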
> +
> +/*
> + * NOTE: all code below is called from workqueue submit/update, or
> + * syscall exit to usermode loop, so all errors result in the
> + * termination of the current task (via SIGKILL).
> + */
> +
> +/*
> + * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu.
> + */
> +static int umcg_wake_idle_server_nofault(u32 server_tid)
> +{
> + struct umcg_task __user *ut_server = NULL;
> + struct task_struct *tsk;
> + int ret = -EINVAL;
> + u64 state;
> +
> + rcu_read_lock();
> +
> + tsk = find_task_by_vpid(server_tid);
> + /* Server/worker interaction is allowed only within the same mm. */
> + if (tsk && current->mm == tsk->mm)
> + ut_server = READ_ONCE(tsk->umcg_task);
> +
> + if (!ut_server)
> + goto out_rcu;
> +
> + ret = -EFAULT;
> + if (get_user_nofault(state, &ut_server->state_ts))
> + goto out_rcu;
> +
> + ret = -EAGAIN;
> + if ((state & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_IDLE)
> + goto out_rcu;
> +
> + ret = umcg_update_state(&ut_server->state_ts, &state,
> + UMCG_TASK_RUNNING, false);
> +
> + if (ret)
> + goto out_rcu;
> +
> + try_to_wake_up(tsk, TASK_NORMAL, WF_CURRENT_CPU);
> + ret = 0;
> +
> +out_rcu:
> + rcu_read_unlock();
> + return ret;
> +}
> +
> +/*
> + * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu.
> + */
> +static int umcg_wake_idle_server_may_fault(u32 server_tid)
> +{
> + struct umcg_task __user *ut_server = NULL;
> + struct task_struct *tsk;
> + int ret = -EINVAL;
> + u64 state;
> +
> + rcu_read_lock();
> + tsk = find_task_by_vpid(server_tid);
> + if (tsk && current->mm == tsk->mm)
> + ut_server = READ_ONCE(tsk->umcg_task);
> + rcu_read_unlock();
> +
> + if (!ut_server)
> + return -EINVAL;
> +
> + if (get_user(state, &ut_server->state_ts))
> + return -EFAULT;
> +
> + if ((state & UMCG_TASK_STATE_MASK_FULL) != UMCG_TASK_IDLE)
> + return -EAGAIN;
> +
> + ret = umcg_update_state(&ut_server->state_ts, &state,
> + UMCG_TASK_RUNNING, true);
> + if (ret)
> + return ret;
> +
> + /*
> + * umcg_ttwu will call find_task_by_vpid again; but we cannot
> + * elide this, as we cannot do get_user() from an rcu-locked
> + * code block.
> + */
> + return umcg_ttwu(server_tid, WF_CURRENT_CPU);
> +}
> +
> +/*
> + * Wake idle server: find the task, change its state IDLE=>RUNNING, ttwu.
> + */
> +static int umcg_wake_idle_server(u32 server_tid, bool may_fault)
> +{
> + int ret = umcg_wake_idle_server_nofault(server_tid);
> +
> + if (!ret)
> + return 0;
> +
> + if (!may_fault || ret != -EFAULT)
> + return ret;
> +
> + return umcg_wake_idle_server_may_fault(server_tid);
> +}
> +
> +/*
> + * Called in sched_submit_work() context for UMCG workers. In the common case,
> + * the worker's state changes RUNNING => BLOCKED, and its server's state
> + * changes IDLE => RUNNING, and the server is ttwu-ed.
> + *
> + * Under some conditions (e.g. the worker is "locked", see
> + * /Documentation/userspace-api/umcg.[txt|rst] for more details), the
> + * function does nothing.
> + *
> + * The function is called with preempt disabled to make sure the retry_once
> + * logic below works correctly.
> + */
> +static void process_sleeping_worker(struct task_struct *tsk, u32 *server_tid)
> +{
> + struct umcg_task __user *ut_worker = tsk->umcg_task;
> + u64 curr_state, next_state;
> + bool retried = false;
> + u32 tid;
> + int ret;
> +
> + *server_tid = 0;
> +
> + if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid UMCG worker."))
> + return;
> +
> + /* If the worker has no server, do nothing. */
> + if (unlikely(!tsk->pinned_umcg_server_page))
> + return;
> +
> + if (get_user_nofault(curr_state, &ut_worker->state_ts))
> + goto die;
> +
> + /*
> + * The userspace is allowed to concurrently change a RUNNING worker's
> + * state only once in a "short" period of time, so we retry state
> + * change at most once. As this retry block is within a
> + * preempt_disable region, "short" is truly short here.
> + *
> + * See Documentation/userspace-api/umcg.[txt|rst] for details.
> + */
> +retry_once:
> + if (curr_state & UMCG_TF_LOCKED)
> + return;
> +
> + if (WARN_ONCE((curr_state & UMCG_TASK_STATE_MASK) != UMCG_TASK_RUNNING,
> + "Unexpected UMCG worker state."))
> + goto die;
> +
> + next_state = curr_state & ~UMCG_TASK_STATE_MASK;
> + next_state |= UMCG_TASK_BLOCKED;
> +
> + ret = umcg_update_state(&ut_worker->state_ts, &curr_state, next_state, false);
> + if (ret == -EAGAIN) {
> + if (retried)
> + goto die;
> +
> + retried = true;
> + goto retry_once;
> + }
> + if (ret)
> + goto die;
> +
> + if (get_user_nofault(tid, &ut_worker->next_tid))
> + goto die;
> +
> + *server_tid = tid;
> + return;
> +
> +die:
> + pr_warn("%s: killing task %d\n", __func__, current->pid);
> + force_sig(SIGKILL);
> +}
> +
> +/* Called from sched_submit_work(). Must not fault/sleep. */
> +void umcg_wq_worker_sleeping(struct task_struct *tsk)
> +{
> + u32 server_tid;
> +
> + /*
> + * Disable preemption so that retry_once in process_sleeping_worker
> + * works properly.
> + */
> + preempt_disable();
> + process_sleeping_worker(tsk, &server_tid);
> + preempt_enable();
> +
> + if (server_tid) {
> + int ret = umcg_wake_idle_server_nofault(server_tid);
> +
> + if (ret && ret != -EAGAIN)
> + goto die;
> + }
> +
> + goto out;
> +
> +die:
> + pr_warn("%s: killing task %d\n", __func__, current->pid);
> + force_sig(SIGKILL);
> +out:
> + umcg_unpin_pages();
> +}
> +
> +/**
> + * enqueue_idle_worker - push an idle worker onto idle_workers_ptr list/stack.
> + *
> + * Returns true on success, false on a fatal failure.
> + *
> + * See Documentation/userspace-api/umcg.[txt|rst] for details.
> + */
> +static bool enqueue_idle_worker(struct umcg_task __user *ut_worker)
> +{
> + u64 __user *node = &ut_worker->idle_workers_ptr;
> + u64 __user *head_ptr;
> + u64 first = (u64)node;
> + u64 head;
> +
> + if (get_user(head, node) || !head)
> + return false;
> +
> + head_ptr = (u64 __user *)head;
> +
> + if (put_user(UMCG_IDLE_NODE_PENDING, node))
> + return false;
> +
> + if (xchg_user_64(head_ptr, &first))
> + return false;
> +
> + if (put_user(first, node))
> + return false;
> +
> + return true;
> +}
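
For other readers, my understanding of the push above: the worker's own
idle_workers_ptr field doubles as the list node. On entry it holds the
address of the shared list head; the kernel overwrites it with
UMCG_IDLE_NODE_PENDING, xchg()s the shared head with the node's address,
and finally stores the old head value back into the node, completing the
link. So a consumer that sees UMCG_IDLE_NODE_PENDING in a node has to treat
the link as not written yet; the exact consumer protocol is in the
Documentation patch.
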
> +
> +/**
> + * get_idle_server - retrieve an idle server, if present.
> + *
> + * Returns true on success, false on a fatal failure.
> + */
> +static bool get_idle_server(struct umcg_task __user *ut_worker, u32 *server_tid)
> +{
> + u64 server_tid_ptr;
> + u32 tid;
> +
> + /* Empty result is OK. */
> + *server_tid = 0;
> +
> + if (get_user(server_tid_ptr, &ut_worker->idle_server_tid_ptr))
> + return false;
> +
> + if (!server_tid_ptr)
> + return false;
> +
> + tid = 0;
> + if (xchg_user_32((u32 __user *)server_tid_ptr, &tid))
> + return false;
> +
> + *server_tid = tid;
> + return true;
> +}
> +
> +/*
> + * Returns true to wait for the userspace to schedule this worker, false
> + * to return to the userspace.
> + *
> + * In the common case, a BLOCKED worker is marked IDLE and enqueued
> + * to idle_workers_ptr list. The idle server is woken (if present).
> + *
> + * If a RUNNING worker is preempted, this function will trigger, in which
> + * case the worker is moved to IDLE state and its server is woken.
> + *
> + * Sets @server_tid to point to the server to be woken if the worker
> + * is going to sleep; sets @server_tid to point to the server assigned
> + * to this RUNNING worker if the worker is to return to the userspace.
> + */
> +static bool process_waking_worker(struct task_struct *tsk, u32 *server_tid)
> +{
> + struct umcg_task __user *ut_worker = tsk->umcg_task;
> + u64 curr_state, next_state;
> +
> + *server_tid = 0;
> +
> + if (WARN_ONCE((tsk != current) || !ut_worker, "Invalid umcg worker"))
> + return false;
> +
> + if (fatal_signal_pending(tsk))
> + return false;
> +
> + if (get_user(curr_state, &ut_worker->state_ts))
> + goto die;
> +
> + if ((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_RUNNING) {
> + u32 tid;
> +
> + /* Wakeup: wait but don't enqueue. */
> + if (curr_state & UMCG_TF_LOCKED)
> + return true;
> +
> + smp_rmb(); /* Order getting state and getting server_tid */
> + if (get_user(tid, &ut_worker->next_tid))
> + goto die;
> +
> + if (tid) {
> + *server_tid = tid;
> +
> + /* pass-through: RUNNING with a server. */
> + if (!(curr_state & UMCG_TF_PREEMPTED))
> + return false;
> + } else if (curr_state & UMCG_TF_PREEMPTED)
> + /* PREEMPTED workers must have servers. */
> + goto die;
> +
> + /*
> + * Fallthrough to mark the worker IDLE: the worker is
> + * PREEMPTED, or the worker is RUNNING, but has no server
> + * (which happens via UMCG_WAIT_WAKE_ONLY).
> + */
> + } else if (unlikely((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE &&
> + (curr_state & UMCG_TF_LOCKED)))
> + /* The worker prepares to sleep or to unregister. */
> + return false;
> +
> + if (unlikely((curr_state & UMCG_TASK_STATE_MASK) == UMCG_TASK_IDLE))
> + goto die;
> +
> + next_state = curr_state & ~UMCG_TASK_STATE_MASK;
> + next_state |= UMCG_TASK_IDLE;
> +
> + if (umcg_update_state(&ut_worker->state_ts, &curr_state,
> + next_state, true))
> + goto die;
> +
> + if (!enqueue_idle_worker(ut_worker))
> + goto die;
> +
> + smp_mb(); /* Order enqueuing the worker with getting the server. */
> + if (!(*server_tid) && !get_idle_server(ut_worker, server_tid))
> + goto die;
> +
> + return true;
> +
> +die:
> + pr_warn("umcg_process_waking_worker: killing task %d\n", current->pid);
> + force_sig(SIGKILL);
> + return false;
> +}
> +
> +/*
> + * Called from sched_update_worker(): defer all work until later, as
> + * sched_update_worker() may be called with in-kernel locks held.
> + */
> +void umcg_wq_worker_running(struct task_struct *tsk)
> +{
> + set_tsk_thread_flag(tsk, TIF_NOTIFY_RESUME);
> +}
> +
> +/* Called via TIF_NOTIFY_RESUME flag from exit_to_user_mode_loop. */
> +void umcg_handle_resuming_worker(void)
> +{
> + u32 server_tid;
> +
> + /* Avoid recursion by removing PF_UMCG_WORKER */
> + current->flags &= ~PF_UMCG_WORKER;
> +
> + do {
> + bool should_wait;
> +
> + should_wait = process_waking_worker(current, &server_tid);
> +
> + if (!should_wait)
> + break;
> +
> + if (server_tid) {
> + int ret = umcg_wake_idle_server(server_tid, true);
> +
> + if (ret && ret != -EAGAIN)
> + goto die;
> + }
> +
> + umcg_idle_loop(0);
> + } while (true);
> +
> + if (!server_tid)
> + /* No server => no reason to pin pages. */
> + umcg_unpin_pages();
> + else if (umcg_pin_pages(server_tid))
> + goto die;
> +
> + goto out;
> +
> +die:
> + pr_warn("%s: killing task %d\n", __func__, current->pid);
> + force_sig(SIGKILL);
> +out:
> + current->flags |= PF_UMCG_WORKER;
> +}
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index f43d89d92860..682261d78ee7 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -272,6 +272,10 @@ COND_SYSCALL(landlock_create_ruleset);
> COND_SYSCALL(landlock_add_rule);
> COND_SYSCALL(landlock_restrict_self);
>
> +/* kernel/sched/umcg.c */
> +COND_SYSCALL(umcg_ctl);
> +COND_SYSCALL(umcg_wait);
> +
> /* arch/example/kernel/sys_example.c */
>
> /* mm/fadvise.c */
> --
> 2.25.1
>
Thanks,
Tao