Message-ID: <20251201124205.11169-14-yurand2000@gmail.com>
Date: Mon, 1 Dec 2025 13:41:46 +0100
From: Yuri Andriaccio <yurand2000@...il.com>
To: Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Juri Lelli <juri.lelli@...hat.com>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>
Cc: linux-kernel@...r.kernel.org,
Luca Abeni <luca.abeni@...tannapisa.it>,
Yuri Andriaccio <yuri.andriaccio@...tannapisa.it>
Subject: [RFC PATCH v4 13/28] sched/rt: Update task event callbacks for HCBS scheduling

Update wakeup_preempt_rt(), switched_{from/to}_rt() and prio_changed_rt() with
the rt-cgroup specific preemption rules:
- In wakeup_preempt_rt(), check whether the woken task is served by a deadline
  server or lives on the global runqueue. The preemption rules (documented in
  the function, and summarized in the sketch after this list) depend on the
  runqueues of the current task and of the woken task:
  - If both tasks are FIFO/RR tasks on the global runqueue, or in the same
    cgroup, run the standard checks.
  - If the woken task is inside a cgroup but curr is a FIFO/RR task on the
    global runqueue, always preempt. If curr is a DEADLINE task, check whether
    the woken task's dl server preempts curr.
  - If both tasks are FIFO/RR tasks served by different groups, check whether
    the woken task's server preempts the current task's server.
- In switched_from_rt(), perform a pull only on the global runqueue and do
  nothing if the task is inside a group. This will change when migrations are
  added.
- In switched_to_rt(), queue a push only on the global runqueue, and perform a
  priority check against curr when the switching task is inside a group. This
  too will change when migrations are added.
- In prio_changed_rt(), queue a pull only on the global runqueue, when the
  priority of the running task decreases. If the task is queued but not
  running, run preemption checks only when the task whose priority changed
  and curr are in the same cgroup.
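
A condensed sketch (not part of the patch) of the decision flow that the
wakeup_preempt_rt() hunk below implements when group scheduling is enabled.
dl_se_of() is a hypothetical helper standing in for the
is_dl_group()/dl_group_of() lookups (plus curr's own dl entity when it has a
DEADLINE policy), returning NULL for tasks on the global runqueue:

  static bool hcbs_wakeup_preempts(struct rq *rq, struct task_struct *p)
  {
          struct sched_dl_entity *woken = dl_se_of(p);
          struct sched_dl_entity *curr  = dl_se_of(rq->curr);

          if (woken && curr) {
                  if (woken == curr)      /* same group: task priorities */
                          return p->prio < rq->curr->prio;
                  /* different groups: server vs. server */
                  return dl_entity_preempt(woken, curr);
          }
          if (woken)      /* grouped waker vs. FIFO/RR curr on the global rq */
                  return true;
          if (curr)       /* grouped or DEADLINE curr vs. waker on the global rq */
                  return false;
          /* both on the global runqueue: standard rule */
          return p->prio < rq->donor->prio;
  }
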
Update sched_rt_can_attach() to check whether a task can be attached to a
given cgroup. For now the check only verifies that the group has non-zero
bandwidth. Remove the tsk argument from sched_rt_can_attach(), as it is now
unused.

Change cpu_cgroup_can_attach() to check whether the task being attached is a
FIFO/RR task before applying the group bandwidth check.

Update __sched_setscheduler() to check, when switching a task inside a cgroup
to FIFO/RR, that the group has runtime allocated.
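
From userspace the visible effect is that sched_setscheduler() now fails for
a task sitting in an rt cgroup with no runtime allocated. A minimal
illustration, assuming the calling task was first attached to such a group
(cgroup setup not shown):

  #include <sched.h>
  #include <stdio.h>
  #include <errno.h>
  #include <string.h>

  int main(void)
  {
          struct sched_param sp = { .sched_priority = 10 };

          /* Assumes this task was previously attached to an rt cgroup whose
           * runtime is still zero; the patch makes this request fail. */
          if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1)
                  printf("SCHED_FIFO refused: %s\n", strerror(errno));

          return 0;
  }
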
Update task_is_throttled_rt() for SCHED_CORE: return the server's dl_throttled
value if the task runs in a group, while global rt-tasks are never throttled.
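
A minimal sketch of that query under CONFIG_RT_GROUP_SCHED, mirroring the hunk
below (WARN_ON elided); the _sketch suffix only marks it as illustration:

  static int task_is_throttled_rt_sketch(struct task_struct *p, int cpu)
  {
          /* A grouped rt task is throttled iff its per-CPU dl_server is. */
          struct rt_rq *rt_rq = task_group(p)->rt_rq[cpu];

          return dl_group_of(rt_rq)->dl_throttled;
  }
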
Co-developed-by: Alessio Balsini <a.balsini@...up.it>
Signed-off-by: Alessio Balsini <a.balsini@...up.it>
Co-developed-by: Andrea Parri <parri.andrea@...il.com>
Signed-off-by: Andrea Parri <parri.andrea@...il.com>
Co-developed-by: luca abeni <luca.abeni@...tannapisa.it>
Signed-off-by: luca abeni <luca.abeni@...tannapisa.it>
Signed-off-by: Yuri Andriaccio <yurand2000@...il.com>
---
kernel/sched/core.c | 2 +-
kernel/sched/rt.c | 104 +++++++++++++++++++++++++++++++++++-----
kernel/sched/sched.h | 2 +-
kernel/sched/syscalls.c | 13 +++++
4 files changed, 108 insertions(+), 13 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 495cbdfdc5..d7fc83cdae 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -9318,7 +9318,7 @@ static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
goto scx_check;
cgroup_taskset_for_each(task, css, tset) {
- if (!sched_rt_can_attach(css_tg(css), task))
+ if (rt_task(task) && !sched_rt_can_attach(css_tg(css)))
return -EINVAL;
}
scx_check:
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7ec117a18d..2b7c4b7754 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -960,7 +960,58 @@ static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
{
struct task_struct *donor = rq->donor;
+ struct sched_dl_entity *woken_dl_se = NULL;
+ struct sched_dl_entity *curr_dl_se = NULL;
+ if (!rt_group_sched_enabled())
+ goto no_group_sched;
+
+ /*
+ * Preemption checks are different if the waking task and the current task
+ * are running on the global runqueue or in a cgroup. The following rules
+ * apply:
+ * - dl-tasks (and equally dl_servers) always preempt FIFO/RR tasks.
+ * - if curr is a FIFO/RR task inside a cgroup (i.e. run by a
+ * dl_server), or curr is a DEADLINE task and waking is a FIFO/RR task
+ * on the root cgroup, do nothing.
+ * - if waking is inside a cgroup but curr is a FIFO/RR task in the root
+ * cgroup, always reschedule.
+ * - if they are both on the global runqueue, run the standard code.
+ * - if they are both in the same cgroup, check for tasks priorities.
+ * - if they are both in a cgroup, but not the same one, check whether the
+ * woken task's dl_server preempts the current's dl_server.
+ * - if curr is a DEADLINE task and waking is in a cgroup, check whether
+ * the woken task's server preempts curr.
+ */
+ if (is_dl_group(rt_rq_of_se(&p->rt)))
+ woken_dl_se = dl_group_of(rt_rq_of_se(&p->rt));
+ if (is_dl_group(rt_rq_of_se(&rq->curr->rt)))
+ curr_dl_se = dl_group_of(rt_rq_of_se(&rq->curr->rt));
+ else if (task_has_dl_policy(rq->curr))
+ curr_dl_se = &rq->curr->dl;
+
+ if (woken_dl_se != NULL && curr_dl_se != NULL) {
+ if (woken_dl_se == curr_dl_se) {
+ if (p->prio < rq->curr->prio)
+ resched_curr(rq);
+
+ return;
+ }
+
+ if (dl_entity_preempt(woken_dl_se, curr_dl_se))
+ resched_curr(rq);
+
+ return;
+
+ } else if (woken_dl_se != NULL) {
+ resched_curr(rq);
+ return;
+
+ } else if (curr_dl_se != NULL) {
+ return;
+ }
+
+no_group_sched:
if (p->prio < donor->prio) {
resched_curr(rq);
return;
@@ -1710,6 +1761,8 @@ static void rq_offline_rt(struct rq *rq)
*/
static void switched_from_rt(struct rq *rq, struct task_struct *p)
{
+ struct rt_rq *rt_rq = rt_rq_of_se(&p->rt);
+
/*
* If there are other RT tasks then we will reschedule
* and the scheduling of the other RT tasks will handle
@@ -1717,10 +1770,11 @@ static void switched_from_rt(struct rq *rq, struct task_struct *p)
* we may need to handle the pulling of RT tasks
* now.
*/
- if (!task_on_rq_queued(p) || rq->rt.rt_nr_running)
+ if (!task_on_rq_queued(p) || rt_rq->rt_nr_running)
return;
- rt_queue_pull_task(rt_rq_of_se(&p->rt));
+ if (!IS_ENABLED(CONFIG_RT_GROUP_SCHED) || !is_dl_group(rt_rq))
+ rt_queue_pull_task(rt_rq);
}
void __init init_sched_rt_class(void)
@@ -1740,6 +1794,8 @@ void __init init_sched_rt_class(void)
*/
static void switched_to_rt(struct rq *rq, struct task_struct *p)
{
+ struct rt_rq *rt_rq = rt_rq_of_se(&p->rt);
+
/*
* If we are running, update the avg_rt tracking, as the running time
* will now on be accounted into the latter.
@@ -1755,8 +1811,14 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
* then see if we can move to another run queue.
*/
if (task_on_rq_queued(p)) {
- if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
- rt_queue_push_tasks(rt_rq_of_se(&p->rt));
+ if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) && is_dl_group(rt_rq)) {
+ if (p->prio < rq->curr->prio)
+ resched_curr(rq);
+ } else {
+ if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
+ rt_queue_push_tasks(rt_rq_of_se(&p->rt));
+ }
+
if (p->prio < rq->donor->prio && cpu_online(cpu_of(rq)))
resched_curr(rq);
}
@@ -1769,6 +1831,8 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
static void
prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
{
+ struct rt_rq *rt_rq = rt_rq_of_se(&p->rt);
+
if (!task_on_rq_queued(p))
return;
@@ -1777,16 +1841,25 @@ prio_changed_rt(struct rq *rq, struct task_struct *p, int oldprio)
* If our priority decreases while running, we
* may need to pull tasks to this runqueue.
*/
- if (oldprio < p->prio)
- rt_queue_pull_task(rt_rq_of_se(&p->rt));
+ if (!IS_ENABLED(CONFIG_RT_GROUP_SCHED) && oldprio < p->prio)
+ rt_queue_pull_task(rt_rq);
/*
* If there's a higher priority task waiting to run
* then reschedule.
*/
- if (p->prio > rq->rt.highest_prio.curr)
+ if (p->prio > rt_rq->highest_prio.curr)
resched_curr(rq);
} else {
+ /*
+ * This task is not running, thus we check against the currently
+ * running task for preemption. We can preempt only if both tasks are
+ * in the same cgroup or on the global runqueue.
+ */
+ if (IS_ENABLED(CONFIG_RT_GROUP_SCHED) &&
+ rt_rq_of_se(&p->rt)->tg != rt_rq_of_se(&rq->curr->rt)->tg)
+ return;
+
/*
* This task is not running, but if it is
* greater than the current running task
@@ -1881,7 +1954,16 @@ static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
#ifdef CONFIG_SCHED_CORE
static int task_is_throttled_rt(struct task_struct *p, int cpu)
{
+#ifdef CONFIG_RT_GROUP_SCHED
+ struct rt_rq *rt_rq;
+
+ rt_rq = task_group(p)->rt_rq[cpu];
+ WARN_ON(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
+
+ return dl_group_of(rt_rq)->dl_throttled;
+#else
return 0;
+#endif
}
#endif /* CONFIG_SCHED_CORE */
@@ -2133,16 +2215,16 @@ static int sched_rt_global_constraints(void)
}
#endif /* CONFIG_SYSCTL */
-int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
+int sched_rt_can_attach(struct task_group *tg)
{
/* Don't accept real-time tasks when there is no way for them to run */
- if (rt_group_sched_enabled() && rt_task(tsk) && tg->rt_bandwidth.rt_runtime == 0)
+ if (rt_group_sched_enabled() && tg->dl_bandwidth.dl_runtime == 0)
return 0;
return 1;
}
-#else /* !CONFIG_RT_GROUP_SCHED: */
+#else /* !CONFIG_RT_GROUP_SCHED */
#ifdef CONFIG_SYSCTL
static int sched_rt_global_constraints(void)
@@ -2150,7 +2232,7 @@ static int sched_rt_global_constraints(void)
return 0;
}
#endif /* CONFIG_SYSCTL */
-#endif /* !CONFIG_RT_GROUP_SCHED */
+#endif /* CONFIG_RT_GROUP_SCHED */
#ifdef CONFIG_SYSCTL
static int sched_rt_global_validate(void)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fb4dcb4551..bc3ed02e40 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -606,7 +606,7 @@ extern int sched_group_set_rt_runtime(struct task_group *tg, long rt_runtime_us)
extern int sched_group_set_rt_period(struct task_group *tg, u64 rt_period_us);
extern long sched_group_rt_runtime(struct task_group *tg);
extern long sched_group_rt_period(struct task_group *tg);
-extern int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk);
+extern int sched_rt_can_attach(struct task_group *tg);
extern struct task_group *sched_create_group(struct task_group *parent);
extern void sched_online_group(struct task_group *tg,
diff --git a/kernel/sched/syscalls.c b/kernel/sched/syscalls.c
index 93a9c03b28..0b75126019 100644
--- a/kernel/sched/syscalls.c
+++ b/kernel/sched/syscalls.c
@@ -626,6 +626,19 @@ int __sched_setscheduler(struct task_struct *p,
change:
if (user) {
+#ifdef CONFIG_RT_GROUP_SCHED
+ /*
+ * Do not allow real-time tasks into groups that have no runtime
+ * assigned.
+ */
+ if (rt_group_sched_enabled() &&
+ dl_bandwidth_enabled() && rt_policy(policy) &&
+ !sched_rt_can_attach(task_group(p)) &&
+ !task_group_is_autogroup(task_group(p))) {
+ retval = -EPERM;
+ goto unlock;
+ }
+#endif
if (dl_bandwidth_enabled() && dl_policy(policy) &&
!(attr->sched_flags & SCHED_FLAG_SUGOV)) {
cpumask_t *span = rq->rd->span;
--
2.51.0