Message-ID: <aOdcIbsqwhJTdGjL@jlelli-thinkpadt14gen4.remote.csb>
Date: Thu, 9 Oct 2025 08:54:25 +0200
From: Juri Lelli <juri.lelli@...hat.com>
To: Yuri Andriaccio <yurand2000@...il.com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>,
Steven Rostedt <rostedt@...dmis.org>,
Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
Valentin Schneider <vschneid@...hat.com>,
linux-kernel@...r.kernel.org,
Luca Abeni <luca.abeni@...tannapisa.it>,
Yuri Andriaccio <yuri.andriaccio@...tannapisa.it>
Subject: Re: [RFC PATCH v3 12/24] sched/rt: Update task event callbacks for
HCBS scheduling

Hello,

On 29/09/25 11:22, Yuri Andriaccio wrote:
> Update wakeup_preempt_rt, switched_{from/to}_rt and prio_changed_rt with
> rt-cgroup's specific preemption rules.
> Add checks whether a rt-task can be attached or not to a rt-cgroup.
> Update task_is_throttled_rt for SCHED_CORE.
A little dry. :) This is telling the what; can you please also add the
why and the how?
> Co-developed-by: Alessio Balsini <a.balsini@...up.it>
> Signed-off-by: Alessio Balsini <a.balsini@...up.it>
> Co-developed-by: Andrea Parri <parri.andrea@...il.com>
> Signed-off-by: Andrea Parri <parri.andrea@...il.com>
> Co-developed-by: luca abeni <luca.abeni@...tannapisa.it>
> Signed-off-by: luca abeni <luca.abeni@...tannapisa.it>
> Signed-off-by: Yuri Andriaccio <yurand2000@...il.com>
> ---
> kernel/sched/core.c | 2 +-
> kernel/sched/rt.c | 88 ++++++++++++++++++++++++++++++++++++++---
> kernel/sched/syscalls.c | 13 ++++++
> 3 files changed, 96 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e5b4facee24..2cfbe3b7b17 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -9346,7 +9346,7 @@ static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
> goto scx_check;
>
> cgroup_taskset_for_each(task, css, tset) {
> - if (!sched_rt_can_attach(css_tg(css), task))
> + if (rt_task(task) && !sched_rt_can_attach(css_tg(css), task))
> return -EINVAL;
> }
> scx_check:
> diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
> index d9442f64c6b..ce114823fe7 100644
> --- a/kernel/sched/rt.c
> +++ b/kernel/sched/rt.c
> @@ -946,6 +946,50 @@ static void wakeup_preempt_rt(struct rq *rq, struct task_struct *p, int flags)
> {
> struct task_struct *donor = rq->donor;
>
> + if (!rt_group_sched_enabled())
Should we actually be using rt_group_sched_enabled() (instead of
ifdeffery) everywhere to check if group scheduling is enabled and
differentiate behavior?
> + goto no_group_sched;
> +
> + /*
> + * Preemption checks are different if the waking task and the current
> + * task are running on the global runqueue or in a cgroup.
> + * The following rules apply:
> + * - dl-tasks (and equally dl_servers) always preempt FIFO/RR tasks.
We already touched upon this and maybe we can leave it for later, but
the fact that dl_servers always preempt FIFO/RR is going to be a change
of behavior wrt legacy RT group scheduling. Legacy users might have
expectations that setting a high static priority means those tasks will
always be scheduled first, and this will break that. Maybe we want to
convince them to change their assumptions, but we need to at least
clearly document the new behavior (and why it is right/better - if it is
indeed :). Not sure if we can also find a way to be "back-compatible"
and if we want to do that.
> + * - if curr is inside a cgroup (i.e. run by a dl_server) and
> + * waking is not, do nothing.
> + * - if waking is inside a cgroup but not curr, always reschedule.
> + * - if they are both on the global runqueue, run the standard code.
> + * - if they are both in the same cgroup, check for tasks priorities.
> + * - if they are both in a cgroup, but not the same one, check whether
> + * the woken task's dl_server preempts the current's dl_server.
> + */
> + if (is_dl_group(rt_rq_of_se(&p->rt)) &&
> + is_dl_group(rt_rq_of_se(&rq->curr->rt))) {
> + struct sched_dl_entity *woken_dl_se, *curr_dl_se;
> +
> + woken_dl_se = dl_group_of(rt_rq_of_se(&p->rt));
> + curr_dl_se = dl_group_of(rt_rq_of_se(&rq->curr->rt));
> +
> + if (rt_rq_of_se(&p->rt)->tg == rt_rq_of_se(&rq->curr->rt)->tg) {
What about checking if woken_dl_se and curr_dl_se are the same dl_se?
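Something like the below is what I have in mind (untested, just
sketching the idea):

```c
	/* same dl_server: fall back to comparing task priorities */
	if (woken_dl_se == curr_dl_se) {
		if (p->prio < rq->curr->prio)
			resched_curr(rq);

		return;
	}
```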
> + if (p->prio < rq->curr->prio)
> + resched_curr(rq);
> +
> + return;
> + }
> +
> + if (dl_entity_preempt(woken_dl_se, curr_dl_se))
> + resched_curr(rq);
> +
> + return;
> +
> + } else if (is_dl_group(rt_rq_of_se(&p->rt))) {
> + resched_curr(rq);
> + return;
> +
> + } else if (is_dl_group(rt_rq_of_se(&rq->curr->rt))) {
> + return;
> + }
> +
> +no_group_sched:
> if (p->prio < donor->prio) {
> resched_curr(rq);
> return;
...
> @@ -1750,8 +1797,17 @@ static void switched_to_rt(struct rq *rq, struct task_struct *p)
> * then see if we can move to another run queue.
> */
> if (task_on_rq_queued(p)) {
> +
> +#ifndef CONFIG_RT_GROUP_SCHED
> if (p->nr_cpus_allowed > 1 && rq->rt.overloaded)
> rt_queue_push_tasks(rt_rq_of_se(&p->rt));
> +#else
> + if (rt_rq_of_se(&p->rt)->overloaded) {
Is this empty because intra-group migration is coming with some future
change? Even if that's the case, I believe this wants a comment
explaining why this branch is empty.
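Even something minimal would help, e.g. (just a suggestion):

```c
	if (rt_rq_of_se(&p->rt)->overloaded) {
		/* No push/pull across runqueues within a group (yet). */
	} else {
```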
> + } else {
> + if (p->prio < rq->curr->prio)
> + resched_curr(rq);
> + }
> +#endif
> if (p->prio < rq->donor->prio && cpu_online(cpu_of(rq)))
> resched_curr(rq);
> }
...
> @@ -1876,7 +1943,16 @@ static unsigned int get_rr_interval_rt(struct rq *rq, struct task_struct *task)
> #ifdef CONFIG_SCHED_CORE
> static int task_is_throttled_rt(struct task_struct *p, int cpu)
> {
> +#ifdef CONFIG_RT_GROUP_SCHED
> + struct rt_rq *rt_rq;
> +
> + rt_rq = task_group(p)->rt_rq[cpu];
> + WARN_ON(!rt_group_sched_enabled() && rt_rq->tg != &root_task_group);
> +
> + return dl_group_of(rt_rq)->dl_throttled;
> +#else
> return 0;
> +#endif
> }
> #endif /* CONFIG_SCHED_CORE */
>
> @@ -2131,7 +2207,7 @@ static int sched_rt_global_constraints(void)
> int sched_rt_can_attach(struct task_group *tg, struct task_struct *tsk)
The tsk argument is not going to be used anymore with this change,
please remove it. Also maybe rename the function to something closer to
what it actually checks for, e.g. [sched_]rt_group_has_runtime.
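Something along these lines is what I have in mind (untested sketch,
assuming the dl_bandwidth field introduced by this series):

```c
static inline bool rt_group_has_runtime(struct task_group *tg)
{
	return !rt_group_sched_enabled() ||
	       tg->dl_bandwidth.dl_runtime > 0;
}
```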
> {
> /* Don't accept real-time tasks when there is no way for them to run */
> - if (rt_group_sched_enabled() && rt_task(tsk) && tg->rt_bandwidth.rt_runtime == 0)
> + if (rt_group_sched_enabled() && tg->dl_bandwidth.dl_runtime == 0)
> return 0;
Thanks,
Juri