Message-ID: <97c8a989-08b1-4233-8f5a-cb8b582b6c02@amd.com>
Date: Fri, 1 Aug 2025 15:26:36 +0530
From: K Prateek Nayak <kprateek.nayak@....com>
To: Nam Cao <namcao@...utronix.de>
CC: Gabriele Monaco <gmonaco@...hat.com>, Steven Rostedt
<rostedt@...dmis.org>, Masami Hiramatsu <mhiramat@...nel.org>, Mathieu
Desnoyers <mathieu.desnoyers@...icios.com>,
<linux-trace-kernel@...r.kernel.org>, <linux-kernel@...r.kernel.org>, Ingo
Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>, Juri Lelli
<juri.lelli@...hat.com>, Vincent Guittot <vincent.guittot@...aro.org>,
Dietmar Eggemann <dietmar.eggemann@....com>, Ben Segall <bsegall@...gle.com>,
Mel Gorman <mgorman@...e.de>, Valentin Schneider <vschneid@...hat.com>
Subject: Re: [PATCH 4/5] sched: Add rt task enqueue/dequeue trace points
Hello Nam,

On 8/1/2025 12:59 PM, Nam Cao wrote:
> On Fri, Aug 01, 2025 at 09:12:08AM +0530, K Prateek Nayak wrote:
>> Just thinking out loud, putting this tracepoint here can lead to a
>> "dequeued -> dequeued" transition for fair task when they are in delayed
>> dequeue state.
>>
>> dequeue_task(p)
>> trace_dequeue_task_tp(p) # First time
>> dequeue_task_fair(p)
>> p->se.delayed = 1
>> ...
>> <sched_switch> # p is still delayed
>> ...
>> sched_setscheduler(p)
>> if (prev_class != next_class && p->se.sched_delayed)
>> dequeue_task(p, DEQUEUE_DELAYED);
>> trace_dequeue_task_tp(p) # Second time
>>
>> It is not an issue as such but it might come as a surprise if users are
>> expecting a behavior like below which would be the case for !fair task
>> currently (and for all tasks before v6.12):
>>
>> digraph state_automaton {
>> center = true;
>> size = "7,11";
>> {node [shape = plaintext, style=invis, label=""] "__init_enqueue_dequeue_cycle"};
>> {node [shape = ellipse] "enqueued"};
>> {node [shape = ellipse] "dequeued"};
>> "__init_enqueue_dequeue_cycle" -> "enqueued";
>> "__init_enqueue_dequeue_cycle" -> "dequeued";
>> "enqueued" [label = "enqueued", color = green3];
>> "enqueued" -> "dequeued" [ label = "dequeue_task" ];
>> "dequeued" [label = "dequeued", color = red];
>> "dequeued" -> "enqueued" [ label = "enqueue_task" ];
>> { rank = min ;
>> "__init_enqueue_dequeue_cycle";
>> "dequeued";
>> "enqueued";
>> }
>> }
>>
>>
>> Another:
>>
>> "dequeued" -> "dequeued" [ label = "dequeue_task" ];
>>
>> edge would be needed in that case for >= v6.12. It is probably nothing
>> and can be easily handled by the users if they run into it but just
>> putting it out there for the record in case you only want to consider a
>> complete dequeue as "dequeued". Feel free to ignore since I'm completely
>> out of my depth when it comes to the usage of RV in the field :)
>
> Ah, thanks for pointing this out. I do want to only consider complete
> dequeue as "dequeued".
>
> These tracepoints are not visible from userspace, and RV does not care
> about enqueue/dequeue of fair tasks at the moment, so it is not a problem
> for now. But as a precaution, I trust the below patch will do.

There are a few more cases with delayed dequeue:

1. A delayed task can be re-enqueued before it is completely dequeued,
   which leads to an enqueue -> enqueue transition if we don't trace the
   first dequeue (sketched below).

2. There are cases in set_user_nice() and __sched_setscheduler() where
   a delayed task is dequeued in delayed state and put back on the
   cfs_rq still in delayed state - an attempt to handle 1. can trip over
   this.
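
For 1., the sequence would roughly be (simplified):

  enqueue_task(p)
    trace_enqueue_task_tp(p)            # "enqueued"
  ...
  dequeue_task(p, DEQUEUE_SLEEP)
    dequeue_task_fair() returns false   # p->se.sched_delayed = 1,
                                        # no dequeue trace with your diff
  ...
  try_to_wake_up(p)
    enqueue_task(p, ENQUEUE_DELAYED)
      trace_enqueue_task_tp(p)          # "enqueued" again without an
                                        # intervening "dequeued"
      requeue_delayed_entity(&p->se)    # p->se.sched_delayed = 0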

The best I could think of is the below diff on top of your suggestion,
where a "delayed -> reenqueue" is skipped but case 2. is still captured,
in case one needs to inspect some properties from set_user_nice() /
__sched_setscheduler():

(only build-tested on top of the diff you had pasted)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9598984bee8d..1fc5a97bba6b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2071,7 +2071,8 @@ unsigned long get_wchan(struct task_struct *p)
void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
{
- trace_enqueue_task_tp(rq->cpu, p);
+ if (!p->se.sched_delayed || !move_entity(flags))
+ trace_enqueue_task_tp(rq->cpu, p);
if (!(flags & ENQUEUE_NOCLOCK))
update_rq_clock(rq);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b173a059315c..1e2a636d6804 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5444,7 +5444,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
* put back on, and if we advance min_vruntime, we'll be placed back
* further than we started -- i.e. we'll be penalized.
*/
- if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) != DEQUEUE_SAVE)
+ if (move_entity(flags))
update_min_vruntime(cfs_rq);
if (flags & DEQUEUE_DELAYED)
@@ -7054,6 +7054,7 @@ static int dequeue_entities(struct rq *rq, struct sched_entity *se, int flags)
/* Fix-up what dequeue_task_fair() skipped */
hrtick_update(rq);
+ trace_dequeue_task_tp(rq->cpu, p);
/*
* Fix-up what block_task() skipped.
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 7936d4333731..33897d35744a 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1196,19 +1196,6 @@ void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
dec_rt_group(rt_se, rt_rq);
}
-/*
- * Change rt_se->run_list location unless SAVE && !MOVE
- *
- * assumes ENQUEUE/DEQUEUE flags match
- */
-static inline bool move_entity(unsigned int flags)
-{
- if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) == DEQUEUE_SAVE)
- return false;
-
- return true;
-}
-
static void __delist_rt_entity(struct sched_rt_entity *rt_se, struct rt_prio_array *array)
{
list_del_init(&rt_se->run_list);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index d3f33d10c58c..37730cd834ba 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2361,6 +2361,20 @@ extern const u32 sched_prio_to_wmult[40];
#define RETRY_TASK ((void *)-1UL)
+/*
+ * Checks for a SAVE/RESTORE without MOVE. Returns false if
+ * SAVE and !MOVE.
+ *
+ * Assumes ENQUEUE/DEQUEUE flags match.
+ */
+static inline bool move_entity(unsigned int flags)
+{
+ if ((flags & (DEQUEUE_SAVE | DEQUEUE_MOVE)) == DEQUEUE_SAVE)
+ return false;
+
+ return true;
+}
+
struct affinity_context {
const struct cpumask *new_mask;
struct cpumask *user_mask;
---
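
With the above, the enqueue side should roughly behave as follows (based
on my reading of the flags involved, so please double check):

  # delayed task being woken up
  try_to_wake_up(p)                      # p->se.sched_delayed == 1
    enqueue_task(p, ENQUEUE_DELAYED)     # move_entity() -> trace skipped
      requeue_delayed_entity(&p->se)

  # delayed task changing nice value / policy
  set_user_nice(p)                       # p->se.sched_delayed == 1
    dequeue_task(p, DEQUEUE_SAVE)        # stays sched_delayed
    enqueue_task(p, ENQUEUE_RESTORE)     # !move_entity() -> trace fires
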
Thoughts?

--
Thanks and Regards,
Prateek

P.S. move_entity() probably needs a better name and can perhaps even be
simplified. I wrote out the table below just to convince myself it was
safe to reuse move_entity() in fair.c:
  flags contains   (SAVE | MOVE) == SAVE   (SAVE | MOVE) != SAVE
  neither          false                   true
  SAVE             true                    false
  MOVE             false                   true
  SAVE | MOVE      false                   true
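
Going by the table, something like the below should be equivalent if it
reads any better (untested):

	/* SAVE without MOVE is the only "don't move" case */
	static inline bool move_entity(unsigned int flags)
	{
		return !(flags & DEQUEUE_SAVE) || (flags & DEQUEUE_MOVE);
	}
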
>
> Nam
>
> diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> index c38f12f7f903..b50668052f99 100644
> --- a/include/trace/events/sched.h
> +++ b/include/trace/events/sched.h
> @@ -906,6 +906,14 @@ DECLARE_TRACE(dequeue_task_rt,
> TP_PROTO(int cpu, struct task_struct *task),
> TP_ARGS(cpu, task));
>
> +DECLARE_TRACE(enqueue_task,
> + TP_PROTO(int cpu, struct task_struct *task),
> + TP_ARGS(cpu, task));
> +
> +DECLARE_TRACE(dequeue_task,
> + TP_PROTO(int cpu, struct task_struct *task),
> + TP_ARGS(cpu, task));
> +
> #endif /* _TRACE_SCHED_H */
>
> /* This part must be outside protection */
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index b485e0639616..553c08a63395 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2077,6 +2077,8 @@ unsigned long get_wchan(struct task_struct *p)
>
> void enqueue_task(struct rq *rq, struct task_struct *p, int flags)
> {
> + trace_enqueue_task_tp(rq->cpu, p);
> +
> if (!(flags & ENQUEUE_NOCLOCK))
> update_rq_clock(rq);
>
> @@ -2119,7 +2121,11 @@ inline bool dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> * and mark the task ->sched_delayed.
> */
> uclamp_rq_dec(rq, p);
> - return p->sched_class->dequeue_task(rq, p, flags);
> + if (p->sched_class->dequeue_task(rq, p, flags)) {
> + trace_dequeue_task_tp(rq->cpu, p);
> + return true;
> + }
> + return false;
> }
>
> void activate_task(struct rq *rq, struct task_struct *p, int flags)