Message-ID: <5494e934-9024-4c39-831b-75ec938161a1@arm.com>
Date: Thu, 18 Sep 2025 16:46:57 +0200
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: linux-kernel@...r.kernel.org, linux-tip-commits@...r.kernel.org
Cc: John Stultz <jstultz@...gle.com>,
"Peter Zijlstra (Intel)" <peterz@...radead.org>, x86@...nel.org
Subject: Re: [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck
On 18.09.25 08:56, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/urgent branch of tip:
>
> Commit-ID: 077e1e2e0015e5ba6538d1c5299fb299a3a92d60
> Gitweb: https://git.kernel.org/tip/077e1e2e0015e5ba6538d1c5299fb299a3a92d60
> Author: Peter Zijlstra <peterz@...radead.org>
> AuthorDate: Tue, 16 Sep 2025 23:02:41 +02:00
> Committer: Peter Zijlstra <peterz@...radead.org>
> CommitterDate: Thu, 18 Sep 2025 08:50:05 +02:00
>
> sched/deadline: Fix dl_server getting stuck
>
> John found it was easy to hit lockup warnings when running locktorture
> on a 2 CPU VM, which he bisected down to: commit cccb45d7c429
> ("sched/deadline: Less agressive dl_server handling").
>
> While debugging it seems there is a chance where we end up with the
> dl_server dequeued, with dl_se->dl_server_active. This causes
> dl_server_start() to return without enqueueing the dl_server, thus it
> fails to run when RT tasks starve the cpu.
>
> When this happens, dl_server_timer() catches the
> '!dl_se->server_has_tasks(dl_se)' case, which then calls
> replenish_dl_entity() and dl_server_stopped() and finally return
> HRTIMER_NO_RESTART.
>
> This ends in no new timer and also no enqueue, leaving the dl_server
> 'dead', allowing starvation.
>
> What should have happened is for the bandwidth timer to start the
> zero-laxity timer, which in turn would enqueue the dl_server and cause
> dl_se->server_pick_task() to be called -- which will stop the
> dl_server if no fair tasks are observed for a whole period.
>
> IOW, it is totally irrelevant if there are fair tasks at the moment of
> bandwidth refresh.
>
> This removes all dl_se->server_has_tasks() users, so remove the whole
> thing.
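
To make the failure mode in the quoted commit message easier to follow, here is a minimal toy model in plain C. It is only a sketch of the description above, not kernel code: struct toy_server and all toy_* functions are hypothetical names, replenish_dl_entity() and dl_server_stopped() are deliberately not modelled, and the three booleans stand in for far more state than the real sched_dl_entity carries.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy state; NOT the kernel's struct sched_dl_entity. */
struct toy_server {
	bool active;		/* stands in for dl_se->dl_server_active */
	bool enqueued;		/* is the server queued on the runqueue? */
	bool timer_armed;	/* is any follow-up timer pending? */
};

/* Models the early return: an already "active" server is not enqueued. */
static void toy_server_start(struct toy_server *s)
{
	if (s->active)
		return;
	s->active = true;
	s->enqueued = true;
}

/* Old behaviour as described: with no fair tasks, bail out for good. */
static void toy_timer_old(struct toy_server *s, bool has_fair_tasks)
{
	s->timer_armed = false;		/* this timer has just fired */
	if (!has_fair_tasks)
		return;			/* no new timer, no enqueue */
	s->enqueued = true;
}

/* New behaviour as described: always hand over to a zero-laxity timer. */
static void toy_timer_new(struct toy_server *s)
{
	s->timer_armed = true;
}

/* The zero-laxity timer enqueues the server so it gets picked again. */
static void toy_zero_laxity_timer(struct toy_server *s)
{
	assert(s->timer_armed);
	s->timer_armed = false;
	s->enqueued = true;
}

/* "Dead" server: marked active, but nothing will ever run it again. */
static bool toy_is_stuck(const struct toy_server *s)
{
	return s->active && !s->enqueued && !s->timer_armed;
}
```

Starting from { .active = true, .enqueued = false, .timer_armed = true }, calling toy_timer_old(&s, false) followed by toy_server_start(&s) leaves toy_is_stuck(&s) true, whereas toy_timer_new(&s) followed by toy_zero_laxity_timer(&s) ends with the server enqueued.
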

I see the same results as John when running his locktorture test: the
'BUG: workqueue lockup' is gone now.

I just got confused by these two remaining dl_server_has_tasks references:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 73c7de26fa60..73d750292446 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -634,7 +634,6 @@ struct sched_rt_entity {
 #endif
 } __randomize_layout;
 
-typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
 typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
 
 struct sched_dl_entity {
@@ -728,9 +727,6 @@ struct sched_dl_entity {
 	 * dl_server_update().
 	 *
 	 * @rq the runqueue this server is for
-	 *
-	 * @server_has_tasks() returns true if @server_pick return a
-	 * runnable task.
 	 */
 	struct rq *rq;
 	dl_server_pick_f server_pick_task;

Can you still tweak the patch to get rid of them?

[...]