Message-ID: <5494e934-9024-4c39-831b-75ec938161a1@arm.com>
Date: Thu, 18 Sep 2025 16:46:57 +0200
From: Dietmar Eggemann <dietmar.eggemann@....com>
To: linux-kernel@...r.kernel.org, linux-tip-commits@...r.kernel.org
Cc: John Stultz <jstultz@...gle.com>,
 "Peter Zijlstra (Intel)" <peterz@...radead.org>, x86@...nel.org
Subject: Re: [tip: sched/urgent] sched/deadline: Fix dl_server getting stuck



On 18.09.25 08:56, tip-bot2 for Peter Zijlstra wrote:
> The following commit has been merged into the sched/urgent branch of tip:
> 
> Commit-ID:     077e1e2e0015e5ba6538d1c5299fb299a3a92d60
> Gitweb:        https://git.kernel.org/tip/077e1e2e0015e5ba6538d1c5299fb299a3a92d60
> Author:        Peter Zijlstra <peterz@...radead.org>
> AuthorDate:    Tue, 16 Sep 2025 23:02:41 +02:00
> Committer:     Peter Zijlstra <peterz@...radead.org>
> CommitterDate: Thu, 18 Sep 2025 08:50:05 +02:00
> 
> sched/deadline: Fix dl_server getting stuck
> 
> John found it was easy to hit lockup warnings when running locktorture
> on a 2 CPU VM, which he bisected down to: commit cccb45d7c429
> ("sched/deadline: Less agressive dl_server handling").
> 
> While debugging it seems there is a chance where we end up with the
> dl_server dequeued, with dl_se->dl_server_active. This causes
> dl_server_start() to return without enqueueing the dl_server, thus it
> fails to run when RT tasks starve the cpu.
> 
> When this happens, dl_server_timer() catches the
> '!dl_se->server_has_tasks(dl_se)' case, which then calls
> replenish_dl_entity() and dl_server_stopped() and finally return
> HRTIMER_NO_RESTART.
> 
> This ends in no new timer and also no enqueue, leaving the dl_server
> 'dead', allowing starvation.
> 
> What should have happened is for the bandwidth timer to start the
> zero-laxity timer, which in turn would enqueue the dl_server and cause
> dl_se->server_pick_task() to be called -- which will stop the
> dl_server if no fair tasks are observed for a whole period.
> 
> IOW, it is totally irrelevant if there are fair tasks at the moment of
> bandwidth refresh.
> 
> This removes all dl_se->server_has_tasks() users, so remove the whole
> thing.
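
For readers following along, the stuck state described above can be sketched as a small userspace toy model. This is only an illustration under stated assumptions: the toy_* names and the struct fields are hypothetical stand-ins for the kernel state named in the commit message, and none of this is the actual kernel/sched/deadline.c code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: every name here is hypothetical, not a kernel symbol. */
struct toy_dl_server {
	bool active;            /* stands in for dl_se->dl_server_active */
	bool enqueued;          /* server is on the runqueue */
	bool zero_laxity_armed; /* follow-up (zero-laxity) timer pending */
	bool has_fair_tasks;    /* what server_has_tasks() used to report */
};

/* As described for dl_server_start(): a no-op when already marked active. */
static void toy_server_start(struct toy_dl_server *s)
{
	if (s->active)
		return;
	s->active = true;
	s->enqueued = true;
}

/* Old bandwidth-timer behaviour: give up if there are no fair tasks now. */
static void toy_bw_timer_old(struct toy_dl_server *s)
{
	if (!s->has_fair_tasks)
		return;	/* no new timer and no enqueue */
	s->zero_laxity_armed = true;
}

/* Behaviour the commit message asks for: has_fair_tasks is irrelevant. */
static void toy_bw_timer_fixed(struct toy_dl_server *s)
{
	s->zero_laxity_armed = true;
}

/* The zero-laxity timer fires and enqueues the server. */
static void toy_zero_laxity_fire(struct toy_dl_server *s)
{
	if (!s->zero_laxity_armed)
		return;
	s->zero_laxity_armed = false;
	s->enqueued = true;
}

/* "Dead" server: marked active, but not enqueued and no timer pending. */
static bool toy_server_stuck(const struct toy_dl_server *s)
{
	return s->active && !s->enqueued && !s->zero_laxity_armed;
}

static void toy_demo(void)
{
	/* Start in the problematic state: dequeued, yet still marked active. */
	struct toy_dl_server s = { .active = true };

	toy_bw_timer_old(&s);      /* no fair tasks at refresh time */
	toy_zero_laxity_fire(&s);  /* nothing was armed, so nothing happens */
	toy_server_start(&s);      /* returns early because active is set */
	assert(toy_server_stuck(&s));

	toy_bw_timer_fixed(&s);    /* arms the follow-up timer regardless */
	toy_zero_laxity_fire(&s);
	assert(s.enqueued && !toy_server_stuck(&s));
}
```

In this model the old timer path leaves nothing that could ever enqueue the server again, while the fixed path always leads to an enqueue. The real code then relies on the pick path to stop a server that sees no fair tasks for a whole period, which the toy does not model.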

I see the same results as John when running his locktorture test: the
'BUG: workqueue lockup' is gone now.

I was just confused by these two remaining dl_server_has_tasks references:

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 73c7de26fa60..73d750292446 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -634,7 +634,6 @@ struct sched_rt_entity {
 #endif
 } __randomize_layout;
 
-typedef bool (*dl_server_has_tasks_f)(struct sched_dl_entity *);
 typedef struct task_struct *(*dl_server_pick_f)(struct sched_dl_entity *);
 
 struct sched_dl_entity {
@@ -728,9 +727,6 @@ struct sched_dl_entity {
         * dl_server_update().
         *
         * @rq the runqueue this server is for
-        *
-        * @server_has_tasks() returns true if @server_pick return a
-        * runnable task.
         */
        struct rq                       *rq;
        dl_server_pick_f                server_pick_task;

Can you still tweak the patch to get rid of them as well?

[...]