Message-ID: <aXOgY2LGO2-heSSo@jlelli-thinkpadt14gen4.remote.csb>
Date: Fri, 23 Jan 2026 17:22:59 +0100
From: Juri Lelli <juri.lelli@...hat.com>
To: Andrea Righi <arighi@...dia.com>
Cc: Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
	Vincent Guittot <vincent.guittot@...aro.org>,
	Dietmar Eggemann <dietmar.eggemann@....com>,
	Steven Rostedt <rostedt@...dmis.org>,
	Ben Segall <bsegall@...gle.com>, Mel Gorman <mgorman@...e.de>,
	Valentin Schneider <vschneid@...hat.com>, Tejun Heo <tj@...nel.org>,
	Joel Fernandes <joelagnelf@...dia.com>,
	David Vernet <void@...ifault.com>,
	Changwoo Min <changwoo@...lia.com>,
	Daniel Hodges <hodgesd@...a.com>, sched-ext@...ts.linux.dev,
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH v2] sched/deadline: Reset dl_server execution state on
 stop

Hello,

On 23/01/26 17:16, Andrea Righi wrote:
> dl_server_stop() can leave a deadline server in an inconsistent internal
> state across stop/start transitions, causing it to bypass its required
> deferral phase when restarted. This breaks the scheduler invariant that
> a restarted server must re-establish eligibility before being allowed to
> execute.
> 
> When the server is stopped (e.g., because the associated task blocks),
> it's expected to transition back to an inactive, initial state. However,
> dl_server_stop() does not fully reset the execution state. As a result,
> the server can be logically inactive while still appearing to be
> running.
> 
> When the server is restarted via dl_server_start(), the following
> sequence occurs:
>   1. dl_server_start() calls enqueue_dl_entity(ENQUEUE_WAKEUP),
>   2. enqueue_dl_entity() calls update_dl_entity(),
>   3. update_dl_entity() checks (!dl_se->dl_defer_running) to decide
>      whether to arm the deferral mechanism,
>   4. because dl_defer_running is stale, the check fails,
>   5. dl_defer_armed and dl_throttled are not set,
>   6. enqueue_dl_entity() skips start_dl_timer(), because
>      dl_throttled == 0,
>   7. the server is enqueued via __enqueue_dl_entity(),
>   8. the scheduler picks the server to run,
>   9. update_curr_dl_se() detects that the server has exhausted its
>      runtime (or has negative runtime), as it wasn't properly
>      replenished/deferred,
>  10. the server is throttled (dl_throttled set to 1) and dequeued,
>  11. the server repeatedly cycles through wakeup and throttling,
>      effectively receiving no usable CPU bandwidth.
> 
> This results in starvation of the tasks serviced by the deadline server
> in the presence of competing RT workloads.
> 
> This issue can be confirmed by adding debugging traces, which show that
> the server skips the deferral timer and is immediately throttled upon
> execution with negative runtime:
> 
>  DEBUG: dl_server_start: dl_defer_running=1 active=0
>  DEBUG: enqueue_dl_entity: flags=1 dl_throttled=0 dl_defer=1
>  DEBUG: update_dl_entity: dl_defer_running=1
>  DEBUG: enqueue_dl_entity: SKIPPING start_dl_timer! dl_throttled=0
>  ...
>  DEBUG: update_curr_dl_se: THROTTLED runtime=-954758
> 
> Fix this by properly resetting dl_defer_running in dl_server_stop(),
> ensuring the server correctly enters the defer phase upon restart.
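
For readers following along, the stop/start sequence and the effect of the
reset can be sketched with a small userspace toy model. This is not kernel
code: struct toy_server and the toy_*() helpers are hypothetical names made
up for illustration, and only the flag names (dl_defer_running,
dl_defer_armed, dl_throttled) are borrowed from the description above. It
assumes nothing about the real code beyond what the quoted steps state.

```c
#include <stdbool.h>

/* Toy model only: fields echo the flags named in the patch description,
 * this is NOT struct sched_dl_entity. */
struct toy_server {
	bool active;
	bool defer_running;
	bool defer_armed;
	bool throttled;
};

/* Mirrors steps 1-6: the deferral is armed only if defer_running is clear. */
static void toy_start(struct toy_server *s)
{
	if (!s->defer_running) {
		s->defer_armed = true;
		s->throttled = true;	/* a deferral timer would be started here */
	}
	s->active = true;
}

/* Assumption of the toy model: once the deferral ends, the server runs. */
static void toy_defer_timer_fired(struct toy_server *s)
{
	s->defer_armed = false;
	s->throttled = false;
	s->defer_running = true;
}

/* reset_on_stop == true models the proposed fix, false the stale state. */
static void toy_stop(struct toy_server *s, bool reset_on_stop)
{
	s->active = false;
	s->defer_armed = false;	/* toy simplification, not a kernel claim */
	s->throttled = false;
	if (reset_on_stop)
		s->defer_running = false;
}

/* Returns true if a restarted server enters the defer phase again. */
static bool toy_restart_defers(bool reset_on_stop)
{
	struct toy_server s = { 0 };

	toy_start(&s);			/* first start: deferral armed */
	toy_defer_timer_fired(&s);	/* server starts running */
	toy_stop(&s, reset_on_stop);
	toy_start(&s);			/* restart */
	return s.defer_armed && s.throttled;
}
```

In this model, toy_restart_defers(false) reports that the restarted server
skips the deferral, while toy_restart_defers(true) reports that it is
deferred again, which is the behavior the patch aims for.
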
> 
> This issue is quite difficult to observe when only the fair server
> is present, as the required stop/start patterns are relatively rare.
> However, it becomes easier to trigger with an additional deadline server
> with more frequent server lifecycle transitions (such as a sched_ext
> deadline server).
> 
> This change is a prerequisite for introducing a sched_ext deadline
> server, as it ensures correct and predictable behavior across server
> stop/start cycles.
> 
> Link: https://lore.kernel.org/all/aXEMat4IoNnGYgxw@gpd4/
> Signed-off-by: Andrea Righi <arighi@...dia.com>
> ---

Looks good to me!

Acked-by: Juri Lelli <juri.lelli@...hat.com>

Thanks,
Juri

