Message-ID: <aFUqELdqM8VcyNCh@jlelli-thinkpadt14gen4.remote.csb>
Date: Fri, 20 Jun 2025 11:29:52 +0200
From: Juri Lelli <juri.lelli@...hat.com>
To: Marcel Ziswiler <marcel.ziswiler@...ethink.co.uk>
Cc: luca abeni <luca.abeni@...tannapisa.it>, linux-kernel@...r.kernel.org,
Ingo Molnar <mingo@...hat.com>,
Peter Zijlstra <peterz@...radead.org>,
Vineeth Pillai <vineeth@...byteword.org>
Subject: Re: SCHED_DEADLINE tasks missing their deadline with
SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)
On 18/06/25 12:24, Marcel Ziswiler wrote:
...
> Yeah, granularity/precision is definitely a concern. We initially even started off with 1 ms sched_deadline =
> sched_period for task 1 but neither of our test systems (amd64-based Intel NUCs and aarch64-based RADXA
> ROCK5Bs) was able to handle that very well. So we opted to increase it to 5 ms which is still rather stressful.
Ah, OK, even though I meant the granularity of the 'fake' runtime of the
tasks. In rt-app we simulate it by essentially reading the clock until
that much runtime has elapsed (or by performing floating point operations),
and in some cases it is not super tight.
For runtime enforcement (dl_runtime) and/or period/deadline (dl_{period,
deadline}), did you try enabling the HRTICK_DL sched feature? It is kind of
required for parameters under 1ms if one wants precise behavior.
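(With the usual debugfs layout that should be just
'echo HRTICK_DL > /sys/kernel/debug/sched/features', assuming debugfs is
mounted in the standard place.)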
> > Order is the same as above, the last task gets constantly throttled and
> > does no harm to the rest.
> >
> > With reclaim (only last misbehaving task) we indeed seem to have a problem:
> >
> > https://github.com/jlelli/misc/blob/main/deadline-reclaim.png
> >
> > Essentially all other tasks are experiencing long wakeup delays that
> > cause deadline misses. The bad behaving task seems to be able to almost
> > monopolize the CPU. Interesting to notice that, even though I left the max
> > available bandwidth at 95%, the CPU is busy at 100%.
>
> Yeah, pretty much completely overloaded.
>
> > So, yeah, Luca, I think we have a problem. :-)
> >
> > Will try to find more time soon and keep looking into this.
>
> Thank you very much and just let me know if I can help in any way.
I have been playing a little more with this and noticed (by chance) that
after writing a value to sched_rt_runtime_us (even the 950000 default)
this seems to 'work' - I don't see deadline misses anymore.
I have thus moved my attention to the GRUB-related per-cpu variables [1] and
noticed something that looks fishy with extra_bw: after boot, and w/o any
DEADLINE tasks around (other than dl_servers), all dl_rqs have different
values [2]. E.g.,
extra_bw : (u64)447170
extra_bw : (u64)604454
extra_bw : (u64)656882
extra_bw : (u64)691834
extra_bw : (u64)718048
extra_bw : (u64)739018
extra_bw : (u64)756494
extra_bw : (u64)771472
extra_bw : (u64)784578
extra_bw : (u64)796228
...
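(For readers of the raw numbers: these are fixed-point utilizations.
Assuming BW_SHIFT is still 20, as in kernel/sched/sched.h, they can be
turned into fractions of a CPU along these lines:

BW_SHIFT = 20           # assumed, see kernel/sched/sched.h
BW_UNIT = 1 << BW_SHIFT

def to_util(bw):
    # Convert a kernel bandwidth value (e.g. dl_rq->extra_bw) into a
    # fraction of one CPU.
    return bw / BW_UNIT

print(to_util(447170))  # ~0.43
print(to_util(796228))  # ~0.76
print(to_util(996147))  # ~0.95, i.e. the default 95% limit
)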
When we write a value to sched_rt_runtime_us, only the extra_bw of the first
CPU of a root_domain gets updated. So, this might be the reason why
things seem to improve with single-CPU domains like in the situation at
hand, but it is still probably broken in general. I think the issue here is
that we end up calling init_dl_rq_bw_ratio() only for the first CPU
after the introduction of the dl_bw_visited() functionality.
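In case it helps to double check this, a small helper in the same vein as
the script below (hypothetical, same assumptions, plus CONFIG_SMP so rq->rd
exists) that groups extra_bw by root_domain pointer:

def print_extra_bw_by_rd():
    runqueues = prog['runqueues']
    by_rd = {}
    for cpu_id in for_each_possible_cpu(prog):
        rq = per_cpu(runqueues, cpu_id)
        # Group CPUs by the root_domain they are attached to.
        by_rd.setdefault(rq.rd.value_(), []).append((cpu_id, int(rq.dl.extra_bw)))
    for rd, cpus in by_rd.items():
        print(f"root_domain {hex(rd)}:")
        for cpu_id, extra_bw in cpus:
            print(f"  cpu {cpu_id}: extra_bw {extra_bw}")

If the above analysis is right, after a write to sched_rt_runtime_us only
the first CPU listed for each root_domain should change.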
So, this might be one thing to look at, but I am honestly still confused
about why we have weird numbers like the above after boot. I am also a bit
confused by the actual meaning and purpose of the 5 GRUB variables we
have to deal with.
Luca, Vineeth (for the recent introduction of max_bw), maybe we could
take a step back and re-check (and maybe document better :) what
each variable is meant to do and how it gets updated?
Thanks!
Juri
1 - Starts at https://elixir.bootlin.com/linux/v6.16-rc2/source/kernel/sched/sched.h#L866
2 - The drgn script I am using
---
#!/usr/bin/env drgn
desc = """
This is a drgn script to show the current root domains configuration. For more
info on drgn, visit https://github.com/osandov/drgn.
"""
import os
import argparse
import drgn
from drgn import FaultError, NULL, Object, alignof, cast, container_of, execscript, implicit_convert, offsetof, reinterpret, sizeof, stack_trace
from drgn.helpers.common import *
from drgn.helpers.linux import *
def print_dl_bws_info():
    print("Retrieving dl_rq Information:")

    runqueues = prog['runqueues']

    for cpu_id in for_each_possible_cpu(prog):
        try:
            rq = per_cpu(runqueues, cpu_id)
            dl_rq = rq.dl

            print(f" From CPU: {cpu_id}")
            print(f"  running_bw : {dl_rq.running_bw}")
            print(f"  this_bw    : {dl_rq.this_bw}")
            print(f"  extra_bw   : {dl_rq.extra_bw}")
            print(f"  max_bw     : {dl_rq.max_bw}")
            print(f"  bw_ratio   : {dl_rq.bw_ratio}")
        except drgn.FaultError as fe:
            print(f" (CPU {cpu_id}: Fault accessing kernel memory: {fe})")
        except AttributeError as ae:
            print(f" (CPU {cpu_id}: Missing attribute for dl_rq (kernel struct change?): {ae})")
        except Exception as e:
            print(f" (CPU {cpu_id}: An unexpected error occurred: {e})")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=desc,
                                     formatter_class=argparse.RawTextHelpFormatter)
    args = parser.parse_args()

    print_dl_bws_info()
---