Message-ID: <880890e699117e02d984ba2bb391c63be5fd71e8.camel@codethink.co.uk>
Date: Wed, 18 Jun 2025 12:24:33 +0100
From: Marcel Ziswiler <marcel.ziswiler@...ethink.co.uk>
To: Juri Lelli <juri.lelli@...hat.com>
Cc: luca abeni <luca.abeni@...tannapisa.it>, linux-kernel@...r.kernel.org, 
 Ingo Molnar <mingo@...hat.com>, Peter Zijlstra <peterz@...radead.org>,
 Vineeth Pillai <vineeth@...byteword.org>
Subject: Re: SCHED_DEADLINE tasks missing their deadline with
 SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)

Hi Juri

On Tue, 2025-06-17 at 14:21 +0200, Juri Lelli wrote:
> On 02/06/25 16:59, Marcel Ziswiler wrote:
> > Hi Juri
> > 
> > On Thu, 2025-05-29 at 11:39 +0200, Juri Lelli wrote:
> 
> ...
> 
> > > It should help us to better understand your setup and possibly reproduce
> > > the problem you are seeing.
> 
> OK, it definitely took a while (apologies), but I think I managed to
> reproduce the issue you are seeing.

No need to apologize; I know how hard it can be to bring up random stuff in the Linux world :)

> I added SCHED_FLAG_RECLAIM support to rt-app [1], so it's easier for me
> to play with the taskset and got to the following two situations when
> running your coreX taskset on CPU1 of my system (since the issue is
> already reproducible, I think it's OK to ignore the other tasksets as
> they are running isolated on different CPUs anyway).
> 
> This is your coreX taskset, in which the last task is the misbehaving one
> that will run without/with RECLAIM in the test.
> 
> > sched_deadline = sched_period | sched_runtime | CP max run time (90% of sched_runtime) | utilisation | reclaim |
> > -- | -- | -- | -- | -- |
> >  5 ms  | 0.15 ms | 0.135 ms |  3.00% | no |
> > 10 ms  | 1.8 ms  | 1.62 ms  | 18.00% | no |
> > 10 ms  | 2.1 ms  | 1.89 ms  | 21.00% | no |
> > 14 ms  | 2.3 ms  | 2.07 ms  | 16.43% | no |
> > 50 ms  | 8.0 ms  | 7.20 ms  | 16.00% | no |
> > 10 ms  | 0.5 ms  | **1      |  5.00% | no |
> 
> Without reclaim everything looks good (apart from the 1st task, which I
> think suffers a bit from the granularity/precision of the rt-app runtime
> loop):
> 
> https://github.com/jlelli/misc/blob/main/deadline-no-reclaim.png

Yeah, granularity/precision is definitely a concern. We initially even started off with 1 ms sched_deadline =
sched_period for task 1, but neither of our test systems (amd64-based Intel NUCs and aarch64-based RADXA
ROCK5Bs) was able to handle that very well. So we opted to increase it to 5 ms, which is still rather stressful.

> The order is the same as above; the last task gets constantly throttled
> and does no harm to the rest.
> 
> With reclaim (only for the last, misbehaving task) we indeed seem to have a problem:
> 
> https://github.com/jlelli/misc/blob/main/deadline-reclaim.png
> 
> Essentially all other tasks are experiencing long wakeup delays that
> cause deadline misses. The misbehaving task seems to be able to almost
> monopolize the CPU. It is interesting to notice that, even though I left
> the max available bandwidth at 95%, the CPU is busy at 100%.

Yeah, pretty much completely overloaded.
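
As a quick cross-check of the coreX numbers quoted above, here is a minimal Python sketch that recomputes each task's utilisation (sched_runtime / sched_period), sums them and compares the total with the 95% max available bandwidth Juri mentions. It assumes, as a simplification suggested by Juri's remark, that with GRUB the reclaiming task should only be able to grow into the bandwidth left between the other tasks' reservations and that 95% limit. TASKSET, U_MAX, summarise() and pct() are hypothetical names used for illustration only.

```python
from fractions import Fraction

# coreX taskset from the table: (sched_runtime, sched_period) in ms,
# with sched_deadline == sched_period. The last entry is the task that
# runs without/with SCHED_FLAG_RECLAIM in the test.
TASKSET = [
    ("0.15", "5"),
    ("1.8", "10"),
    ("2.1", "10"),
    ("2.3", "14"),
    ("8.0", "50"),
    ("0.5", "10"),
]

# Assumption: max available deadline bandwidth left at 95% of the CPU.
U_MAX = Fraction(95, 100)


def utilisation(runtime_ms: str, period_ms: str) -> Fraction:
    """Reserved bandwidth of one task (runtime / period) as an exact fraction."""
    return Fraction(runtime_ms) / Fraction(period_ms)


def summarise(taskset, u_max):
    """Return per-task utilisations, their total, and a reclaim bound.

    The bound is a simplification (assumption): the reclaiming last task may
    only grow into whatever is left between the other reservations and u_max.
    """
    utils = [utilisation(r, p) for r, p in taskset]
    total = sum(utils, Fraction(0))
    reclaim_bound = u_max - (total - utils[-1])
    return utils, total, reclaim_bound


def pct(x: Fraction) -> str:
    """Format a fraction of the CPU as a percentage with two decimals."""
    return f"{float(x * 100):.2f}%"


if __name__ == "__main__":
    utils, total, bound = summarise(TASKSET, U_MAX)
    for (r, p), u in zip(TASKSET, utils):
        print(f"{r:>5} ms / {p:>2} ms -> {pct(u)}")
    print("total reserved:", pct(total), "admitted:", total <= U_MAX)
    print("upper bound for the reclaiming task:", pct(bound))
    print("expected idle time:", pct(1 - U_MAX))
```

With these numbers the reservations add up to about 79.43%, comfortably below 95%, and under the simplified bound the reclaiming task should top out at roughly 20.57% of the CPU with about 5% left idle, which is what makes the 100% busy CPU in deadline-reclaim.png stand out.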

> So, yeah, Luca, I think we have a problem. :-)
> 
> Will try to find more time soon and keep looking into this.

Thank you very much and just let me know if I can help in any way.

> Thanks,
> Juri
> 
> 1 - https://github.com/jlelli/rt-app/tree/reclaim

BTW: I will be talking at the OSS NA/ELC next week in Denver should any of you folks be around.

Cheers

Marcel
