Message-ID: <aFFdseGAqImLtVCH@jlelli-thinkpadt14gen4.remote.csb>
Date: Tue, 17 Jun 2025 14:21:05 +0200
From: Juri Lelli <juri.lelli@...hat.com>
To: Marcel Ziswiler <marcel.ziswiler@...ethink.co.uk>
Cc: luca abeni <luca.abeni@...tannapisa.it>, linux-kernel@...r.kernel.org,
	Ingo Molnar <mingo@...hat.com>,
	Peter Zijlstra <peterz@...radead.org>,
	Vineeth Pillai <vineeth@...byteword.org>
Subject: Re: SCHED_DEADLINE tasks missing their deadline with
 SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)

On 02/06/25 16:59, Marcel Ziswiler wrote:
> Hi Juri
> 
> On Thu, 2025-05-29 at 11:39 +0200, Juri Lelli wrote:

...

> > It should help us to better understand your setup and possibly reproduce
> > the problem you are seeing.

OK, it definitely took a while (apologies), but I think I managed to
reproduce the issue you are seeing.

I added SCHED_FLAG_RECLAIM support to rt-app [1], so it's easier for me
to play with the taskset and got to the following two situations when
running your coreX taskset on CPU1 of my system (since the issue is
already reproducible, I think it's OK to ignore the other tasksets as
they are running isolated on different CPUs anyway).

This is your coreX taskset, in which the last task is the badly behaving one
that runs first without and then with RECLAIM in the test.

|sched_deadline = sched_period | sched_runtime | CP max run time 90% of sched_runtime | utilisation | reclaim |
| -- | -- | -- | -- | -- |
|  5 ms  | 0.15 ms | 0.135 ms |  3.00% | no |
| 10 ms  | 1.8 ms  | 1.62 ms  | 18.00% | no |
| 10 ms  | 2.1 ms  | 1.89 ms  | 21.00% | no |
| 14 ms  | 2.3 ms  | 2.07 ms  | 16.43% | no |
| 50 ms  | 8.0 ms  | 7.20 ms  | 16.00% | no |
| 10 ms  | 0.5 ms  | **1      |  5.00% | no |
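
As a quick sanity check of the utilisation column, the figures can be recomputed as sched_runtime / sched_period. The sketch below is only an illustration: the `TASKSET` list and the helper names are made up here, and the values are copied from the table above.

```python
# (sched_period_ms, sched_runtime_ms) pairs copied from the table above.
# TASKSET and the helper names are hypothetical, used only for illustration.
TASKSET = [
    (5, 0.15),
    (10, 1.8),
    (10, 2.1),
    (14, 2.3),
    (50, 8.0),
    (10, 0.5),  # the badly behaving task
]


def utilisation(period_ms, runtime_ms):
    """Reserved bandwidth of one task: runtime divided by period."""
    return runtime_ms / period_ms


def total_utilisation(taskset):
    """Sum of the reserved bandwidth of every task in the set."""
    return sum(utilisation(p, r) for p, r in taskset)


if __name__ == "__main__":
    for period, runtime in TASKSET:
        print(f"{period:>3} ms  {runtime:>5} ms  {utilisation(period, runtime):7.2%}")
    print(f"total: {total_utilisation(TASKSET):.2%}")
```

The reservations add up to roughly 79.4%, which is below the 95% maximum bandwidth mentioned further down, so the taskset is admissible on a single CPU.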

Without reclaim everything looks good (apart from the 1st task, which I
think suffers a bit from the granularity/precision of the rt-app runtime
loop):

https://github.com/jlelli/misc/blob/main/deadline-no-reclaim.png

Order is the same as above; the last task gets constantly throttled and
does no harm to the rest.

With reclaim (enabled only on the last, misbehaving task) we indeed seem to have a problem:

https://github.com/jlelli/misc/blob/main/deadline-reclaim.png

Essentially all other tasks are experiencing long wakeup delays that
cause deadline misses. The badly behaving task seems to be able to almost
monopolize the CPU. Interestingly, even though I left the max available
bandwidth at 95%, the CPU is busy at 100%.
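
To put the 100% observation in numbers, here is a rough sketch. It rests on two assumptions that are not verified against the kernel code: that a reclaiming task should only be able to pick up bandwidth left unused below the 95% cap, and that the five well-behaved tasks really consume 90% of their sched_runtime, as in the table above. All names in the snippet are hypothetical.

```python
# Assumption: the maximum bandwidth usable by deadline tasks is 95%.
MAX_BW = 0.95

# (sched_period_ms, sched_runtime_ms) of the five well-behaved tasks.
WELL_BEHAVED = [(5, 0.15), (10, 1.8), (10, 2.1), (14, 2.3), (50, 8.0)]

# Assumption: each well-behaved task runs for 90% of its sched_runtime.
ACTUAL_FRACTION = 0.90


def reserved(taskset):
    """Bandwidth reserved by the taskset (sum of runtime / period)."""
    return sum(runtime / period for period, runtime in taskset)


def consumed(taskset, fraction):
    """Bandwidth actually used if every task runs `fraction` of its runtime."""
    return fraction * reserved(taskset)


def reclaim_ceiling(max_bw, taskset, fraction):
    """Most a single reclaiming task could use while staying under max_bw."""
    return max_bw - consumed(taskset, fraction)


if __name__ == "__main__":
    print(f"reserved by well-behaved tasks: {reserved(WELL_BEHAVED):.2%}")
    print(f"consumed by well-behaved tasks: {consumed(WELL_BEHAVED, ACTUAL_FRACTION):.2%}")
    print(f"ceiling for the reclaiming task: "
          f"{reclaim_ceiling(MAX_BW, WELL_BEHAVED, ACTUAL_FRACTION):.2%}")
```

Under these assumptions the reclaiming task would top out at about 28% of the CPU and roughly 5% of the CPU would stay idle, which is not what the plot shows.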

So, yeah, Luca, I think we have a problem. :-)

Will try to find more time soon and keep looking into this.

Thanks,
Juri

[1] https://github.com/jlelli/rt-app/tree/reclaim

