Message-ID: <5c2f8c8b04e1e36d721c1f90f39c02dd5d971580.camel@codethink.co.uk>
Date: Sat, 03 May 2025 15:14:50 +0200
From: Marcel Ziswiler <marcel.ziswiler@...ethink.co.uk>
To: luca abeni <luca.abeni@...tannapisa.it>, Juri Lelli
 <juri.lelli@...hat.com>
Cc: linux-kernel@...r.kernel.org, Ingo Molnar <mingo@...hat.com>, Peter
 Zijlstra <peterz@...radead.org>, Vineeth Pillai <vineeth@...byteword.org>
Subject: Re: SCHED_DEADLINE tasks missing their deadline with
 SCHED_FLAG_RECLAIM jobs in the mix (using GRUB)

Hi Luca

On Fri, 2025-05-02 at 16:10 +0200, luca abeni wrote:
> Hi all,
> 
> On Fri, 2 May 2025 15:55:42 +0200
> Juri Lelli <juri.lelli@...hat.com> wrote:
> 
> > Hi Marcel,
> > 
> > On 28/04/25 20:04, Marcel Ziswiler wrote:
> > > Hi
> > > 
> > > As part of our trustable work [1], we also run a lot of real time
> > > scheduler (SCHED_DEADLINE) tests on the mainline Linux kernel.
> > > Overall, the Linux scheduler proves quite capable of scheduling
> > > deadline tasks down to a granularity of 5ms on both of our test
> > > systems (amd64-based Intel NUCs and aarch64-based RADXA ROCK5Bs).
> > > However, recently, we noticed a lot of deadline misses if we
> > > introduce overrunning jobs with reclaim mode enabled
> > > (SCHED_FLAG_RECLAIM) using GRUB (Greedy Reclamation of Unused
> > > Bandwidth). E.g. from hundreds of millions of test runs over the
> > > course of a full week where we usually see absolutely zero deadline
> > > misses, we see 43 million deadline misses on NUC and 600 thousand
> > > on ROCK5B (which also has double the CPU cores). This is with
> > > otherwise exactly the same test configuration, which adds exactly
> > > the same two overrunning jobs to the job mix, but once without
> > > reclaim enabled and once with reclaim enabled.
> > > 
> > > We are wondering whether there are any known limitations to GRUB or
> > > what exactly could be the issue.
> > > 
> > > We are happy to provide more detailed debugging information but are
> > > looking for suggestions how/what exactly to look at.  
> > 
> > Could you add details of the taskset you are working with? The number
> > of tasks, their reservation parameters (runtime, period, deadline)
> > and how much they are running (or trying to run) each time they wake
> > up. Also which one is using GRUB and which one maybe is not.
> > 
> > Adding Luca in Cc so he can also take a look.
> 
> Thanks for cc-ing me, Jury! 
> 
> Marcel, are your tests on a multi-core machine with global scheduling?
> If yes, we should check if the taskset is schedulable.

Yes, as previously mentioned, we run all our tests on multi-core machines. I am not sure, though, what exactly
you mean by "global scheduling". Do you mean Global Earliest Deadline First (GEDF)? I assume that is what
SCHED_DEADLINE uses, is it not?

Concerning schedulability, the taskset does not fail to schedule outright. Remember, from hundreds of millions
of test runs over the course of a full week, we usually see absolutely zero deadline misses without reclaim.
With that one rogue process set to reclaim, we see 43 million deadline misses on NUC and 600 thousand on
ROCK5B (which also has double the CPU cores).
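
For reference, one way to check schedulability under global EDF is the density-based GFB bound: a taskset is
schedulable on m CPUs if the sum of task densities does not exceed m - (m - 1) * (largest density). This is
a sufficient test only, so failing it does not prove that deadlines will be missed, and it does not model
GRUB reclaiming. Below is a minimal sketch under those assumptions. The DlTask type, the gfb_schedulable()
helper and the task parameters in the usage note are hypothetical and are not our actual test configuration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class DlTask:
    """Hypothetical container for SCHED_DEADLINE reservation parameters (nanoseconds)."""
    runtime_ns: int
    deadline_ns: int
    period_ns: int

    @property
    def density(self) -> float:
        # Density equals utilization (runtime / period) when deadline == period.
        return self.runtime_ns / min(self.deadline_ns, self.period_ns)


def gfb_schedulable(tasks: Sequence[DlTask], cpus: int) -> bool:
    """Sufficient (not necessary) global EDF test: sum(d_i) <= m - (m - 1) * max(d_i)."""
    densities = [t.density for t in tasks]
    if not densities:
        return True  # an empty taskset is trivially schedulable
    return sum(densities) <= cpus - (cpus - 1) * max(densities)
```

With made-up numbers, three tasks of 1 ms every 5 ms plus one task of 4 ms every 5 ms pass the test on 4 CPUs
(total density 1.4 against a bound of 1.6), whereas two tasks of density 0.9 fail it on 2 CPUs (1.8 against
1.1).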

Please let me know if you need any further details that may help figure out what exactly is going on.

> 			Thanks,

Thank you!

> 				Luca

Cheers

Marcel
