Message-ID: <CAKfTPtD1KXMF9Ak6r2XDrZqAM8kkTGbQ0qsfZJVjq_N_Yj6jBQ@mail.gmail.com>
Date: Mon, 23 Jun 2025 18:27:39 +0200
From: Vincent Guittot <vincent.guittot@...aro.org>
To: Peter Zijlstra <peterz@...radead.org>
Cc: mingo@...hat.com, juri.lelli@...hat.com, dietmar.eggemann@....com, 
	rostedt@...dmis.org, bsegall@...gle.com, mgorman@...e.de, vschneid@...hat.com, 
	linux-kernel@...r.kernel.org
Subject: Re: [PATCH 0/4] sched/fair: Manage lag and run to parity with
 different slices

On Mon, 23 Jun 2025 at 13:16, Peter Zijlstra <peterz@...radead.org> wrote:
>
> On Fri, Jun 20, 2025 at 12:29:27PM +0200, Vincent Guittot wrote:
>
> > yes but at this point any waking up task is either the next running
> > task or enqueued in the rb tree
>
> The scenario I was thinking of was something like:
>
> A (long slice)
> B (short slice)
> C (short slice)
>
>   A wakes up and goes running
>
> Since A is the only task around, it gets normal protection
>
>   B wakes up and doesn't win
>
> So now we have A running with long protection and short task on-rq
>
>   C wakes up ...
>
> Whereas what we would've wanted to end up with for C is A running with
> short protection.

I will look at this case more deeply. We might want to update the
slice protection with the new min slice even if B doesn't preempt A;
that's part of the smarter check_preempt_wakeup_fair that I mentioned
below.
If B's deadline is not before A's, we don't need to update the
protection, because the remaining protection duration is already
shorter than the new slice.
If B is not eligible and is already on the CPU, B is already enqueued
(delayed dequeue), so its short slice is already accounted for when
setting the protection. I still have to look at B being migrated from
another CPU with negative lag.
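
To make that concrete, here is a tiny standalone model (not kernel
code; the helper name and the numbers are only illustrative) of
clamping the running task's protection window to the waking task's
shorter slice even when the wakeup does not preempt:

#include <stdio.h>

typedef unsigned long long u64;

/* Protection ends at the earlier of its current end and now + min_slice. */
static u64 clamp_protection(u64 protect_end, u64 now, u64 min_slice)
{
	return protect_end < now + min_slice ? protect_end : now + min_slice;
}

int main(void)
{
	u64 long_slice = 3000, short_slice = 700;	/* us, arbitrary values */
	u64 now = 0;
	u64 protect_end = now + long_slice;	/* A wakes up alone: long protection */

	now = 500;	/* B (short slice) wakes and does not win the preemption check */
	protect_end = clamp_protection(protect_end, now, short_slice);

	/* When C wakes up later, A is only protected until t=1200, not t=3000. */
	printf("A protected until t=%llu\n", protect_end);
	return 0;
}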

>
> > > Which is why I approached it by moving the protection to after pick;
> > > because then we can directly compare the task we're running to the
> > > best pick -- which includes the tasks that got woken. This gives
> > > check_preempt_wakeup_fair() better chances.
> >
> > we don't always want to break run to parity, only when a task
> > wakes up and should preempt current or shorten the run-to-parity
> > period. Otherwise, the protection applies for a duration that is
> > short enough to stay fair to others
> >
> > I will see if check_preempt_wakeup_fair can be smarter when deciding
> > to cancel the protection
>
> Thanks. In the above scenario B getting selected when C wakes up would
> be a clue I suppose :-)

Yes, fixing this comment:
* Note that even if @p does not turn out to be the most eligible
* task at this moment, current's slice protection will be lost.
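
For illustration only, a rough userspace model of that smarter
decision (the eligibility check and the field names below are
simplified stand-ins, not the actual fair.c code): cancel current's
protection only when the waking task should really preempt, otherwise
just shorten the remaining protection to the shorter slice.

#include <stdbool.h>
#include <stdio.h>

typedef long long s64;

struct entity {
	s64 vruntime;
	s64 deadline;
	s64 slice;
};

/* Simplified eligibility: vruntime not past the (approximated) average. */
static bool eligible(const struct entity *e, s64 avg_vruntime)
{
	return e->vruntime <= avg_vruntime;
}

static void wakeup_adjust(const struct entity *curr, const struct entity *p,
			  s64 avg_vruntime, s64 *protect_left)
{
	if (p->slice >= curr->slice)
		return;					/* nothing to shorten */

	if (eligible(p, avg_vruntime) && p->deadline < curr->deadline)
		*protect_left = 0;			/* preempt: protection is lost */
	else if (*protect_left > p->slice)
		*protect_left = p->slice;		/* keep running, shorter window */
}

int main(void)
{
	struct entity A = { .vruntime = 0, .deadline = 3000, .slice = 3000 };
	struct entity B = { .vruntime = 100, .deadline = 800, .slice = 700 };
	s64 protect_left = 3000;

	wakeup_adjust(&A, &B, /* avg_vruntime */ 50, &protect_left);
	printf("protection left: %lld\n", protect_left);	/* B not eligible -> 700 */
	return 0;
}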

>
> > > To be fair, I did not get around to testing the patches much beyond
> > > booting them, so quite possibly they're buggered :-/
> > >
> > > > Also, my patchset takes into account the NO_RUN_TO_PARITY case by
> > > > adding a notion of quantum execution time, which was missing until now
> > >
> > > Right; not ideal, but I suppose for the people that disable
> > > RUN_TO_PARITY it might make sense. But perhaps there should be a little
> > > more justification for why we bother tweaking a non-default option.
> >
> > Otherwise, disabling RUN_TO_PARITY to check whether it's the root cause
> > of a regression or a problem becomes pointless, because the behavior
> > without the feature is wrong.
>
> Fair enough.
>
> > And some might not want run to parity but rather behave closer to the
> > white paper, with a pick after each quantum, the quantum being
> > something in the range [0.7ms:2*tick)
> >
> > >
> > > The problem with usage of normalized_sysctl_ values is that you then get
> > > behavioural differences between 1 and 8 CPUs or so. Also, perhaps its
> >
> > normalized_sysctl_ values don't scale with the number of CPUs. In this
> > case, it's always 0.7ms, which is short enough compared to the 1ms tick
> > period to prevent default irq accounting from keeping current for
> > another tick
>
> Right; but it not scaling means it is the full slice on UP, half the
> slice on SMP-4 and a third for SMP-8 and up or somesuch.

Yes, the goal is to implement a kind of time quantum which doesn't
scale with the number of CPUs, unlike the default slice duration.
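
As a toy illustration of that difference (the scaling factor below is
a stand-in for the kernel's log-style tunable scaling, not the exact
code):

#include <stdio.h>

/* Stand-in for the CPU-count scaling applied to the sysctl slice values. */
static unsigned long long cpu_factor(unsigned int nr_cpus)
{
	unsigned long long factor = 1;

	while (nr_cpus > 1) {
		factor++;
		nr_cpus >>= 1;
	}
	return factor;
}

int main(void)
{
	unsigned long long base = 700 * 1000ULL;	/* 0.7ms normalized value */
	unsigned int cpus;

	for (cpus = 1; cpus <= 8; cpus *= 2)
		printf("%u CPUs: scaled slice = %llu ns, quantum = %llu ns\n",
		       cpus, cpu_factor(cpus) * base, base);
	return 0;
}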

>
> It probably doesn't matter much, but it's weird.
