Date: Tue, 9 Apr 2024 09:35:27 +0200
From: Tobias Huschle <huschle@...ux.ibm.com>
To: Vincent Guittot <vincent.guittot@...aro.org>
Cc: Luis Machado <luis.machado@....com>, linux-kernel@...r.kernel.org,
        mingo@...hat.com, peterz@...radead.org, juri.lelli@...hat.com,
        dietmar.eggemann@....com, rostedt@...dmis.org, bsegall@...gle.com,
        mgorman@...e.de, bristot@...hat.com, vschneid@...hat.com,
        sshegde@...ux.vnet.ibm.com, srikar@...ux.vnet.ibm.com,
        linuxppc-dev@...ts.ozlabs.org, nd <nd@....com>
Subject: Re: [RFC] sched/eevdf: sched feature to dismiss lag on wakeup

On Fri, Mar 22, 2024 at 06:02:05PM +0100, Vincent Guittot wrote:
> and then
>     se->vruntime = max_vruntime(se->vruntime, vruntime)
> 

First things first, I was wrong to assume a "boost" in the CFS code. So I
dug a bit deeper and tried to pinpoint what the difference between CFS and
EEVDF actually is. I found the following:

Let's assume we have two tasks taking turns on a single CPU.
Task 1 is always runnable.
Task 2 gets woken up by task 1 and goes back to sleep when it is done.
This means task 1 runs and wakes up task 2, task 2 runs and goes to sleep,
then task 1 runs again, and the cycle repeats.
Most of the time: runtime(task1) > runtime(task2)
Rare occasions:   runtime(task1) < runtime(task2)
So task 1 usually consumes more of its designated time slice before it gets
rescheduled by the wakeup of task 2 than task 2 consumes of its own. But
neither task ever consumes its full time slice. On the contrary, both run
for a low 5-digit number of ns or less.

So something like this:

task 1    |----------|    |---------|    |------...
task 2               |----|         |----|

This creates different behaviors under CFS and EEVDF:

### CFS ####################################

In CFS, the difference in runtimes means that task 2 cannot catch up with
task 1 vruntime-wise.

With every exchange between task 1 and task 2, task 2 falls further behind
in vruntime. Once a difference on the order of sysctl_sched_latency is
established, the difference remains stable due to the max handling in
place_entity.

Occasionally, task 2 may run longer than task 1. In those cases, it
will catch up slightly. But in the majority of cases, task 2 runs
shorter, thereby increasing the difference in vruntime.

This would explain why task 2 always gets scheduled immediately on wakeup.
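
To make the CFS case concrete, here is a toy model of the exchange. It is a
sketch, not kernel code, and rests on assumptions that are not in the
original message: both tasks have equal weight (so vruntime advances 1:1
with runtime), the queue's minimum vruntime is taken to be task 1's vruntime
at the moment of the wakeup, and the names toy_task, toy_place_on_wakeup,
toy_cfs_gap and the parameter thresh are made up for illustration.

```c
#include <stdint.h>

/* Toy model only: one vruntime per task, equal weights assumed. */
struct toy_task {
	int64_t vruntime;
};

static int64_t max_i64(int64_t a, int64_t b)
{
	return a > b ? a : b;
}

/*
 * Mimics the "max handling" described above: a waking task keeps its own
 * vruntime unless that is more than `thresh` behind the queue minimum.
 */
static void toy_place_on_wakeup(struct toy_task *se, int64_t min_vruntime,
				int64_t thresh)
{
	se->vruntime = max_i64(se->vruntime, min_vruntime - thresh);
}

/*
 * Runs `rounds` exchanges (task 1 runs, wakes task 2, task 2 runs, sleeps)
 * and returns vruntime(task 1) - vruntime(task 2) afterwards.
 */
static int64_t toy_cfs_gap(int rounds, int64_t run1, int64_t run2,
			   int64_t thresh)
{
	struct toy_task t1 = { 0 }, t2 = { 0 };

	for (int i = 0; i < rounds; i++) {
		t1.vruntime += run1;	/* task 1 runs, then wakes task 2 */
		toy_place_on_wakeup(&t2, t1.vruntime, thresh);
		t2.vruntime += run2;	/* task 2 runs, then sleeps */
	}
	return t1.vruntime - t2.vruntime;
}
```

With run1 = 30000 ns, run2 = 10000 ns and thresh = 3000000 ns, the gap grows
by 20000 ns per exchange and then stays at thresh - run2 once the max
handling kicks in, so in this model task 2 always wakes up with the smaller
vruntime.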

### EEVDF ##################################

The rare occasions where task 2 runs longer than task 1 seem to cause 
issues with EEVDF:

In the regular case, where task 1 runs longer than task 2, task 2 gets
a positive lag and is selected on wakeup --> good.
In the irregular case, where task 2 runs longer than task 1, task 2
gets a negative lag and is no longer chosen on wakeup --> bad (in some cases).

This would explain why task 2 occasionally does not get selected on wakeup.
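
The sign argument can be illustrated with the lag formula lag = avg_vruntime
- vruntime. The sketch below assumes two equal-weight tasks that start a
round at the same vruntime, uses a plain mean where the kernel uses a
load-weighted average, and ignores how a waking entity is re-placed. The
toy_* helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Equal-weight stand-in for the load-weighted average used by the kernel. */
static int64_t toy_avg_vruntime(int64_t v1, int64_t v2)
{
	return (v1 + v2) / 2;
}

/* Lag of an entity: how far its vruntime is behind the average. */
static int64_t toy_lag(int64_t avg, int64_t vruntime)
{
	return avg - vruntime;
}

/* An entity with non-negative lag is considered eligible to run. */
static bool toy_eligible(int64_t lag)
{
	return lag >= 0;
}
```

If task 1 ran 30000 ns and task 2 ran 10000 ns, the average is 20000 and
task 2 ends up with a lag of +10000 (eligible). With the runtimes swapped,
task 2 ends up with a lag of -10000 (not eligible).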

### Summary ################################

So my wording, that a woken-up task gets "boosted", was obviously wrong.
Task 2 is not getting boosted in CFS; it gets "outrun" by task 1, with
no chance of catching up, which leaves it with a smaller vruntime value.

EEVDF, on the other hand, does not allow lag to accumulate if an entity,
like task 2 in this case, regularly dequeues itself. So its lag will always
have an upper bound of whatever difference in runtime it encountered in
comparison to task 1.

The patch below allows tasks to accumulate lag over time. This fixes the
original regression that made me stumble into this topic. But this might
of course come with arbitrary side effects.

I'm not suggesting that this actually be implemented, but I would like to
confirm whether my understanding is correct that this is the aspect where
CFS and EEVDF differ, with CFS being more aware of the past in this
particular case than EEVDF is.

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 03be0d1330a6..b83a72311d2a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -701,7 +701,7 @@ static void update_entity_lag(struct cfs_rq *cfs_rq, struct sched_entity *se)
        s64 lag, limit;
 
        SCHED_WARN_ON(!se->on_rq);
-       lag = avg_vruntime(cfs_rq) - se->vruntime;
+       lag = se->vlag + avg_vruntime(cfs_rq) - se->vruntime;
 
        limit = calc_delta_fair(max_t(u64, 2*se->slice, TICK_NSEC), se);
        se->vlag = clamp(lag, -limit, limit);
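
As a rough illustration of what the one-line change does to the stored lag,
the following user-space sketch replays a sequence of per-dequeue
differences through both rules. It treats each avg_vruntime - vruntime
difference as an independent input, which simplifies what the scheduler
actually does, and the toy_* names and the limit values are assumptions made
up for this example.

```c
#include <stdint.h>

static int64_t clamp_i64(int64_t v, int64_t lo, int64_t hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Original rule: the previous vlag is ignored, lag is recomputed each time. */
static int64_t toy_vlag_original(int64_t vlag, int64_t avg, int64_t vruntime,
				 int64_t limit)
{
	(void)vlag;
	return clamp_i64(avg - vruntime, -limit, limit);
}

/* Patched rule: the previous vlag is carried into the new value. */
static int64_t toy_vlag_patched(int64_t vlag, int64_t avg, int64_t vruntime,
				int64_t limit)
{
	return clamp_i64(vlag + avg - vruntime, -limit, limit);
}

/* Replays n differences (avg - vruntime) and returns the final vlag. */
static int64_t toy_replay(const int64_t *diffs, int n, int64_t limit,
			  int patched)
{
	int64_t vlag = 0;

	for (int i = 0; i < n; i++)
		vlag = patched ? toy_vlag_patched(vlag, diffs[i], 0, limit)
			       : toy_vlag_original(vlag, diffs[i], 0, limit);
	return vlag;
}
```

For the differences +10000, +10000, +10000, -5000 and a large limit, the
original rule ends at -5000 because only the last sample counts, while the
patched rule ends at +25000, so a single irregular round no longer flips the
sign in this model.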
