Date: Wed, 1 May 2024 12:51:51 +0200
From: Peter Zijlstra <peterz@...radead.org>
To: Tobias Huschle <huschle@...ux.ibm.com>
Cc: "Michael S. Tsirkin" <mst@...hat.com>,
	Luis Machado <luis.machado@....com>,
	Jason Wang <jasowang@...hat.com>,
	Abel Wu <wuyun.abel@...edance.com>,
	Linux Kernel <linux-kernel@...r.kernel.org>, kvm@...r.kernel.org,
	virtualization@...ts.linux.dev, netdev@...r.kernel.org,
	nd <nd@....com>, borntraeger@...ux.ibm.com,
	Ingo Molnar <mingo@...nel.org>,
	Mike Galbraith <umgwanakikbuti@...il.com>
Subject: Re: EEVDF/vhost regression (bisected to 86bfbb7ce4f6 sched/fair: Add
 lag based placement)

On Tue, Apr 30, 2024 at 12:50:05PM +0200, Tobias Huschle wrote:
> It took me a while, but I was able to figure out why EEVDF behaves
> differently than CFS does. I'm still waiting for some official confirmation
> of my assumptions but it all seems very plausible to me.
> 
> Leaving aside all the specifics of vhost and kworkers, a more general
> description of the scenario would be as follows:
> 
> Assume that we have two tasks taking turns on a single CPU. 
> Task 1 does something and wakes up Task 2.
> Task 2 does something and goes to sleep.
> And we're just repeating that.
> Task 1 and task 2 only run for very short amounts of time, i.e. much 
> shorter than a regular time slice (vhost = task1, kworker = task2).
> 
> Let's further assume that task 1 runs longer than task 2.
> In CFS, this means that the vruntime of task 1 starts to outrun the vruntime
> of task 2, i.e. vruntime(task2) < vruntime(task1). Hence, task 2
> always gets picked on wake up because it has the smaller vruntime. 
> In EEVDF, this would translate to a permanent positive lag, which also 
> causes task 2 to get consistently scheduled on wake up.
> 
> Let's now assume that, occasionally, task 2 runs a little bit longer than
> task 1. In CFS, this means that task 2 can close the vruntime gap a bit,
> but its vruntime can easily remain below that of task 1. Task 2 would
> still get picked on wake up.
> With EEVDF, in its current form, task 2 will now get a negative lag, which,
> in turn, will cause it not to be picked on the next wake up.

Right, so I've been working on changes where tasks will be able to
'earn' credit when sleeping. Specifically, keeping dequeued tasks on the
runqueue will allow them to burn off negative lag. Once they get picked
again they are guaranteed to have zero (or more) lag. If by that time
they've not been woken up again, they get dequeued with 0-lag.

(placement with 0-lag will ensure eligibility doesn't inhibit the pick,
but is not sufficient to ensure a pick)
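
A very rough sketch of that idea (hypothetical names and fields, not the
actual patches; the real bookkeeping is of course done against the
runqueue's average virtual time):

struct entity_sketch {
	long long lag;		/* negative: the entity owes service */
	int	  delayed;	/* blocked, but still accounted on the runqueue */
	int	  on_rq;
};

static void block_entity(struct entity_sketch *se)
{
	if (se->lag < 0) {
		se->delayed = 1;	/* keep it enqueued, burn off the lag */
		return;
	}
	se->on_rq = 0;			/* nothing owed, really dequeue */
}

static void virtual_time_advanced(struct entity_sketch *se, long long vdelta)
{
	if (se->delayed && (se->lag += vdelta) >= 0) {
		se->lag = 0;		/* not woken up in time: dequeue with 0-lag */
		se->on_rq = 0;
		se->delayed = 0;
	}
}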

However, this alone will not be sufficient to get the behaviour you
want. Notably, even at 0-lag the virtual deadline will still be after
the virtual deadline of the already running task -- assuming they have
equal request sizes.

That is, IIUC, you want your task 2 (kworker) to always preempt task 1
(vhost), right? So even if task 2 were to have 0-lag, placing it would
be something like:

t1      |---------<    
t2        |---------<
V    -----|-----------------------------

So t1 has started at | with a virtual deadline at <. Then a short
while later -- V will have advanced a little -- it wakes t2 with 0-lag,
but as you can observe, its virtual deadline will be later than t1's and
as such it will never get picked, even though they're both eligible.
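
To put numbers on the picture (made-up values, equal weights, using
vd_i = ve_i + r_i/w_i from the EEVDF formulation):

#include <stdio.h>

int main(void)
{
	long long slice = 3000;		/* request size r_i, in virtual time */

	/* t1 starts at V = 0 with 0-lag: ve = V, vd = ve + r/w */
	long long t1_ve = 0,   t1_vd = t1_ve + slice;

	/* a short while later (V has advanced to 500) t1 wakes t2, also 0-lag */
	long long t2_ve = 500, t2_vd = t2_ve + slice;

	/*
	 * Both are eligible (ve <= V at their placement), but t1 keeps the
	 * earlier virtual deadline, so t2 does not preempt it.
	 */
	printf("t1: ve=%lld vd=%lld\n", t1_ve, t1_vd);
	printf("t2: ve=%lld vd=%lld\n", t2_ve, t2_vd);
	return 0;
}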

> So, it seems we have a change in how far the two variants look
> into the past. CFS is willing to take more history into account, whereas
> EEVDF is not (update_entity_lag sets the lag value from scratch,
> and place_entity does not take the original vruntime into account).
>
> All of this can be seen as correct by design: a task consumes more time
> than the others, so it has to give way to them. The big difference
> now is that CFS allowed a task to collect some bonus by constantly using
> less CPU time than others and trading that time against occasionally taking
> more CPU time. EEVDF could do the same thing by allowing the accumulation
> of positive lag, which can then be traded against the one time the task
> would get negative lag. This might clash with other EEVDF assumptions though.

Right, so CFS was a pure virtual-runtime-based scheduler, while EEVDF
considers both virtual runtime (for eligibility, which ties to fairness)
and, primarily, virtual deadline (for timeliness).

If you want to make EEVDF force-pick a task by modifying vruntime, you
have to place it with lag > request (slice) such that the virtual
deadline of the newly placed task is before that of the already running
task, yielding both eligibility and the earliest deadline.

Consistently placing tasks with such large (positive) lag will affect
fairness though; they're basically always runnable, so barring external
throttling, they'll starve you.
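
For reference, the pick rule this is working against, in sketch form (toy
types, not the kernel implementation; eligibility is really computed
against the weighted average virtual time):

struct task_sketch {
	long long ve;	/* eligible time (vruntime) */
	long long vd;	/* virtual deadline = ve + request/weight */
};

/* eligible: has not received more than its fair share, i.e. ve <= V */
static int eligible(const struct task_sketch *t, long long V)
{
	return t->ve <= V;
}

/* EEVDF pick: earliest virtual deadline among the eligible tasks */
static const struct task_sketch *
pick(const struct task_sketch *a, const struct task_sketch *b, long long V)
{
	if (!eligible(a, V))
		return b;
	if (!eligible(b, V))
		return a;
	return (a->vd <= b->vd) ? a : b;
}

So to force the newly placed task ahead, ve_new + r_new/w_new has to land
before the running task's vd, i.e. a placement with lag larger than its
own request.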

> The patch below fixes the degradation, but it is not at all aligned with what
> EEVDF wants to achieve; it does, however, serve as an indicator that my
> hypothesis is correct.
> 
> So, what does this now mean for the vhost regression we were discussing?
> 
> 1. The behavior of the scheduler changed with regard to wake-up scenarios.
> 2. vhost in its current form relies on the way CFS works by assuming
>    that the kworker always gets scheduled.

How does it assume this? Also, this is a performance issue, not a
correctness issue, right?

> I would like to argue that it therefore makes sense to reconsider the vhost
> implementation to make it less dependent on the internals of the scheduler.

I think I'll propose the opposite :-) Many of the problems we have exist
because the scheduler simply doesn't know anything and we're playing a
mutual guessing game.

The trick is finding things to tell the scheduler that it can actually do
something with, though...

> As proposed earlier in this thread, I see two options:
> 
> 1. Do an explicit schedule() after every iteration across the vhost queues
> 2. Set the need_resched flag after writing to the socket that would trigger
>    eventfd and the underlying kworker

Neither of these options will get you what you want. Specifically, in the
example above, t1 doing an explicit reschedule will result in t1 being
picked.

> Both options would make sure that vhost gives up the CPU, as it cannot
> continue anyway without the kworker handling the event. Option 1 will give
> up the CPU regardless of whether something was found in the queues, whereas
> option 2 would only do so if something was.

Incorrect: neither schedule() nor marking things with TIF_NEED_RESCHED
(which has more issues) will make t2 run. In that scenario you have to
make t1 block, such that t2 is the only possible choice. As long as you
keep t1 on the runqueue, it will be the most eligible pick at that time.

Now, there is an easy option... but I hate to mention it because I've
spent a lifetime telling people not to use it (for really good reasons):
yield().

With EEVDF, yield() will move the virtual deadline ahead by one request.
That is, given the above scenario:

t1      |---------<    
t2        |---------<
V    -----|-----------------------------

t1 doing yield() would result in:

t1      |-------------------<    
t2        |---------<
V    -----|-----------------------------

And at that point, you'll find that all of a sudden t2 will be picked.
On the flip side, you might find that when t2 completes, another task is
more likely to run than a return to t1 -- because of that elongated
deadline. Of course, if t1 and t2 are the only tasks on the CPU this
doesn't matter.
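
As a toy illustration of the yield() effect (made-up numbers matching the
diagrams above, not kernel code):

#include <stdio.h>

struct task_sketch { long long ve, vd; };

int main(void)
{
	long long slice = 3000;
	struct task_sketch t1 = { .ve = 0,   .vd = 0   + slice };
	struct task_sketch t2 = { .ve = 500, .vd = 500 + slice };

	/* t1 calls yield(): its virtual deadline moves out by one request */
	t1.vd += slice;

	/* earliest-deadline-first among eligible tasks now favours t2 */
	printf("pick: %s\n", (t2.vd < t1.vd) ? "t2" : "t1");
	return 0;
}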

> It shall be noted that we encountered similar behavior when running some
> fio benchmarks. From a brief glance at the code, I saw similar
> intentions: loop over queues, then trigger an action through some event
> mechanism. Applying the same patch as mentioned above also fixes this issue.
> 
> It could be argued that this is still something that needs to be
> addressed by the scheduler, since it might affect others as well and there
> are in fact patches coming in. Will they address our issue here? Not sure yet.

> On the other hand, it might just be beneficial to make vhost more resilient
> towards the scheduler's algorithm by not relying on a certain behavior in
> the wakeup path.

So the 'advantage' of EEVDF over CFS is that it has two parameters to play
with: weight and slice, with slice being the new toy in town.

Specifically, in your example you would ideally have task 2 have a
shorter slice. Except of course it's a kworker, and you can't very well
give a kworker a short slice because you never know what it will end
up doing.
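
For a regular task you could, in principle, request a shorter slice from
userspace. A minimal sketch, assuming the proposed sched_attr::sched_runtime
"custom slice" interface from the EEVDF follow-up patches (which may not
exist in your kernel, and would not help for a kworker anyway):

#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>

struct sched_attr_sketch {		/* mirrors the uapi struct sched_attr */
	uint32_t size;
	uint32_t sched_policy;
	uint64_t sched_flags;
	int32_t  sched_nice;
	uint32_t sched_priority;
	uint64_t sched_runtime;		/* would carry the custom slice, in ns */
	uint64_t sched_deadline;
	uint64_t sched_period;
};

static int set_slice(pid_t pid, uint64_t slice_ns)
{
	struct sched_attr_sketch attr;

	memset(&attr, 0, sizeof(attr));
	attr.size          = sizeof(attr);
	attr.sched_policy  = 0;		/* SCHED_NORMAL */
	attr.sched_runtime = slice_ns;

	return syscall(SYS_sched_setattr, pid, &attr, 0);
}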

I'm still wondering why exactly it is imperative for t2 to preempt t1.
Is there some unexpressed serialization / spin-waiting?

