linux-kernel - Re: CFS flat runqueue proposal fixes/update

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <1eaeeb1d-96c7-4819-8e1e-8f190a650030@arm.com>
Date:   Fri, 7 Aug 2020 16:14:19 +0200
From:   Dietmar Eggemann <dietmar.eggemann@....com>
To:     Rik van Riel <riel@...riel.com>,
        Peter Zijlstra <peterz@...radead.org>
Cc:     Paul Turner <pjt@...gle.com>,
        "vincent.guittot" <vincent.guittot@...aro.org>, kernel-team@...com,
        "linux-kernel@...r.kernel.org" <linux-kernel@...r.kernel.org>,
        "dietmar.eggeman" <dietmar.eggeman@....com>
Subject: Re: CFS flat runqueue proposal fixes/update

Hi Rik,

On 31/07/2020 09:42, Rik van Riel wrote:
> Hello,
> 
> last year at Linux Plumbers conference, I presented on my work
> of turning the hierarchical CFS runqueue into a flat runqueue,
> and Paul Turner pointed out some corner cases that could not
> work with my design as it was last year.
> 
> Paul pointed out two corner cases, and I have come up with a
> third one myself, but I believe they all revolve around the
> same issue, admission control, and can all be solved with the
> same solution.

[...]

>     Possible solution
> 
> A possible solution to the problems above consists of
> three parts:
> - Instead of accounting all used wall clock CPU time into
>   vruntime immediately, at the task's current hierarchical
>   priority, the vruntime can be accounted for piecemeal, for
>   example in amounts corresponding to "one timeslice at full
>   priority".
> - If there is only one runnable task on the runqueue,
>   all the runtime can be accounted into vruntime in one go.
> - Tasks that cannot account all of their used CPU time
>   into vruntime at once can be removed from the root runqueue,
>   and placed into the cgroup runqueue. A heap of cgroup
>   runqueues with "overloaded" tasks can be attached to the
>   main runqueue, where the left-most task from that heap of
>   heaps gets some vruntime accounted every time we go into
>   pick_next_task.
> - The difference between the vruntime of each task in that
>   heap and the vruntime of the root runqueue can help determine
>   how much vruntime can be accounted to that task at once.
> - If the task, or its runqueue, is no longer the left-most
>   in the heap after getting vruntime accounted, that runqueue
>   and the queue of runqueues can be resorted.
> - Once a task has accounted all of its outstanding delta
>   exec runtime into vruntime, it can be moved back to the
>   main runqueue.
> - This should solve the unequal task weight scenario Paul
>   Turner pointed out last year, because after task t1 goes
>   to sleep and only t2 and t3 remain on the CPU, t2 will
>   get its delta exec runtime converted into vruntime at
>   its new priority (equal to t3).
> - By only accounting delta exec runtime to vruntime for
>   the left-most task in the "overloaded" heap at one time,
>   we guarantee only one task at a time will be added back
>   into the root runqueue.
> - Every time a task is added to the root runqueue, that
>   slows down the rate at which vruntime advances, which
>   in turn reduces the rate at which tasks get added back
>   into the runqueue, and makes it more likely that a currently
>   running task with low hierarchical priority gets booted
>   off into the "overloaded" heap.
> 
> To tackle the thundering herd at task wakeup time, another
> strategy may be needed. One thing we may be able to do
> there is place tasks into the "overloaded" heap immediately
> on wakeup, if the hierarchical priority of the task is so
> low that if the task were to run a minimal timeslice length,
> it would be unable to account that time into its vruntime
> in one go, AND the CPU already has a larger number of tasks
> on it.
> 
> Because the common case on most systems is having just
> 0, 1, or 2 runnable tasks on a CPU, this fancy scheme
> should rarely end up being used, and even when it is the
> overhead should be reasonable because most of the
> overloaded tasks will just sit there until pick_next_task
> gets around to them.
> 
> Does this seem like something worth trying out?
> 
> Did I overlook any other corner cases that would make this
> approach unworkable?
> 
> Did I forget to explain anything that is needed to help
> understand the problem and the proposed solution better?

I imagine that I can see what you want to achieve here ;-)

But it's hard since your v5 RFC
https://lkml.kernel.org/r/20190906191237.27006-1-riel@surriel.com is
pretty old by now.

Do you have a version of the patch-set against tip/sched/core? Quite a
lot has changed (runnable load avg replaced by runnable avg, rq->load is
gone, CFS load balance rework).

IIRC, the 'CFS flat runqueue design' has the advantage to reduce the
overhead in taskgroup heavy environments like systemd. And I recall that
v5 doesn't cover CFS bandwidth control yet.

IMHO it would be extremely helpful to have a current patch-set to
discuss how these other problems can be covered by patches on top.