linux-kernel - Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

lists.openwall.net		lists / announce owl-users owl-dev john-users john-dev passwdqc-users yescrypt popa3d-users / oss-security kernel-hardening musl sabotage tlsify passwords / crypt-dev xvendor / Bugtraq Full-Disclosure linux-kernel linux-netdev linux-ext4 linux-hardening linux-cve-announce PHC
Open Source and information security mailing list archives

Hash Suite: Windows password security audit tool. GUI, reports in PDF.

[<prev] [next>] [<thread-prev] [thread-next>] [day] [month] [year] [list]

Message-ID: <798768ad5db073d36467a432352b968b01649898.camel@surriel.com>
Date: Mon, 06 May 2024 14:47:47 -0400
From: Rik van Riel <riel@...riel.com>
To: Peter Zijlstra <peterz@...radead.org>, Tejun Heo <tj@...nel.org>
Cc: torvalds@...ux-foundation.org, mingo@...hat.com, juri.lelli@...hat.com, 
 vincent.guittot@...aro.org, dietmar.eggemann@....com, rostedt@...dmis.org, 
 bsegall@...gle.com, mgorman@...e.de, bristot@...hat.com,
 vschneid@...hat.com,  ast@...nel.org, daniel@...earbox.net,
 andrii@...nel.org, martin.lau@...nel.org,  joshdon@...gle.com,
 brho@...gle.com, pjt@...gle.com, derkling@...gle.com,  haoluo@...gle.com,
 dvernet@...a.com, dschatzberg@...a.com, dskarlat@...cmu.edu, 
 changwoo@...lia.com, himadrics@...ia.fr, memxor@...il.com, 
 andrea.righi@...onical.com, joel@...lfernandes.org,
 linux-kernel@...r.kernel.org,  bpf@...r.kernel.org, kernel-team@...a.com
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

On Fri, 2024-05-03 at 10:52 +0200, Peter Zijlstra wrote:
> On Thu, May 02, 2024 at 09:20:15AM -1000, Tejun Heo wrote:
> > Hello, Peter.
> > 
> > On Thu, May 02, 2024 at 10:48:00AM +0200, Peter Zijlstra wrote:
> > > Can you please put your efforts and the touted Google
> > > collaboration in
> > > fixing the existing cgroup mess?
> > 
> > I suppose you're referring to Rik's flattened hierarchy patchset.
> > 
> >  
> > https://lore.kernel.org/all/20190822021740.15554-1-riel@surriel.com
> > 
> 
> You guys Google/Facebook got us the cgroup thing, Google did a lot of
> the work for cpu-cgroup, and now you Facebook say you can't live with
> it
> because it's too expensive. Yes Rik did put a lot of effort into it,
> but
> Google shot it down. What am I to do?

I believe the issues that Paul pointed out with my
flattened cgroup code are fixable. I ended up not
getting back to this code because it took me a few
months to think of ways to fix the issues Paul found,
and by then I had moved on to other projects.

For reference, Paul found these two (very real) issues
with my implementation.

1) Thundering herd problem. If many tasks in a low
   priority cgroup wake up at the same time, they can
   end up swamping a CPU.

   I believe this can be solved with the same idea
   I had for reimplementing CONFIG_CFS_BANDWIDTH.
   Specifically, the code that determines the time
   slice length for a task already has a way to
   determine whether a CPU is "overloaded", and
   time slices need to be shortened.  Once we reach
   that situation, we can place woken up tasks on
   a secondary heap of per-cgroup runqueues, from
   which we do not directly run tasks, but pick
   the lowest vruntime task from the lowest vruntime
   cgroup and put that on the main runqueue, if
   the previously running task has a vruntime that
   is higher than that of a task in the secondary
   group. If a task is woken up in a cgroup that
   already has tasks on that secondary queue, we
   wake up the task onto that secondary queue.

   This means on overloaded CPUs, we move back to
   a task selection mechanism closer to what we
   currently have, while in the non-overloaded
   situation we use a flat runqueue.

   This same scheme could be used to implement
   CFS bandwidth control. A task belonging to a
   throttled group would be placed on the group's
   queue, not the CPU's flat runqueue.

2) The vruntime for a task can be advanced by way
   to much at once. If we have tasks A & B running,
   and task B has a priority that is 1/100th of that
   of task A, its vruntime would be advanced 100x
   as much as task A, when running the same length
   time slice.

   This creates a big issue if we get a wakeup of
   task C, at the same priority as task B, and then
   task A goes to sleep. Due to the very far advanced
   runtime of task B, task C could get to monopolize
   the CPU for a considerable amount of time, and
   task B could get starved.

   A potential fix for this is to never account more
   than the maximum time slice length at a time, while
   any excess delta_exec time for the task gets remembered.

   At pick_next_entity time, the scheduler can see that
   task B has a lot of delta_exec time left, and account
   up to the maximum slice length to the task's vruntime,
   and place it back in the queue if the next task now has
   a lower vruntime.

   For a steady state of a high priority task A and a low
   priority task B, this makes pick_next_task more expensive,
   but when task A disappears and task C appears, CPU time
   will continue to be fair between them.

   Limiting the total weight of tasks on the flat runqueue,
   using the mechanism for thundering herd and CFS bandwidth
   outlined above, should keep this overhead bounded to
   something reasonable.

Does the above sound like it would work?

Does it sound like code that you would be ok with merging?

Is it a large enough improvement over the current hierarchical
runqueue that it would be worth doing?

This would be a fairly large project, so we should probably discuss
some of the details before investing too much time in it.

-- 
All Rights Reversed.