linux-kernel - Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

Message-ID: <20240514213402.GB295811@maniforge>
Date: Tue, 14 May 2024 16:34:02 -0500
From: David Vernet <void@...ifault.com>
To: Qais Yousef <qyousef@...alina.io>
Cc: Steven Rostedt <rostedt@...dmis.org>,
	Peter Zijlstra <peterz@...radead.org>, Tejun Heo <tj@...nel.org>,
	torvalds@...ux-foundation.org, mingo@...hat.com,
	juri.lelli@...hat.com, vincent.guittot@...aro.org,
	dietmar.eggemann@....com, bsegall@...gle.com, mgorman@...e.de,
	bristot@...hat.com, vschneid@...hat.com, ast@...nel.org,
	daniel@...earbox.net, andrii@...nel.org, martin.lau@...nel.org,
	joshdon@...gle.com, brho@...gle.com, pjt@...gle.com,
	derkling@...gle.com, haoluo@...gle.com, dvernet@...a.com,
	dschatzberg@...a.com, dskarlat@...cmu.edu, riel@...riel.com,
	changwoo@...lia.com, himadrics@...ia.fr, memxor@...il.com,
	andrea.righi@...onical.com, joel@...lfernandes.org,
	linux-kernel@...r.kernel.org, bpf@...r.kernel.org,
	kernel-team@...a.com
Subject: Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

On Tue, May 14, 2024 at 01:07:15AM +0100, Qais Yousef wrote:

[...]

> > > 
> > > How does this BPF muck translate into better quality patches for me?
> > 
> > Here's how we will be using it (we will likely be porting sched_ext to
> > ChromeOS regardless of its acceptance).
> > 
> > Doing testing of scheduler changes in the field is extremely time
> > consuming and complex. We tested EEVDF vs CFS by backporting EEVDF to
> > 5.15 (as that is the kernel version we are using on the chromebooks we
> > were testing on), and then we need to add a user space "switch" to
> > change the scheduler. Note, this also risks causing a bug in adding
> > these changes. Then we push the kernel out, and then start our
> > experiment that enables our feature to a small percentage, and slowly
> > increases the number of users until we have enough for a statistical
> > result.
> > 
> > What sched_ext would give us is an easy way to try different scheduling
> > algorithms and get feedback much quicker. Once we determine a solution
> > that improves things, we would then spend the time to implement it in
> > the scheduler, and yes, send it upstream.
> > 
> > To me, sched_ext should never be the final solution, but it can be
> > extremely useful in testing various changes quickly in the field. Which
> > to me would encourage more contributions.

Hello Qais,

[...]

> I really don't buy the rapid development aspect too. The scheduler was heavily

There are already several examples from users who have shown that rapid
development and experimentation are extremely useful. Imagine if you're
iterating on the scheduler to improve p99 frame rates on the Steam Deck, as
Changwoo described. It's much more efficient to be able to just tweak and load
a BPF scheduler (that is safe and can't crash the machine) to try some random
idea out than it is to:

1. Tweak and recompile the kernel
2. Reinstall the kernel on the Steam Deck
3. Reboot the Steam Deck
4. Reload a game and let caches rewarm
5. Measure FPS

You're talking about a 5 second compile job + 1 second to reload a safe BPF
scheduler vs. having to do all of the above steps _and_ potentially making a
mistake that brings the machine down. These benefits are also extremely useful
for testing workloads on production servers, etc. Let’s also not forget that
unlike many other kernel features, you probably can’t get reliable scheduling
results from running in a VM. The experimentation overhead is very real.

[...]

> influenced by the early contributors which come from server market that had
> (few) very specific workloads they needed to optimize for and throughput had
> a heavier weight vs latency. Fast forward to now, things are different. Even on
> server market latency/responsiveness has become more important. Power and
> thermal are important on a larger class of systems now too. I'd dare say even
> on server market. How do you know when it's okay for an app/task to consume too
> much power and when it is not? Hint hint, you can't unless someone in userspace
> tells you. Similarly for latency vs throughput. What is the correct way to
> write an application to provide this info? Then we can ask what is missing in
> the scheduler to enable this.

Hmm, you seem to be arguing that the way forward here is to have our one
general purpose scheduler be entirely driven by user space hinting. Assuming
I’m not misunderstanding you, I strongly disagree with this sentiment.  User
space hinting can be powerful, but I think we need to have a general purpose
scheduler that's completely agnostic to whatever is running in user space.
We’ve also been able to get strong results from sched_ext schedulers that don’t
use any user space hinting.

Also, even if this ended up being the way forward, I don’t see it being
practical to implement. Wouldn’t it require us to update all of user space
globally just to update how it interfaces with the scheduler?

[...]

> Note the original min/wakeup_granularity_ns, latency_ns etc were tuned by
> default for throughput by the way (server market bias). You can manipulate
> those and get better latencies.

Those knobs aren't available anymore in EEVDF.
 
[...]

> point IMO, not the scheduler algorithm. If the latter needs to change, it needs
> to be as the result of this friction - which is what EEVDF came about from, to
> my understanding: to make implementing a latency interface easier. But Vincent
> had a working implementation with CFS too which I think would have worked fine
> by the way.

This friction is nothing new. It's why we already find ourselves in the
unfortunate position of having a large corpus of out of tree scheduler patches.
If there is a lot of performance being left on the table, vendors are going to
find a way to get that performance. Corporations don't need our consent to ship
kernels with custom schedulers on their devices. They've already been doing it
for years, and it's ultimately the users who suffer.

I genuinely believe that the fair.c scheduler will benefit from being able to
apply ideas conceived in a sched_ext scheduler which end up working well for
general use cases. For example, in scx_rusty, we’re able to get very good
interactivity [0] by determining a task’s deadline as a function of its average
runtime (along with some other great ideas that Changwoo first added to
scx_lavd) rather than from its eligibility + slice, as EEVDF does.
Over the course of a day or two, I tried way more ideas that didn’t work than
would have been possible in that time frame with a recompile-reboot cycle,
and ended up finding one that seems to work very well. It would be awesome if
these ideas were added to EEVDF so that everyone can benefit.

[0]: https://drive.google.com/file/d/1fyHt9BYGha6apl7HAkibwpy52UTi8-AQ/view?usp=drive_link
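
As a rough illustration of the deadline idea above, here is a toy model in
Python. It is only a sketch and is not the scx_rusty or EEVDF code: the Task
fields, the reference weight of 1024, the 3/4 + 1/4 moving average, and all
function names are assumptions made up for this example. It contrasts a
deadline of "eligible time + weight-scaled slice" with one derived from the
task's average runtime.

```python
from dataclasses import dataclass

# Assumed reference weight for this toy model (hypothetical constant).
NICE_0_WEIGHT = 1024


@dataclass
class Task:
    name: str
    vruntime_ns: int                # virtual time at which the task is eligible
    weight: int = NICE_0_WEIGHT     # higher weight means a larger CPU share
    avg_runtime_ns: int = 0         # running average of runtime per wakeup


def scale_by_weight(delta_ns: int, weight: int) -> int:
    """Heavier tasks see a smaller virtual delta, so their deadlines come sooner."""
    return delta_ns * NICE_0_WEIGHT // weight


def update_avg_runtime(task: Task, ran_ns: int) -> None:
    """Moving average: 3/4 of the old value plus 1/4 of the new sample."""
    task.avg_runtime_ns = (3 * task.avg_runtime_ns + ran_ns) // 4


def deadline_from_slice(task: Task, slice_ns: int) -> int:
    """Slice-based deadline: eligible time plus the weight-scaled slice."""
    return task.vruntime_ns + scale_by_weight(slice_ns, task.weight)


def deadline_from_avg_runtime(task: Task) -> int:
    """Alternative sketch: the deadline grows with the task's average runtime."""
    return task.vruntime_ns + scale_by_weight(task.avg_runtime_ns, task.weight)


def pick_next(tasks, deadline_fn):
    """Earliest deadline first."""
    return min(tasks, key=deadline_fn)
```

In this model, two tasks with the same vruntime and the same slice get
identical slice-based deadlines, whereas with the average-runtime variant a
task that typically runs for 200us gets an earlier deadline than one that
typically runs for 10ms, so pick_next() favors the short-running task.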

Thanks,
David
